Prev: [PATCH] x86: Fix vsyscall on gcc 4.5 with -Os
Next: [PATCH 10/12] scsi: megaraid_sas - Add input parameter for max_sectors
From: Oleg Nesterov on 29 Jun 2010 09:10 On 06/28, Paul E. McKenney wrote: > > On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote: > > On 06/24, Paul E. McKenney wrote: > > > > > > So it is OK to skip some of the other threads in this case, even > > > though they were present throughout the whole procedure? > > > > I think, yes. We can miss them in any case, they can go away before > > while_each_thread(g, t) starts the scan. > > > > If g == group_leader (old or new), then we should notice this thread > > at least. > > > > Otherwise we can miss them all, with or without next_thread_careful(). > > Just to be sure that we are actually talking about the same scenario... > > Suppose that a task group is lead by 2908 and has member 2909, 2910, > 2911, and 2912. Suppose that 2910 does pthread_exit() just as some > other task is "ls"ing the relevant /proc entry. Is it really OK for > "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912 > were alive and kicking the entire time? Confused. Let's return to do printk("%d\n", t->pid); while_each_thread(g, t); for the moment. In that case, if g != 2910 (the exiting thread) we will print all pids, except we can miss 2910. With or without next_thread_careful(). Only if we start at g == 2910, then current code: print 2910, then spin forever printing other pids next_thread_careful: stop printing when we notice that 2910 was unhashed. So, yes, in this case we can miss all other threads. As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated, it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread() logic. Let's assume that proc_task_fill_cache() never fails. proc_task_readdir() always starts at the group_leader, 2908. So, with or without next_thread_careful() we can only miss the exiting 2910. But (again, unless I missed something) the current code can race with exec, and s/next_thread/next_thread_careful/ in first_tid() can fix the race. (just in case, we can fix it differently). But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task" you can miss _all_ threads if 2910 exits before proc_task_readdir() finds its leader, 2908. Again, this is with or without next_thread_careful(). Paul, please let me know if I misunderstood your concerns, or if I missed something. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Paul E. McKenney on 29 Jun 2010 12:10 On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote: > On 06/28, Paul E. McKenney wrote: > > > > On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote: > > > On 06/24, Paul E. McKenney wrote: > > > > > > > > So it is OK to skip some of the other threads in this case, even > > > > though they were present throughout the whole procedure? > > > > > > I think, yes. We can miss them in any case, they can go away before > > > while_each_thread(g, t) starts the scan. > > > > > > If g == group_leader (old or new), then we should notice this thread > > > at least. > > > > > > Otherwise we can miss them all, with or without next_thread_careful(). > > > > Just to be sure that we are actually talking about the same scenario... > > > > Suppose that a task group is lead by 2908 and has member 2909, 2910, > > 2911, and 2912. Suppose that 2910 does pthread_exit() just as some > > other task is "ls"ing the relevant /proc entry. Is it really OK for > > "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912 > > were alive and kicking the entire time? > > Confused. > > Let's return to > > do > printk("%d\n", t->pid); > while_each_thread(g, t); > > for the moment. > > In that case, if g != 2910 (the exiting thread) we will print all pids, > except we can miss 2910. With or without next_thread_careful(). > > Only if we start at g == 2910, then > > current code: print 2910, then spin forever printing > other pids > > next_thread_careful: stop printing when we notice that 2910 > was unhashed. > > So, yes, in this case we can miss all > other threads. > > As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated, > it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread() > logic. Let's assume that proc_task_fill_cache() never fails. > > proc_task_readdir() always starts at the group_leader, 2908. So, with or > without next_thread_careful() we can only miss the exiting 2910. > > But (again, unless I missed something) the current code can race with exec, > and s/next_thread/next_thread_careful/ in first_tid() can fix the race. > (just in case, we can fix it differently). > > But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task" > you can miss _all_ threads if 2910 exits before proc_task_readdir() finds > its leader, 2908. Again, this is with or without next_thread_careful(). > > > Paul, please let me know if I misunderstood your concerns, or if I missed > something. Thank you very much for laying this out completely! I was having a hard time believing that it was OK to miss threads in the "ls /proc/2910/task" case. But of course similar issues can arise when running "ls" on a directory with lots of files that are coming and going quickly in the meantime, I guess. And if proc_task_fill_cache() fails, we can miss tasks as well, correct? Given all this, I believe that your fix really does work. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Oleg Nesterov on 29 Jun 2010 14:00 On 06/29, Paul E. McKenney wrote: > > On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote: > > > > Paul, please let me know if I misunderstood your concerns, or if I missed > > something. > > Thank you very much for laying this out completely! I was having a hard > time believing that it was OK to miss threads in the "ls /proc/2910/task" > case. But of course similar issues can arise when running "ls" on a > directory with lots of files that are coming and going quickly in the > meantime, I guess. Yes. And again, even if 2910 is not the group leader and it is exiting, "ls /proc/2910/task" will work because proc_task_readdir() akways starts at 2910->group_leader == 2008. It doesn't work only if proc_task_readdir() can't find its leader, in this particular case this just means 2910 no longer exists, and thus /proc/2910/ is dead even if we can still find this dentry. > And if proc_task_fill_cache() fails, we can miss > tasks as well, correct? Well, yes and no. Sure, if proc_task_fill_cache() fails we didn't reported all threads. But if /bin/ls does readdir() again after that, proc_task_readdir() tries to contunue starting from the last-pid-we-failed-to-report. If there is no task with that pid, we start from the group_leader and skip the number-of-already-reported-threads. So, we have a lot of issues here, we can miss some thread because "skip the number-of-already-reported-threads" can't be really accurate. But, to clarify, this has almost nothing to do with the original problem. Afaics, if we change first_tid() to use next_thread_careful() instead of next_thread(), we close the pure-theoretical race with exec but that is all. (and I am still not sure this race does exist, and even if it does we can fix it without next_thread_careful). > Given all this, I believe that your fix really does work. Great. I'll send the patch once I inspect zap_threads() and current_is_single_threaded() to figure out which changes they need. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Roland McGrath on 8 Jul 2010 20:10 > We can't do call_rcu(release_task), > we can't take tasklist for writing in the softirq context. But even > if we could, this can't help in fact or I missed something. Ah, I had missed that constraint of call_rcu. (It's not mentioned in the kerneldoc comment directly, and is only buried in a couple of brief mentions in Documentation/RCU/.) I was thinking that would suffice because it does it between rcu_read_lock critical sections, but I guess it actually only does it after the last one and doesn't prevent a new one from having started on another CPU. (And we can't use it anyway.) Thanks, Roland -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Roland McGrath on 8 Jul 2010 20:50
> The reason for my "destroying the old" and "forming the new" is the > possibility of someone doing proc_task_readdir() when the group leader > does exec(), which causes all to die, and then the new process does > pthread_create(), forming a new thread group. Because proc_task_readdir() > neither holds a lock nor stays in an RCU read-side critical section > for the full /proc scan, the old group might really be destroyed from > the reader's point of view. I haven't tried to understand the /proc code. From the userland perspective, there is one thing called a "process" with a single ID that the kernel calls the TGID and everybody else calls the PID, and that container survives execs regardless of which of its threads do them. Listing /proc/TGID/task is the way userland (i.e. debuggers) enumerate all the threads (e.g. for attaching them all with ptrace). It's of course expected that threads will be coming and going, so userland expects to read it several times, until there were no new threads in the list after it attached all the ones from the last read (so it would know if those ones created any more). I can't quite tell but it sounds like you may be saying that this procedure won't work with rewinding the same fd. After an exec, that fd may point to a reaped former leader and yield no results. (Looking at the code now, it looks like readdir will fail with the unusual error ENOENT in that case, so userland could detect that case easily now.) To pick up the next iteration of that procedure correctly, you'd have to reopen /proc/TGID/task by name to get an fd associated with the new leader. That is the only thing I can think of that is meaningful in userland terms and might be what you mean by "destroying the old and forming the new". Is that it? But it also sounds like you may be saying that the lack of locking in proc_task_readdir means it could just punt out of its listing loop early at any time that the task it just looked at is reaped. Is that right? That is OK for userland if any given readdir call returns a short list--but not if it returns a premature EOF. It looks to me like that's possible too. If so, that is startling off hand, and breaks the userland expectation by the "least astonishment" principle. (That is, you can sometimes get a short listing showing a subset of threads that does not include some threads you previously saw as alive and are still alive.) It can also actually break the procedure I described above if one false EOF causes the reader to miss a new thread it hasn't seen before, so it thinks it has already stopped all the threads that are alive. I don't know of anything in userland using that procedure. But it's what I would have recommended if asked, before you brought this issue up. (strace -p does a single pass and is only intending to offer any guarantee if you've already finished stopping the thing with SIGSTOP first. gdb uses other means that amount to setting a breakpoint inside pthread_create before reading libpthread's data structures from user memory for the userland thread list without regard to the actual kernel thread situation.) I suppose we can just say that proc_task_readdir is entirely unreliable unless you're sure otherwise that threads are not being reaped while you read it, since that seems to be the truth of it. I would sure be happier as a userland programmer if the kernel gave something that I could rely on by some feasible race-noticing procedure akin to that above, but it's not the end of the world. Thanks, Roland -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |