From: Louis Rilling on 13 Jul 2010 17:50

Hi Eric,

On 12/07/10 11:09 -0700, Eric W. Biederman wrote:
> Fix zap_pid_ns_processes so that it successfully waits for all of
> the tasks in the pid namespace to be reaped, even if called for a
> non-leader task of the init process. This guarantees that no task
> can escape the pid namespace, and that it is safe for proc_flush_task
> to put the proc_mnt of the dead pid_namespace when pid 1 of the
> pid namespace calls proc_flush_task.
>
> Before zap_pid_ns_processes was fixed to wait for all zombies
> in the pid namespace to be reaped, the easiest way to generate
> a zombie that would escape the pid namespace was to attach
> a debugger to a process inside a pid namespace from outside the
> pid namespace and then exit the pid namespace.
>
> In the process of trying to fix this bug I have looked at a lot
> of different options and a lot of different directions we can
> go. There are several limiting factors.
>
> - We need to guarantee that the init process of a pid namespace
>   is not reaped before every other task in the pid namespace is
>   reaped. Wait succeeding on the init process of a pid namespace
>   gives the guarantee that all processes in the pid namespace
>   are dead and gone. Or more succinctly: it is not possible to
>   escape from a pid namespace.
>
>   The previous behaviour, where some zombies could escape the pid
>   namespace, violates the assumption made by some reapers of a
>   pid namespace init that all of the pid namespace cleanup has
>   completed by the time that init is reaped.
>
> - proc_flush_task needs to be called after each task is reaped.
>   Tasks are volatile, and applications like top and ps frequently
>   touch every thread group directory in /proc, which triggers
>   dcache entries to be created. If we don't remove those dcache
>   entries when tasks are reaped we can get a large build-up of
>   useless inodes and dentries. Shrink_slab is designed to flush
>   out useful cache entries, not useless ones, so while in the big
>   picture it doesn't hurt if we leak a few, if we leak a lot of
>   dentries we put unnatural pressure on the kernel's memory
>   management.
>
>   I sat down and attempted to measure the cost of calling
>   proc_flush_task with lat_tcp (from lmbench) and I get the same
>   range of latency readings whether or not proc_flush_task is
>   called. Which tells me the runtime cost of the existing
>   proc_flush_task is in the noise.
>
>   By running find /proc/ > /dev/null with proc_flush_task
>   disabled and then examining the counts in the slab cache
>   I managed to see us growing about 84 proc_inodes per
>   iteration, which is pretty horrific. With proc_flush_task
>   enabled I don't see steady growth in any of the slab caches.
>
> - Mounts of /proc need a reference to the pid namespace
>   that doesn't go away until /proc is unmounted. Without
>   holding a reference to the pid namespace that lasts until
>   a /proc is unmounted, it just isn't safe to look up and
>   display pids in a particular pid_namespace.
>
> - The pid_namespace needs to be able to access proc_mnt until
>   at least the last call of proc_flush_task for a pid_namespace.
>
>   Currently there is a circular reference between proc_mnt
>   and the pid_namespace that we break very carefully through
>   an interaction of zap_pid_ns_processes and proc_flush_task.
>   That clever interaction going wrong is what caused the oopses
>   that led us to realize we have a problem.
>
> Even if we fix the pid_namespace and the proc_mnt to have
> conjoined lifetimes and the oopses are fixed, we still have
> the problem that zombie processes can escape the pid namespace.
> Which appears to be a problem for people using pid_namespaces
> as inescapable process containers.
>
> - fork/exec/waitpid is a common kernel path and as such we need
>   to keep the runtime costs down. Which means as much as possible
>   we want to keep from adding code (especially expensive code)
>   into the fork/exec/waitpid code path.
>
> Changing zap_pid_ns_processes to fix the problem instead of
> changing the code elsewhere is one of the few solutions I have
> seen that does not increase the cost of the lat_proc test from
> lmbench.

This patch looks like it is working (only a small RCU issue, shown
below). I couldn't try it yet, though.

I must admit that I am using a similar back-off solution in Kerrighed
(not to solve the issue of proc_flush_task(), but for one of the
reasons that you stated above: we want to be sure that all tasks of
the namespace have been reaped), but I considered it too ugly to
propose it for Linux ;) That said, this is probably the least
intrusive solution we have seen yet.
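As an aside, the proc_inode growth measured above can be spot-checked
from userspace with a small helper along these lines. This is only a
rough sketch: it assumes the cache shows up as proc_inode_cache in
/proc/slabinfo, and reading that file typically requires root.

#include <stdio.h>
#include <string.h>

/*
 * Print the active object count of the proc inode slab cache.
 * Sample it before and after a few `find /proc > /dev/null` runs
 * to see whether the cache keeps growing.
 */
static long proc_inode_active_objs(void)
{
        FILE *f = fopen("/proc/slabinfo", "r");
        char line[512];
        long active = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                /* Per-line format: <name> <active_objs> <num_objs> ... */
                if (strncmp(line, "proc_inode_cache", 16) == 0) {
                        sscanf(line, "%*s %ld", &active);
                        break;
                }
        }
        fclose(f);
        return active;
}

int main(void)
{
        printf("proc_inode_cache active objects: %ld\n",
               proc_inode_active_objs());
        return 0;
}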
> Reported-by: Louis Rilling <Louis.Rilling(a)kerlabs.com>
> Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
> ---
>  kernel/pid_namespace.c |   50 ++++++++++++++++++++++++++++++++++++-----------
>  1 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a5aff94..aaf2ab0 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -139,16 +139,20 @@ void free_pid_ns(struct kref *kref)
>
>  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  {
> +        struct task_struct *me = current;
>          int nr;
>          int rc;
>          struct task_struct *task;
>
>          /*
> -         * The last thread in the cgroup-init thread group is terminating.
> -         * Find remaining pid_ts in the namespace, signal and wait for them
> -         * to exit.
> +         * The last task in the pid namespace-init threa group is terminating.

s/threa/thread/

> +         * Find remaining pids in the namespace, signal and wait for them
> +         * to to be reaped.
>           *
> -         * Note: This signals each threads in the namespace - even those that
> +         * By waiting for all of the tasks to be reaped before init is reaped
> +         * we provide the invariant that no task can escape the pid namespace.
> +         *
> +         * Note: This signals each task in the namespace - even those that
>           *       belong to the same thread group, To avoid this, we would have
>           *       to walk the entire tasklist looking a processes in this
>           *       namespace, but that could be unnecessarily expensive if the
> @@ -157,28 +161,50 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>           *
>           */
>          read_lock(&tasklist_lock);
> -        nr = next_pidmap(pid_ns, 1);
> -        while (nr > 0) {
> -                rcu_read_lock();
> +        for (nr = next_pidmap(pid_ns, 0); nr > 0; nr = next_pidmap(pid_ns, nr)) {
>
>                  /*
>                   * Any nested-container's init processes won't ignore the
>                   * SEND_SIG_NOINFO signal, see send_signal()->si_fromuser().
>                   */
> -                task = pid_task(find_vpid(nr), PIDTYPE_PID);
> -                if (task)
> +                rcu_read_lock();
> +                task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
> +                if (task && !same_thread_group(task, me))
>                          send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);
> -
>                  rcu_read_unlock();
> -
> -                nr = next_pidmap(pid_ns, nr);
>          }
>          read_unlock(&tasklist_lock);
>
> +        /* Nicely reap all of the remaining children in the namespac */

s/namespac/namespace/

>          do {
>                  clear_thread_flag(TIF_SIGPENDING);
>                  rc = sys_wait4(-1, NULL, __WALL, NULL);
>          } while (rc != -ECHILD);
> +
> +
> +repeat:
> +        /* Brute force wait for any remaining tasks to pass unhash_process
> +         * in release_task. Once a task has passed unhash_process there
> +         * is no pid_namespace state left and they can be safely ignored.
> +         */
> +        for (nr = next_pidmap(pid_ns, 1); nr > 0; nr = next_pidmap(pid_ns, nr)) {
> +
> +                /* Are there any tasks alive in this pid namespace */
> +                rcu_read_lock();
> +                task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
> +                rcu_read_unlock();
> +                if (task && !same_thread_group(task, me)) {

rcu_read_unlock() should not be called before here, or task may be freed
before calling same_thread_group().

> +                        clear_thread_flag(TIF_SIGPENDING);
> +                        schedule_timeout_interruptible(HZ/10);
> +                        goto repeat;
> +                }
> +        }
> +        /* At this point there are at most two tasks in the pid namespace.
> +         * These tasks are our current task, and if we aren't pid 1 the zombie
> +         * of pid 1. In either case pid 1 will be the final task reaped in this
> +         * pid namespace, as non-leader threads are self reaping and leaders
> +         * cannot be reaped until all of their siblings have been reaped.
> +         */
>
>          acct_exit_ns(pid_ns);
>          return;
> --
> 1.6.5.2.143.g8cc62

Thanks,
Louis

--
Dr Louis Rilling                        Kerlabs
Skype: louis.rilling                    Batiment Germanium
Phone: (+33|0) 6 80 89 08 23            80 avenue des Buttes de Coesmes
http://www.kerlabs.com/                 35700 Rennes
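To make Louis's locking point concrete, here is a minimal sketch of the
back-off loop with the RCU read-side critical section extended to cover
the same_thread_group() test, so the task_struct cannot be freed between
the lookup and the check. Illustrative only; this is not the patch as
posted.

repeat:
        for (nr = next_pidmap(pid_ns, 1); nr > 0; nr = next_pidmap(pid_ns, nr)) {
                bool alive;

                /* Hold the RCU read lock across every dereference of task. */
                rcu_read_lock();
                task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
                alive = task && !same_thread_group(task, me);
                rcu_read_unlock();

                if (alive) {
                        clear_thread_flag(TIF_SIGPENDING);
                        schedule_timeout_interruptible(HZ / 10);
                        goto repeat;
                }
        }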
From: Sukadev Bhattiprolu on 14 Jul 2010 16:50

Eric W. Biederman [ebiederm(a)xmission.com] wrote:
|
| Changing zap_pid_ns_processes to fix the problem instead of
| changing the code elsewhere is one of the few solutions I have
| seen that does not increase the cost of the lat_proc test from
| lmbench.

I think it's a good fix for the problem, but I have a nit and a minor
comment below.

Thanks,

Sukadev

|
| Reported-by: Louis Rilling <Louis.Rilling(a)kerlabs.com>

Reviewed-by: Sukadev Bhattiprolu <sukadev(a)linux.vnet.ibm.com>

| Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
| ---
|  kernel/pid_namespace.c |   50 ++++++++++++++++++++++++++++++++++++-----------
|  1 files changed, 38 insertions(+), 12 deletions(-)
|
| diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
| index a5aff94..aaf2ab0 100644
| --- a/kernel/pid_namespace.c
| +++ b/kernel/pid_namespace.c
| @@ -139,16 +139,20 @@ void free_pid_ns(struct kref *kref)
|
|  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
|  {
| +        struct task_struct *me = current;
|          int nr;
|          int rc;
|          struct task_struct *task;
|
|          /*
| -         * The last thread in the cgroup-init thread group is terminating.
| -         * Find remaining pid_ts in the namespace, signal and wait for them
| -         * to exit.
| +         * The last task in the pid namespace-init threa group is terminating.

nit: thread

| +         * Find remaining pids in the namespace, signal and wait for them
| +         * to to be reaped.
|           *
| -         * Note: This signals each threads in the namespace - even those that
| +         * By waiting for all of the tasks to be reaped before init is reaped
| +         * we provide the invariant that no task can escape the pid namespace.
| +         *
| +         * Note: This signals each task in the namespace - even those that
|           *       belong to the same thread group, To avoid this, we would have
|           *       to walk the entire tasklist looking a processes in this
|           *       namespace, but that could be unnecessarily expensive if the
| @@ -157,28 +161,50 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
|           *
|           */
|          read_lock(&tasklist_lock);
| -        nr = next_pidmap(pid_ns, 1);
| -        while (nr > 0) {
| -                rcu_read_lock();
| +        for (nr = next_pidmap(pid_ns, 0); nr > 0; nr = next_pidmap(pid_ns, nr)) {

Is it necessary to start the search at nr == 0? We will find nr == 1
first and then immediately skip over it - because same_thread_group()
will be TRUE.

|
|          /*
|           * Any nested-container's init processes won't ignore the
|           * SEND_SIG_NOINFO signal, see send_signal()->si_fromuser().
|           */
| -                task = pid_task(find_vpid(nr), PIDTYPE_PID);
| -                if (task)
| +                rcu_read_lock();
| +                task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
| +                if (task && !same_thread_group(task, me))
|                          send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);

Also, if we start the search at 1, we could skip the only other possible
thread in the group with (nr != my_pid_nr), but it's not really an
optimization.
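A sketch of the variant Sukadev's nit points at (hypothetical, not what
was posted): starting the pidmap scan at 1 means next_pidmap() never
returns the namespace's pid 1, so the same_thread_group() test only has
to filter out the caller's own sibling threads.

        read_lock(&tasklist_lock);
        /* Scan from 1: pid 1 (our thread group leader) is never returned. */
        for (nr = next_pidmap(pid_ns, 1); nr > 0; nr = next_pidmap(pid_ns, nr)) {
                rcu_read_lock();
                task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
                /* Still needed: skip other threads in our own group. */
                if (task && !same_thread_group(task, me))
                        send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);
                rcu_read_unlock();
        }
        read_unlock(&tasklist_lock);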