Prev: [PATCH] vfs rerepost: fix RCU-lockdep false positive due to /proc
Next: Boston Linux Power Management Mini-summit - August 9th, 2010
From: Oleg Nesterov on 24 Jun 2010 15:30 On 06/24, Louis Rilling wrote: > > [ Resending because of buggy quotes in Eric's address. Sorry for the noise. ] > > On 06/19, Oleg Nesterov wrote: > > And the last one on top of this series, before I go away from this > > thread ;) > > > > Since my further fixes were not approved, I think at least it makes > > sense to cleanup pid_ns_release_proc() logic a bit. > > It's completely untested and could be split into three patches. But I think that > it solves the issues found so far, and that it will work with Eric's > unshare(CLONE_NEWPID) too. > > What do you think about this approach? Oh. I shouldn't continue to participate in this discussion... I don't have the time and my previous patch proves that my patches should be ignored ;) But, looking at this patch, > - defer pid_ns_release_proc()->mntput() to a worqueue context, so that > pid_ns_release_proc() can be called in atomic context; OK, not good but this is what I did too, > - introduce pid_ns->nr_pids, so that we can count the number of pids > allocated by alloc_pidmap(); and this adds the extra code to alloc/free pidmap. > - move the call to pid_ns_prepare_proc() to alloc_pid(), where we know > when the first pid of a namespace is allocated; This is what I personally dislike. I do not think pid_ns_prepare_proc() should depend on the fact that the first pid was already allocated. And, this subjective, yes, but it looks a bit strange that upid->nr has a reference to proc_mnt. And of course, imho it would nice to not create the circular reference we currently have. Louis, Eric. I am attaching my 2 patches (on top of cleanups) again. Could you take a look? Changes: - pid_ns_release_proc() nullifies sb->s_fs_info - proc_pid_lookup() and proc_self_readlink() check ns != NULL (this is sb->s_fs_info) I even tried to test this finally, seems to work. I am not going to argue if you prefer Louis's approach. But I will appreciate if you explain why my fix is wrong. I am curious because I spent 3 hours doing grep fs/proc ;) Oleg.
From: Louis Rilling on 25 Jun 2010 06:30 On 24/06/10 21:18 +0200, Oleg Nesterov wrote: > On 06/24, Louis Rilling wrote: > > > > [ Resending because of buggy quotes in Eric's address. Sorry for the noise. ] > > > > On 06/19, Oleg Nesterov wrote: > > > And the last one on top of this series, before I go away from this > > > thread ;) > > > > > > Since my further fixes were not approved, I think at least it makes > > > sense to cleanup pid_ns_release_proc() logic a bit. > > > > It's completely untested and could be split into three patches. But I think that > > it solves the issues found so far, and that it will work with Eric's > > unshare(CLONE_NEWPID) too. > > > > What do you think about this approach? > > Oh. I shouldn't continue to participate in this discussion... I don't have > the time and my previous patch proves that my patches should be ignored ;) Oleg, I hope that you realize that we all appreciate your efforts to solve those issues. > > But, looking at this patch, > > > - defer pid_ns_release_proc()->mntput() to a worqueue context, so that > > pid_ns_release_proc() can be called in atomic context; > > OK, not good but this is what I did too, > > > - introduce pid_ns->nr_pids, so that we can count the number of pids > > allocated by alloc_pidmap(); > > and this adds the extra code to alloc/free pidmap. Hopefully this is not much in alloc_pidmap() since we can expect that nr_pids and last_pid are in the same cache line. > > > - move the call to pid_ns_prepare_proc() to alloc_pid(), where we know > > when the first pid of a namespace is allocated; > > This is what I personally dislike. I do not think pid_ns_prepare_proc() > should depend on the fact that the first pid was already allocated. > > And, this subjective, yes, but it looks a bit strange that upid->nr > has a reference to proc_mnt. I presume that you wanted to say upid->ns. > > And of course, imho it would nice to not create the circular reference > we currently have. > > > Louis, Eric. > > I am attaching my 2 patches (on top of cleanups) again. Could you take > a look? > > Changes: > > - pid_ns_release_proc() nullifies sb->s_fs_info > > - proc_pid_lookup() and proc_self_readlink() check ns != NULL > (this is sb->s_fs_info) > > I even tried to test this finally, seems to work. > > I am not going to argue if you prefer Louis's approach. But I will appreciate > if you explain why my fix is wrong. I am curious because I spent 3 hours doing > grep fs/proc ;) > > Oleg. > [PATCH 1/2] pid_ns: move destroy_pid_namespace() into workqueue context > > A separate patch to simplify the review of the next change. > > Move destroy_pid_namespace() into workqueue context. This allows us to do > mntput() from free_pid_ns() paths, see the next patch. > > Add the new member, "struct work_struct destroy" into struct pid_namespace > and change free_pid_ns() to call destroy_pid_namespace() via schedule_work(). > > The patch looks a bit complicated because it also moves copy_pid_ns() up. This patch itself does not look wrong to me. [...] > [PATCH 2/2] pid_ns: refactor the buggy pid_ns_release_proc() logic > > pid_namespace holds ns->proc_mnt, while this vfsmount has a referene to > the namespace via PROC_I(sb->s_root->d_inode)->pid. To break this circle > /sbin/init does mntput() in pid_ns_release_proc(). See 6f4e6433. > > But we have the following problems: > > - Nobody does mntput() if copy_process() fails after > pid_ns_prepare_proc(). > > - proc_flush_task() checks upid->nr == 1 to verify we are init, > this is wrong if a multi-threaded init does exec. > > - As Louis pointed out, this namespace can have the detached > EXIT_DEAD tasks which can use ns->proc_mnt after this mntput(). > > With this patch only pid_namespace has a reference to ns->proc_mnt, and > mntput(ns->proc_mnt) is called by destroy_pid_namespace() paths when we > know that this ns must not have any references (in particular, there are > no pids in this namespace). > > Changes: > > - kill proc_flush_task()->pid_ns_release_proc() > > - change fs/proc/root.c so that we don't create the "artificial" > references to the namespace or its pid==1. > > - change destroy_pid_namespace() to call pid_ns_release_proc(). > > - change pid_ns_release_proc() to clear s_root->d_inode->pid and > sb->s_fs_info. The caller is destroy_pid_namespace(), both pid > and ns must not have any reference. > > - change proc_self_readlink() and proc_pid_lookup() to check > sb->s_fs_info != NULL to detect the case when the proc fs is > kept mounted after pid_ns_release_proc(). This last point is what made me worry about your approach so far, although I did not take time to spot the precise issues. Unfortunately I don't see what the checks you added in proc_self_readlink(), proc_self_follow_link() and proc_pid_lookup() buy. What does prevent destroy_pid_namespace() from running concurrently? Maybe RCU could help in those cases? Moreover, I think that proc_pid_readdir() should get some check too. So what make me really worry about this approach is that it looks fragile. We should find all locations where procfs expects that sb->s_fs_info is a valid pid namespace, even if no more pids live in this namespace. I presume that this is what you did, but this needs double-check at least. Grepping for s_fs_info in fs/proc gives me: - proc_test_super(), proc_set_super(): Ok because tasks only mount procfs of their active pid namespace. - proc_kill_sb(): Your patch makes it ok. - proc_single_show(): Ok because a pid must be alive to get here. - proc_self_readlink(), proc_self_follow_link(), proc_pid_lookup(): Your patch could make them ok, provided that we protect against destroy_pid_namespace()->kmem_cache_free(). - proc_pid_readdir(): Needs similar check and protection to proc_pid_lookup(), but there is another issue: next_tgid() can find a dying task: next_tgid() finds a task task dies last reference to ns is dropped destroy_pid_namespace() proc_pid_readdir() ->proc_pid_fill_cache() ->proc_fill_cache() ->proc_pid_instantiate() ->proc_pid_make_inode() ->new_inode() ->alloc_inode() ->kmem_cache_alloc(, GFP_KERNEL) blocks So RCU would not be sufficient to protect proc_pid_readdir(). We could make pid_ns_release_proc() lock root inode's mutex before clearing s_fs_info: void pid_ns_release_proc(struct pid_namespace *ns) { struct inode *root_inode; if (ns->proc_mnt) { root_inode = ns->proc_mnt->mnt_sb->s_root->d_inode; mutex_lock(&root_inode->i_mutex); ns->proc_mnt->mnt_sb->s_fs_info = NULL; PROC_I(root_inode)->pid = NULL; mutex_unlock(&root_inode->i_mutex); mntput(ns->proc_mnt); } } This would also solve the issue for proc_pid_lookup() btw. - proc_task_lookup(), proc_task_readdir(): Ok, because a pid is pinned by dir. However, I don't think that I'm trustable enough to validate all of this. Waiting for Eric's comments at least. Again, thanks a lot Oleg! Louis > > Reported-by: Louis Rilling <louis.rilling(a)kerlabs.com> > Signed-off-by: Oleg Nesterov <oleg(a)redhat.com> > --- > > kernel/pid_namespace.c | 2 ++ > fs/proc/base.c | 20 ++++++++++++-------- > fs/proc/root.c | 11 +++++++---- > 3 files changed, 21 insertions(+), 12 deletions(-) > > --- 35-rc3/kernel/pid_namespace.c~PNS_6_BREAK_CIRCLE 2010-06-24 15:17:46.000000000 +0200 > +++ 35-rc3/kernel/pid_namespace.c 2010-06-24 20:48:18.000000000 +0200 > @@ -128,6 +128,8 @@ static void destroy_pid_namespace(struct > { > int i; > > + pid_ns_release_proc(ns); > + > for (i = 0; i < PIDMAP_ENTRIES; i++) > kfree(ns->pidmap[i].page); > kmem_cache_free(pid_ns_cachep, ns); > --- 35-rc3/fs/proc/base.c~PNS_6_BREAK_CIRCLE 2010-06-24 15:16:21.000000000 +0200 > +++ 35-rc3/fs/proc/base.c 2010-06-24 20:48:18.000000000 +0200 > @@ -2349,11 +2349,17 @@ static const struct file_operations proc > /* > * /proc/self: > */ > + > +static inline pid_t self_tgid(struct dentry *dentry) > +{ > + struct pid_namespace *ns = dentry->d_sb->s_fs_info; > + return ns ? task_tgid_nr_ns(current, ns) : 0; > +} > + > static int proc_self_readlink(struct dentry *dentry, char __user *buffer, > int buflen) > { > - struct pid_namespace *ns = dentry->d_sb->s_fs_info; > - pid_t tgid = task_tgid_nr_ns(current, ns); > + pid_t tgid = self_tgid(dentry); > char tmp[PROC_NUMBUF]; > if (!tgid) > return -ENOENT; > @@ -2363,8 +2369,7 @@ static int proc_self_readlink(struct den > > static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd) > { > - struct pid_namespace *ns = dentry->d_sb->s_fs_info; > - pid_t tgid = task_tgid_nr_ns(current, ns); > + pid_t tgid = self_tgid(dentry); > char *name = ERR_PTR(-ENOENT); > if (tgid) { > name = __getname(); > @@ -2745,10 +2750,6 @@ void proc_flush_task(struct task_struct > proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr, > tgid->numbers[i].nr); > } > - > - upid = &pid->numbers[pid->level]; > - if (upid->nr == 1) > - pid_ns_release_proc(upid->ns); > } > > static struct dentry *proc_pid_instantiate(struct inode *dir, > @@ -2796,6 +2797,9 @@ struct dentry *proc_pid_lookup(struct in > goto out; > > ns = dentry->d_sb->s_fs_info; > + if (!ns) > + goto out; > + > rcu_read_lock(); > task = find_task_by_pid_ns(tgid, ns); > if (task) > --- 35-rc3/fs/proc/root.c~PNS_6_BREAK_CIRCLE 2010-06-23 22:06:01.000000000 +0200 > +++ 35-rc3/fs/proc/root.c 2010-06-24 20:48:18.000000000 +0200 > @@ -31,7 +31,7 @@ static int proc_set_super(struct super_b > struct pid_namespace *ns; > > ns = (struct pid_namespace *)data; > - sb->s_fs_info = get_pid_ns(ns); > + sb->s_fs_info = ns; > return set_anon_super(sb, NULL); > } > > @@ -68,7 +68,7 @@ static int proc_get_sb(struct file_syste > struct proc_inode *ei = PROC_I(sb->s_root->d_inode); > if (!ei->pid) { > rcu_read_lock(); > - ei->pid = get_pid(find_pid_ns(1, ns)); > + ei->pid = find_pid_ns(1, ns); > rcu_read_unlock(); > } > } > @@ -83,7 +83,6 @@ static void proc_kill_sb(struct super_bl > > ns = (struct pid_namespace *)sb->s_fs_info; > kill_anon_super(sb); > - put_pid_ns(ns); And there is now no need to read sb->s_fs_info. > } > > static struct file_system_type proc_fs_type = { > @@ -209,5 +208,9 @@ int pid_ns_prepare_proc(struct pid_names > > void pid_ns_release_proc(struct pid_namespace *ns) > { > - mntput(ns->proc_mnt); > + if (ns->proc_mnt) { > + ns->proc_mnt->mnt_sb->s_fs_info = NULL; > + PROC_I(ns->proc_mnt->mnt_sb->s_root->d_inode)->pid = NULL; > + mntput(ns->proc_mnt); > + } > } -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
From: Oleg Nesterov on 25 Jun 2010 08:30 On 06/25, Louis Rilling wrote: > > On 24/06/10 21:18 +0200, Oleg Nesterov wrote: > > > > and this adds the extra code to alloc/free pidmap. > > Hopefully this is not much in alloc_pidmap() since we can expect that nr_pids > and last_pid are in the same cache line. This also adds atomic op. But I mostly dislike the pid-ns-specific complications itself (including the mount-after-the-first-alloc_pid dependancy), not the minor perfomance penalty. But see below... > > And, this subjective, yes, but it looks a bit strange that upid->nr > > has a reference to proc_mnt. > > I presume that you wanted to say upid->ns. I meant ns->nr_pids ;) > This last point is what made me worry about your approach so far, although I did > not take time to spot the precise issues. Unfortunately I don't see what the > checks you added in proc_self_readlink(), proc_self_follow_link() and > proc_pid_lookup() buy. What does prevent destroy_pid_namespace() from running > concurrently? Maybe RCU could help in those cases? Very good point. And the strong argument against this approach. > Moreover, I think that proc_pid_readdir() should get some check too. Well, it checks ->reaper != NULL, that is why I don't verify ns. But yes, we have the same race (races) you pointed out here. > void pid_ns_release_proc(struct pid_namespace *ns) > { > struct inode *root_inode; > > if (ns->proc_mnt) { > root_inode = ns->proc_mnt->mnt_sb->s_root->d_inode; > > mutex_lock(&root_inode->i_mutex); > ns->proc_mnt->mnt_sb->s_fs_info = NULL; > PROC_I(root_inode)->pid = NULL; > mutex_unlock(&root_inode->i_mutex); > > mntput(ns->proc_mnt); > } > } > > This would also solve the issue for proc_pid_lookup() btw. Looks like you are right, but this doesn't help proc_self_readlink(). I think we can fix all these problems, but I no longer think this approach can pretend to simplify the code. No, it will make the code more complex/ugly and potentially more buggy. Louis, thank you very much. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Sukadev Bhattiprolu on 25 Jun 2010 14:30 Louis Rilling [Louis.Rilling(a)kerlabs.com] wrote: | - proc_pid_readdir(): | Needs similar check and protection to proc_pid_lookup(), but there is another | issue: next_tgid() can find a dying task: Hmm, I thought proc_pid_readdir() would be a problem too but convinced myself that it would not - since a process running proc_pid_readdir() would have a reference to the pid namespace, in which case destroy_pid_ns() would not be called. | | next_tgid() finds a task | task dies | last reference to ns is dropped | destroy_pid_namespace() caller of next_tgid() holds a ref to pid-ns right ? Sukadev -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Oleg Nesterov on 25 Jun 2010 15:40
On 06/25, Sukadev Bhattiprolu wrote: > > Louis Rilling [Louis.Rilling(a)kerlabs.com] wrote: > | - proc_pid_readdir(): > | Needs similar check and protection to proc_pid_lookup(), but there is another > | issue: next_tgid() can find a dying task: > > Hmm, I thought proc_pid_readdir() would be a problem too but convinced myself > that it would not - since a process running proc_pid_readdir() would have > a reference to the pid namespace, Where does this reference comes from ? proc_pid_readdir() pins the task_struct (ns->child_reaper), not the pid/ns. But I won't be surprised if I am wrong again ;) Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |