CRED: Fix __task_cred()'s lockdep check and banner comment [Kernel]

Prev: Fix spelling contorller -> controller in comments
Next: Documentation update broken web addresses.

From: Oleg Nesterov on 4 Aug 2010 09:30

On 08/03, Linus Torvalds wrote:
>
> On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells(a)redhat.com> wrote:
> >
> > A previous patch:
> >
> > � � � �commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> > � � � �Author: David Howells <dhowells(a)redhat.com>
> > � � � �Date: � Thu Jul 29 12:45:55 2010 +0100
> > � � � �Subject: CRED: Fix __task_cred()'s lockdep check and banner comment

I am not sure I understand this patch.

__task_cred() checks rcu_read_lock_held() || task_is_dead(), and
task_is_dead(task) is ((task)->exit_state != 0).

OK, task_is_dead() is valid for, say, wait_task_zombie(). But
wait_task_stopped() calls __task_cred(p) without rcu lock and p is alive.
The code is correct, this thread can do nothing until we drop ->siglock.

> > fixed the lockdep checks on __task_cred(). �This has shown up a place in the
> > signalling code where a lock should be held - namely that
> > check_kill_permission() requires its callers to hold the RCU lock.
>
> It's not just check_kill_permission(), is it? I thought we could do
> the "for_each_process()" loops with just RCU, rather than holding the
> whole tasklist_lock?

Yes, for_each_process() is rcu-safe by itself.

> So I _think_ that getting the RCU read-lock would
> make it possible to get rid of the tasklist_lock in there too? At
> least in kill_something_info().

As for kill_something_info(), I think yes. I even sent (iirc) the
protoptype patch a long ago. We can't just remove tasklist, we should
avoid the races fork/exit/exec in the kill(-1, SIG) case.

The same for kill_pgrp/__kill_pgrp_info(). We need tasklist to ensure
that nobody in this group can escape the signal. This seems solveable
too, it was even discussed a bit.

> > It's may be that it would be better to add RCU read lock calls in
> > group_send_sig_info() only, around the call to check_kill_permission().

I must admit, at first glance changing check_kill_permission() to take
rcu lock looks better to me.

> On the
> > other hand, some of the callers are either holding the RCU read lock already,
> > or have disabled interrupts,

Hmm. So, local_irq_disable() "officially" blocks rcu? It does in practice
(unless I missed the new version of RCU), but, say, posix_timer_event()
takes rcu_read_lock() exactly because I thought we shouldn't assume that
irqs_disabled() acts as rcu_read_lock() ?

There are other examples of rcu_read_lock() under local_irq_disable().

> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -773,6 +773,7 @@ static void forget_original_parent(struct task_struct *father)
> >
> > � � � �exit_ptrace(father);
> >
> > + � � � rcu_read_lock();
> > � � � �write_lock_irq(&tasklist_lock);
> > � � � �reaper = find_new_reaper(father);

No, this doesn't look right. find_new_reaper() can drop tasklist and sleep.

Besides, this patch conflicts with the change in -mm tree. And imho this
looks a bit as "action at a distance".

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Oleg Nesterov on 4 Aug 2010 11:20

On 08/04, David Howells wrote:
>
> Oleg Nesterov <oleg(a)redhat.com> wrote:
>
> > On 08/03, Linus Torvalds wrote:
> > >
> > > On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells(a)redhat.com> wrote:
> > > >
> > > > A previous patch:
> > > >
> > > > � � � �commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> > > > � � � �Author: David Howells <dhowells(a)redhat.com>
> > > > � � � �Date: � Thu Jul 29 12:45:55 2010 +0100
> > > > � � � �Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
> >
> > I am not sure I understand this patch.
>
> You are talking about the 'previous patch'?
>
> > __task_cred() checks rcu_read_lock_held() || task_is_dead(), and
> > task_is_dead(task) is ((task)->exit_state != 0).
> >
> > OK, task_is_dead() is valid for, say, wait_task_zombie(). But
> > wait_task_stopped() calls __task_cred(p) without rcu lock and p is alive.
> > The code is correct, this thread can do nothing until we drop ->siglock.
>
> The problem is that we have to tell lockdep this. Just checking in
> __task_cred() that siglock is held is insufficient. That doesn't handle, say,
> sys_setuid() from changing the credentials, and effectively skips the check in
> places where it mustn't.
>
> Similarly, having interrupts disabled on the CPU we're running on doesn't help
> either, since it doesn't stop another CPU replacing those credentials.
>
> There are ways of dealing with wait_task_stopped():
>
> (1) Place an rcu_read_lock()'d section around the call to __task_cred().

Sure, this solves the problem. But probably this needs a comment to
explain why do we take rcu lock.

OTOH, wait_task_continued() does need rcu_read_lock(), the task is running.

UNLESS we believe that local_irq_disable() makes rcu_read_lock() unnecessary,
see below.

> (2) Make __task_cred()'s lockdep understand about the target task being
> stopped whilst we hold its siglock.

May be... but we have so many special cases. Say, fill_psinfo()->__task_cred().
This is called under rcu lock, but it is not needed. The task is either
current or it sleeps in exit_mm().

I mean, perhaps it is better to either always require rcu_read_lock()
around __task_cred() even if it is not needed, or do not use
rcu_dereference_check() at all.

In any case, task_is_dead() doesn't help afaics, it is only useful for
wait_task_zombie().

> > I must admit, at first glance changing check_kill_permission() to take
> > rcu lock looks better to me.
>
> I think group_send_sig_info() would be better. The only other caller of
> c_k_p() already has to hold the RCU read lock for other reasons.
>
> How about the attached patch then?

Agreed, the patch looks fine to me.

> > > > On the other hand, some of the callers are either holding the RCU read
> > > > lock already, or have disabled interrupts,
> >
> > Hmm. So, local_irq_disable() "officially" blocks rcu? It does in practice
> > (unless I missed the new version of RCU), but, say, posix_timer_event()
> > takes rcu_read_lock() exactly because I thought we shouldn't assume that
> > irqs_disabled() acts as rcu_read_lock() ?
>
> This CPU can't be preempted if it can't be interrupted, I think.

Yes, please note "It does in practice" above.

My question is, should/can we rely on this fact? Or should we assume
that nothing except rcu_read_lock() implies rcu_read_lock() ?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Oleg Nesterov on 4 Aug 2010 11:50

On 08/04, David Howells wrote:
>
> Oleg Nesterov <oleg(a)redhat.com> wrote:
>
> > > > I must admit, at first glance changing check_kill_permission() to take
> > > > rcu lock looks better to me.
> > >
> > > I think group_send_sig_info() would be better. The only other caller of
> > > c_k_p() already has to hold the RCU read lock for other reasons.
> > >
> > > How about the attached patch then?
> >
> > Agreed, the patch looks fine to me.
>
> Can I take that as an Acked-by?

Yes, thanks, please feel free to add

Acked-by: Oleg Nesterov <oleg(a)redhat.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Oleg Nesterov on 5 Aug 2010 14:30

On 08/05, Linus Torvalds wrote:
>
> On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
> <ebiederm(a)xmission.com> wrote:
> >
> > No. �When we send a signal to multiple processes it needs to be an
> > atomic operation so that kill -KILL -pgrp won't let processes escape.
> > It is what posix specifies, it is what real programs expect, and it
> > is the useful semantic in userspace.
>
> Ok. However, in that case, it's not really about the whole list
> traversal, it's a totally separate thing, and it's really sad that we
> end up using the (rather hot) tasklist_lock for something like that.
> With the dcache/inode locks basically going away, I think
> tasklist_lock ends up being one of the few hot locks left.
>
> Wouldn't it be much nicer to:
> - make it clear that all the "real" signal locking can rely on RCU
> - use a separate per-pgrp lock that ends up being the one that gives
> the signal _semantic_ meaning?
>
> That would automatically document why we get the lock too, which
> certainly isn't clear from the code as-is.
>
> The per-pgrp lock might be something as simple as a silly hash that
> just spreads out the process groups over some random number of simple
> spinlocks.

I still think we can avoid the new lock and rely on RCU in
kill_something_info() and kill_pgrp().

I am attaching my old email below. It talks about pgrp, however I
think kill_something_info() is almost the same thing.

Oleg.

On 12/07, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg(a)redhat.com> writes:
>
> > On 12/05, Eric W. Biederman wrote:
> >>
> >> Atomically sending signal to every member of a process group, is the
> >> big fly in the ointment I am aware of. Last time I looked I could
> >> not see how to convert it rcu.
> >
> > I am not sure, but iirc we can do this lockless (under rcu_lock).
> > We need to modify pid_link to use list_entry and attach_pid() should
> > add the new task to the end. Of course we need more changes, but
> > (again iirc) this is not too hard.
>
> The problem is that even adding to the end of the list, we could run
> into a deleted entry and not see the new end of the list.
>
> Suppose when we start iterating the list we have:
>
> A -> B -> C -> D
>
> Then someone deletes some of the entries while we are iterating the list.
>
> A ->
> B' -> C' -> D'
>
> We will continue on traversing through the deleted entries.
>
> Then someone adds a new entry to the end of the list.
>
> A-> N
>
> Since we are at B', C' or D' we will never see the new entry on the
> end of the list.

Yes, but who can add the new entry?

Let's forget about setpgrp/etc for the moment, I think we have "races"
with or without tasklist. Say, setpgrp() can add the new process to the
already "killed" pgrp.

Then, I think the only important case is SIGKILL/SIGSTOP (or other
signals which can't be blockes/ignored). We must kill/stop the entire
pgrp, we must not race with fork() and miss a child.

In this case I _think_ rcu_read_lock() is enough,

rcu_read_lock()

list_for_each_entry_rcu(task, pid->tasks[PIDTYPE_PGID)
group_send_sig_info(sig, task);

rcu_read_unlock();

except group_send_sig_info() can race with mt-exec, but this is simple
to fix.

If we send a signal (not necessary SIGKILL) to a process P, we must see
all childs which were forked by P, both send_signal() and copy_process()
take the same ->siglock, we must see the result of list_add_tail_rcu().
And, after we sent SIGKILL/SIGSTOP, it can't fork the new child.

If list_for_each_entry() does not see the exited process P, this means
we see the result of list_del_rcu(). But this also means we must the
the result of the previous list_add_rcu().

IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(), if we
don't see the new entry on list, we must see the new one, right?

(I am ignoring the case when list_for_each_entry_rcu() sees a process
P but lock_task_sighand(P) fails, I think this is the same as if we
we missed P)

Now suppose a signal is blocked/ignored or has a handler. In this case
we can miss a child, but I think this is OK, we can pretend the new
child was forked after kill_pgrp() completes. Say, this child C was
forked by some process P. We can miss C only if it was forked after
we already sent the signal to P.

However. I do not pretend the reasoning above is "complete", and
perhaps I missed something else.

> Additionally we have the other possibility that if a child is forking
> we send the signal to the parent after the child forks away but before
> the child joins whichever list we are walking, and we complete our
> traversal without seeing the child.

Not sure I understand... But afaics this case is covered above.
->siglock should serialize this, copy_process() does attach_pid()
under this lock.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

|
Pages: 1
Prev: Fix spelling contorller -> controller in comments
Next: Documentation update broken web addresses.