From: Andrew Morton on 14 Jun 2010 20:10

On Fri, 11 Jun 2010 15:49:54 -0700 Salman <sqazi(a)google.com> wrote:

> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program. This is really bad for bash implementation. Furthermore,
> many shell scripts assume that pid numbers will not be used for some length
> of time.
>
> Race Description:
>
> ...
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..fbbd5f6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -122,6 +122,43 @@ static void free_pidmap(struct upid *upid)
>  	atomic_inc(&map->nr_free);
>  }
>
> +/*
> + * If we started walking pids at 'base', is 'a' seen before 'b'?
> + */
> +static int pid_before(int base, int a, int b)
> +{
> +	/*
> +	 * This is the same as saying
> +	 *
> +	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
> +	 * and that mapping orders 'a' and 'b' with respect to 'base'.
> +	 */
> +	return (unsigned)(a - base) < (unsigned)(b - base);
> +}

pid.c uses an exotic mix of `int' and `pid_t' to represent pids. `int'
seems to preponderate.

> +/*
> + * We might be racing with someone else trying to set pid_ns->last_pid.
> + * We want the winner to have the "later" value, because if the
> + * "earlier" value prevails, then a pid may get reused immediately.
> + *
> + * Since pids rollover, it is not sufficient to just pick the bigger
> + * value. We have to consider where we started counting from.
> + *
> + * 'base' is the value of pid_ns->last_pid that we observed when
> + * we started looking for a pid.
> + *
> + * 'pid' is the pid that we eventually found.
> + */
> +static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
> +{
> +	int prev;
> +	int last_write = base;
> +	do {
> +		prev = last_write;
> +		last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
> +	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
> +}

<gets distracted>

hm.
For a long time cmpxchg() wasn't available on all architectures. That
_seems_ to have been fixed.

arch/score assumes that cmpxchg() operates on unsigned longs.

arch/powerpc plays the necessary games to make 4- and 8-byte scalars
work.

ia64 handles 1, 2, 4 and 8-byte quantities.

arm handles 1, 2 and 4-byte scalars, as does blackfin.

So from the few architectures I looked at, it seems that we do indeed
handle cmpxchg() on all architectures, although not very consistently.
arch/score will blow up if someone tries to use cmpxchg() on 1- or
2-byte scalars.

<looks at the consumers>

infiniband does cmpxchg() on u64*'s, which will blow up on many
architectures.

Using

	grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/

I can't see any cmpxchg() callers in truly generic code. lockdep and
kernel/trace/ring_buffer.c aren't used on the more remote
architectures, I think.

Traditionally, atomic_cmpxchg() was the safe and portable one to use.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
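
The wraparound comparison and the retry loop from the patch quoted in this
message can be tried out in userspace. What follows is a minimal sketch under
stated assumptions, not the kernel code: C11 <stdatomic.h> stands in for the
kernel's cmpxchg(), and the global last_pid is a hypothetical stand-in for
pid_ns->last_pid. Unlike the patch, the operands are converted to unsigned
before subtracting, so the wraparound never relies on signed overflow.

```c
#include <stdatomic.h>

/* If we started walking pids at 'base', is 'a' seen before 'b'? */
static int pid_before(int base, int a, int b)
{
	/* Unsigned subtraction wraps, so distances are measured from 'base'. */
	return ((unsigned)a - (unsigned)base) < ((unsigned)b - (unsigned)base);
}

/* Hypothetical userspace stand-in for pid_ns->last_pid. */
static _Atomic int last_pid;

/*
 * Publish 'pid' as the last allocated pid unless a racing writer has
 * already published a value that comes later when counting from 'base'.
 */
static void set_last_pid(int base, int pid)
{
	int expected = base;

	/* On failure, 'expected' is overwritten with the value actually seen. */
	while (!atomic_compare_exchange_strong(&last_pid, &expected, pid)) {
		if (!pid_before(base, expected, pid))
			break;	/* the other writer's value is not earlier; keep it */
	}
}
```

With base = 32000, a pid of 32500 is ordered before a wrapped-around pid of 5,
which is the case a plain "pick the bigger value" rule gets wrong.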
From: tytso on 14 Jun 2010 21:00

On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> Using
>
> 	grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/
>
> I can't see any cmpxchg() callers in truly generic code. lockdep and
> kernel/trace/ring_buffer.c aren't used on the more remote
> architectures, I think.

What about:

drivers/gpu/drm/drm_lock.c: prev = cmpxchg(lock, old, new);
kernel/lockdep.c: n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de
kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
fs/btrfs/inode.c: if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
fs/ext4/inode.c: } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl

The last is quite new --- I had just recently done a similar set of
research as you did before accepting the patch that added cmpxchg into
ext4 (during the last merge window), and I thought cmpxchg() had
entered the "supported by all architectures" category. It looked like
it had only recently reached that state, but I had reached the
conclusion that it was safe to use.

- Ted
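
The ext4 line in the list above is the tail of a compare-and-swap retry loop on
a 4-byte flags word. A rough userspace sketch of that general pattern follows.
It assumes C11 <stdatomic.h> rather than the kernel's cmpxchg(), and the names
flags_word and update_flags are hypothetical, chosen only for illustration; it
is not the ext4 code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a 4-byte flags field such as ei->i_flags. */
static _Atomic uint32_t flags_word;

/* Replace the bits selected by 'mask' with 'bits'; other bits are kept. */
static uint32_t update_flags(uint32_t mask, uint32_t bits)
{
	uint32_t old_fl = atomic_load(&flags_word);
	uint32_t new_fl;

	do {
		new_fl = (old_fl & ~mask) | (bits & mask);
		/* On failure old_fl is refreshed and new_fl is recomputed. */
	} while (!atomic_compare_exchange_weak(&flags_word, &old_fl, new_fl));

	return new_fl;
}
```

The word being swapped is 4 bytes wide, which is the size every architecture
surveyed earlier in the thread is reported to handle.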
From: Andrew Morton on 14 Jun 2010 22:00

On Mon, 14 Jun 2010 20:56:19 -0400 tytso(a)mit.edu wrote:

> On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> > Using
> >
> > 	grep -r '[ ]cmpxchg[^_]' . | grep -v /arch/
> >
> > I can't see any cmpxchg() callers in truly generic code. lockdep and
> > kernel/trace/ring_buffer.c aren't used on the more remote
> > architectures, I think.
>
> What about:
>
> drivers/gpu/drm/drm_lock.c: prev = cmpxchg(lock, old, new);
> kernel/lockdep.c: n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de

I put these in the not-used-on-weird-architectures bucket.

> kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc

I guess that'll flush out any stragglers.

I suspect sched_clock.c might be generating fair amounts of code which
UP builds don't need.

> fs/btrfs/inode.c: if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
> fs/ext4/inode.c: } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl
>
> The last is quite new --- I had just recently done a similar set of
> research as you did before accepting the patch that added cmpxchg into
> ext4 (during the last merge window), and I thought cmpxchg() had
> entered the "supported by all architectures" category. It looked like
> it had only recently reached that state, but I had reached the
> conclusion that it was safe to use.

I think you're probably right, as long as one sticks with 4-byte
scalars.

The cmpxchg-is-now-generic change snuck in under the radar (mine, at
least).
From: Paul Mackerras on 14 Jun 2010 23:30

On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:

> > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
>
> I guess that'll flush out any stragglers.

And break most non-x86 32-bit architectures, including 32-bit powerpc.

Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
is set, and it looks like only x86 and ia64 set it.

Paul.
From: Andrew Morton on 15 Jun 2010 00:30

On Tue, 15 Jun 2010 13:26:08 +1000 Paul Mackerras <paulus(a)samba.org> wrote:

> On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
>
> > > kernel/sched_clock.c: if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> >
> > I guess that'll flush out any stragglers.
>
> And break most non-x86 32-bit architectures, including 32-bit powerpc.

If CONFIG_SMP=y, yes. On UP there's a generic implementation
(include/asm-generic/cmpxchg-local.h, include/asm-generic/cmpxchg.h).

> Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> is set, and it looks like only x86 and ia64 set it.
>

If that happens then the best fix is for those architectures to get
themselves a cmpxchg64(). Unless for some reason it's simply
unimplementable?

Worst case I guess one could use a global spinlock. Second-worst-case:
hashed spinlocks.
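
The "hashed spinlocks" fallback mentioned above can be illustrated with a small
userspace sketch. Everything here is an assumption made for illustration: the
names cmpxchg64_hashed, lock_for and NR_LOCKS are hypothetical, C11 atomics
stand in for kernel spinlocks, and a real kernel version would also have to
deal with interrupts. The result is only atomic with respect to other accesses
that go through the same helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_LOCKS 16	/* hypothetical table size */

static_assert((NR_LOCKS & (NR_LOCKS - 1)) == 0, "NR_LOCKS must be a power of two");

/* Zero-initialized static atomics start out valid; 0 means unlocked. */
static atomic_int lock_table[NR_LOCKS];

/* Pick a lock from the address, so unrelated words rarely share one. */
static atomic_int *lock_for(const void *addr)
{
	uintptr_t h = (uintptr_t)addr >> 3;	/* ignore bits inside one 8-byte word */

	return &lock_table[h & (NR_LOCKS - 1)];
}

/* Returns the value found at *ptr; the store happened iff that equals 'old'. */
static uint64_t cmpxchg64_hashed(uint64_t *ptr, uint64_t old, uint64_t new_val)
{
	atomic_int *lock = lock_for(ptr);
	uint64_t prev;

	while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
		;	/* spin until the previous holder stores 0 */

	prev = *ptr;
	if (prev == old)
		*ptr = new_val;

	atomic_store_explicit(lock, 0, memory_order_release);
	return prev;
}
```

The global-spinlock variant is the same code with NR_LOCKS set to 1; hashing on
the address only reduces contention between unrelated 64-bit words.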