From: Cyrill Gorcunov on 16 Apr 2010 10:40

On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
[...]
> > +
> > +/* Callback function for perf event subsystem */
> > +void watchdog_overflow_callback(struct perf_event *event, int nmi,
> > +		struct perf_sample_data *data,
> > +		struct pt_regs *regs)
> > +{
> > +	int this_cpu = smp_processor_id();
> > +	unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu);
> > +
> > +	if (touch_ts == 0) {
> > +		__touch_watchdog();
> > +		return;
> > +	}
> > +
> > +	/* check for a hardlockup
> > +	 * This is done by making sure our timer interrupt
> > +	 * is incrementing. The timer interrupt should have
> > +	 * fired multiple times before we overflow'd. If it hasn't
> > +	 * then this is a good indication the cpu is stuck
> > +	 */
> > +	if (is_hardlockup(this_cpu)) {
> > +		/* only print hardlockups once */
> > +		if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask)))
> > +			return;
> > +
> > +		if (hardlockup_panic)
> > +			panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > +		else
> > +			WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > +
> > +		cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
>
> May be have an arch spin lock there to update your cpu mask safely.

Hmm, this is NMI handler path so from what we protect this per-cpu data?
Do I miss something? /me confused

	-- Cyrill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Frederic Weisbecker on 16 Apr 2010 10:50

On Fri, Apr 16, 2010 at 06:32:32PM +0400, Cyrill Gorcunov wrote:
> On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> [...]
> > > +
> > > +/* Callback function for perf event subsystem */
> > > +void watchdog_overflow_callback(struct perf_event *event, int nmi,
> > > +		struct perf_sample_data *data,
> > > +		struct pt_regs *regs)
> > > +{
> > > +	int this_cpu = smp_processor_id();
> > > +	unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu);
> > > +
> > > +	if (touch_ts == 0) {
> > > +		__touch_watchdog();
> > > +		return;
> > > +	}
> > > +
> > > +	/* check for a hardlockup
> > > +	 * This is done by making sure our timer interrupt
> > > +	 * is incrementing. The timer interrupt should have
> > > +	 * fired multiple times before we overflow'd. If it hasn't
> > > +	 * then this is a good indication the cpu is stuck
> > > +	 */
> > > +	if (is_hardlockup(this_cpu)) {
> > > +		/* only print hardlockups once */
> > > +		if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask)))
> > > +			return;
> > > +
> > > +		if (hardlockup_panic)
> > > +			panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > +		else
> > > +			WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > +
> > > +		cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
> >
> > May be have an arch spin lock there to update your cpu mask safely.
>
> Hmm, this is NMI handler path so from what we protect this per-cpu data?
> Do I miss something? /me confused

The cpu mask is not per cpu here, this is a shared bitmap, so you
can race against other cpus NMIs.

That said, as I suggested, having a per cpu var that we set when we
warned would be much better than a spinlock here.
From: Don Zickus on 16 Apr 2010 10:50

On Fri, Apr 16, 2010 at 06:32:32PM +0400, Cyrill Gorcunov wrote:
> On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> [...]
> > > +
> > > +/* Callback function for perf event subsystem */
> > > +void watchdog_overflow_callback(struct perf_event *event, int nmi,
> > > +		struct perf_sample_data *data,
> > > +		struct pt_regs *regs)
> > > +{
> > > +	int this_cpu = smp_processor_id();
> > > +	unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu);
> > > +
> > > +	if (touch_ts == 0) {
> > > +		__touch_watchdog();
> > > +		return;
> > > +	}
> > > +
> > > +	/* check for a hardlockup
> > > +	 * This is done by making sure our timer interrupt
> > > +	 * is incrementing. The timer interrupt should have
> > > +	 * fired multiple times before we overflow'd. If it hasn't
> > > +	 * then this is a good indication the cpu is stuck
> > > +	 */
> > > +	if (is_hardlockup(this_cpu)) {
> > > +		/* only print hardlockups once */
> > > +		if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask)))
> > > +			return;
> > > +
> > > +		if (hardlockup_panic)
> > > +			panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > +		else
> > > +			WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > +
> > > +		cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
> >
> > May be have an arch spin lock there to update your cpu mask safely.
>
> Hmm, this is NMI handler path so from what we protect this per-cpu data?
> Do I miss something? /me confused

It's not per-cpu which is the problem I believe. If you have multiple
cpus dealing with hardlockups then they can overwrite each other's data.

Cheers,
Don
From: Frederic Weisbecker on 16 Apr 2010 11:00

On Fri, Apr 16, 2010 at 04:53:11PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 16:46 +0200, Frederic Weisbecker wrote:
> > > > May be have an arch spin lock there to update your cpu mask safely.
> > >
> > > Hmm, this is NMI handler path so from what we protect this per-cpu data?
> > > Do I miss something? /me confused
> >
> > The cpu mask is not per cpu here, this is a shared bitmap, so you
> > can race against other cpus NMIs.
> >
> > That said, as I suggested, having a per cpu var that we set when we
> > warned would be much better than a spinlock here.
>
> Every time you think NMI and spinlock think FAIL.

In fact I was first inspired by the x86 nmi watchdog handler that does
this spinlock to protect cpumask, but realize just after my FAIL ;-)
From: Peter Zijlstra on 16 Apr 2010 11:00
On Fri, 2010-04-16 at 16:46 +0200, Frederic Weisbecker wrote:
> > > May be have an arch spin lock there to update your cpu mask safely.
> >
> > Hmm, this is NMI handler path so from what we protect this per-cpu data?
> > Do I miss something? /me confused
>
> The cpu mask is not per cpu here, this is a shared bitmap, so you
> can race against other cpus NMIs.
>
> That said, as I suggested, having a per cpu var that we set when we
> warned would be much better than a spinlock here.

Every time you think NMI and spinlock think FAIL.