From: Andi Kleen on 24 Jun 2010 10:10 > Please, as Peter and Boris asked you already, quote a concrete, specific > example: It was already in my answer to Peter. > > 'Specific event X occurs, kernel wants/needs to do Y. This cannot be done > via the suggested method due to Z.' > > Your generic arguments look wrong (to the extent they are specified) and it > makes it much easier and faster to address your points if you dont blur them > by vagaries. It's one of the fundamental properties of recoverable errors. Error happens. Machine check or NMI or other exception happens. That exception runs on the exception stack The error is not fatal, but recoverable. For example you want to kill a process or call hwpoison or do some other recovery action. These generally have to sleep to do anything interesting. You cannot do the sleeping on the exception stack, so you push it to another context. Now just because an error is recoverable doesn't mean it's not critical (I think that was the mistake Boris made). If you don't do something (like killing or recovery) you could end up in a loop or consume corrupted data or something else bad. So the error has to have a fail safe path from detection to handling. That's quite different from logging or performance counting etc. where dropping events on overload is normal and expected. Normally it can be only done by using dedicated resources. -Andi -- ak(a)linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on 24 Jun 2010 11:50 From: Andi Kleen <andi(a)firstfloor.org> Date: Thu, Jun 24, 2010 at 10:01:43AM -0400 > > Please, as Peter and Boris asked you already, quote a concrete, specific > > example: > > It was already in my answer to Peter. > > > > > 'Specific event X occurs, kernel wants/needs to do Y. This cannot be done > > via the suggested method due to Z.' > > > > Your generic arguments look wrong (to the extent they are specified) and it > > makes it much easier and faster to address your points if you dont blur them > > by vagaries. > > It's one of the fundamental properties of recoverable errors. > > Error happens. > Machine check or NMI or other exception happens. > That exception runs on the exception stack > The error is not fatal, but recoverable. > For example you want to kill a process or call hwpoison or do some other > recovery action. These generally have to sleep to do anything > interesting. > You cannot do the sleeping on the exception stack, so you push it to > another context. > > Now just because an error is recoverable doesn't mean it's not critical > (I think that was the mistake Boris made). It wasn't a mistake - I was simply trying to lure you into giving a more concrete example so that we all land on the same page and we know what the heck you/we/all are talking about. > If you don't do something > (like killing or recovery) you could end up in a loop or consume > corrupted data or something else bad. > > So the error has to have a fail safe path from detection to handling. So we are talking about a more involved and "could-sleep" error recovery. > That's quite different from logging or performance counting etc. > where dropping events on overload is normal and expected. So I went back and reread the whole thread, and correct me if I'm wrong but the whole run softirq after NMI has one use case for now - "could-sleep" error handling for MCEs _only_ on x86. So you're changing a bunch of generic and x86 kernel code just for error handling. Hmm, that's a kinda big hammer in my book. A slimmer solution is a much better way to go, IMHO. I think Peter said something about irq_exit(), which should be just fine. But AFAICT an arch-specific solution would be even better, e.g. if you call into your deferred work helper from paranoid_exit in <arch/x86/kernel/entry_64.S>. I.e, something like #ifdef CONFIG_X86_MCE testl $_TIF_NEED_POST_NMI,%ebx jnz do_post_nmi_work #endif Or even slimmer, rewrite the paranoidzeroentry to a MCE-specific variant which does the added functionality. But that wouldn't be extensible if other entities want post-NMI work later. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on 24 Jun 2010 12:10 On Thu, Jun 24, 2010 at 05:41:24PM +0200, Borislav Petkov wrote: > > If you don't do something > > (like killing or recovery) you could end up in a loop or consume > > corrupted data or something else bad. > > > > So the error has to have a fail safe path from detection to handling. > > So we are talking about a more involved and "could-sleep" error > recovery. That's one case, there are other too. > > > That's quite different from logging or performance counting etc. > > where dropping events on overload is normal and expected. > > So I went back and reread the whole thread, and correct me if I'm > wrong but the whole run softirq after NMI has one use case for now - > "could-sleep" error handling for MCEs _only_ on x86. So you're changing Nope, there are multiple use cases. Today it's background MCE and possibly perf if it ever decides to share code with the rest of the kernel instead of wanting to be Bork of Linux. Future ones would be more MCE errors and also non MCE errors like NMIs. > a bunch of generic and x86 kernel code just for error handling. Hmm, > that's a kinda big hammer in my book. Actually no, it would just make the current code slightly cleaner and somewhat more general. But for most cases it works without it. > > A slimmer solution is a much better way to go, IMHO. I think Peter said > something about irq_exit(), which should be just fine. The "slimmer solution" is there, but it has some limitations. I merely said that softirqs would be useful for solving these limitations (but are not strictly needed) Anyways slimmer solution was even originally proposed, just some of the earlier review proposed softirqs instead. So Ying posts softirqs and then he gets now flamed for posting softirqs. Overall there wasn't much consistency in the suggestions, three different reviewers suggested three incompatible approaches. Anyways if there are no softirqs that's fine too, the error handler can probably live with not having that. > But AFAICT an arch-specific solution would be even better, e.g. > if you call into your deferred work helper from paranoid_exit in > <arch/x86/kernel/entry_64.S>. I.e, something like Yes that helps for part of the error handling (in fact this has been implemented), but that does not solve the self interrupt problem which requires delaying until next cli. -Andi -- ak(a)linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Huang Ying on 24 Jun 2010 22:20 On Thu, 2010-06-24 at 14:35 +0800, Peter Zijlstra wrote: > Something like this, but filled out with some arch code that does the > self-ipi and calls irq_work_run() should do. > > No need to molest the softirq code, no need for limited vectors of any > kind. Now, as far as my understanding goes, hard IRQ based solution is acceptable for everyone. Ingo and Andi, Do you agree? > Signed-off-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl> > --- > include/linux/irq_callback.h | 13 ++++++++ > kernel/irq_callback.c | 66 +++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 79 insertions(+) > > Index: linux-2.6/include/linux/irq_callback.h > =================================================================== > --- /dev/null > +++ linux-2.6/include/linux/irq_callback.h > @@ -0,0 +1,13 @@ > +#ifndef _LINUX_IRQ_CALLBACK_H > +#define _LINUX_IRQ_CALLBACK_H > + > +struct irq_work { > + struct irq_work *next; > + void (*func)(struct irq_work *); > +}; It is better to add "void *data" field in this struct to allow same function can be used for multiple struct irq_work. And I think IRQ is the implementation detail here, so irq_work is probably not a good name. nmi_return_notifier or nmi_callback is better? > +int irq_work_queue(struct irq_work *entry, void (*func)(struct irq_work *)); > +void irq_work_run(void); > +void irq_work_sync(struct irq_work *entry); > + > +#endif /* _LINUX_IRQ_CALLBACK_H */ > Index: linux-2.6/kernel/irq_callback.c > =================================================================== > --- /dev/null > +++ linux-2.6/kernel/irq_callback.c > @@ -0,0 +1,66 @@ > + > +#include <linux/irq_callback.h> > + > +#define CALLBACK_TAIL ((struct irq_work *)-1UL) > + > +static DEFINE_PER_CPU(struct irq_work *, irq_work_list) = { > + CALLBACK_TAIL, > +}; > + > +int irq_work_queue(struct irq_work *entry, void (*func)(struct irq_work *)) > +{ > + struct irq_work **head; > + > + if (cmpxchg(&entry->next, NULL, CALLBACK_TAIL) != NULL) > + return 0; > + > + entry->func = func; > + > + head = &get_cpu_var(irq_work_list); > + > + do { > + entry->next = *head; > + } while (cmpxchg(head, entry->next, entry) != entry->next); > + > + if (entry->next == CALLBACK_TAIL) > + arch_self_ipi(); > + > + put_cpu_var(irq_work_list); > + return 1; > +} > + > +void irq_work_run(void) > +{ > + struct irq_work *list; > + > + list = xchg(&__get_cpu_var(irq_work_list), CALLBACK_TAIL); > + while (list != CALLBACK_TAIL) { > + struct irq_work *entry = list; > + > + list = list->next; > + entry->func(entry); > + > + entry->next = NULL; entry->next = NULL should be put before entry->func(entry), so that we will not lose a notification from NMI. And maybe check irq_work_list for several times to make sure nothing is lost. > + /* > + * matches the mb in cmpxchg() in irq_work_queue() > + */ > + smp_wmb(); > + } > +} I don't know why we need smp_wmb() here and smp_rmb() in irq_work_pending(). The smp_<x>mb() in original perf_pending_xxx code is not necessary too. Because smp_<x>mb is invoked in wake_up_process() and __wait_event() already. > +static int irq_work_pending(struct irq_work *entry) > +{ > + /* > + * matches the wmb in irq_work_run > + */ > + smp_rmb(); > + return entry->next != NULL; > +} > + > +void irq_work_sync(struct irq_work *entry) > +{ > + WARN_ON_ONCE(irqs_disabled()); > + > + while (irq_work_pending(entry)) > + cpu_relax(); > +} If we move entry->next = NULL earlier in irq_work_run(), we need another flag to signify the entry->func is running here. Best Regards, Huang Ying -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on 25 Jun 2010 04:20
On Fri, 2010-06-25 at 10:12 +0800, Huang Ying wrote: > > It is better to add "void *data" field in this struct to allow same > function can be used for multiple struct irq_work. No, simply do: struct my_foo { struct irq_work work; /* my extra data */ } void my_func(struct irq_work *work) { struct my_foo *foo = container_of(work, struct my_foo, work); /* tada! */ } > And I think IRQ is the implementation detail here, so irq_work is > probably not a good name. nmi_return_notifier or nmi_callback is better? Well, its ran in hard-irq context, so its an irq work. There's nothing that says it can only be used from NMI context. > > +void irq_work_run(void) > > +{ > > + struct irq_work *list; > > + > > + list = xchg(&__get_cpu_var(irq_work_list), CALLBACK_TAIL); > > + while (list != CALLBACK_TAIL) { > > + struct irq_work *entry = list; > > + > > + list = list->next; > > + entry->func(entry); > > + > > + entry->next = NULL; > > entry->next = NULL should be put before entry->func(entry), so that we > will not lose a notification from NMI. And maybe check irq_work_list for > several times to make sure nothing is lost. But then _sync() will return before its done executing. I think clearing after the function is done executing is the only sane semantics (and yes, I should fix the current perf code). You can always miss an NMI since it can always happen before the callback gets done, and allowing another enqueue before the callback is complete is asking for trouble. > > + /* > > + * matches the mb in cmpxchg() in irq_work_queue() > > + */ > > + smp_wmb(); > > + } > > +} > > I don't know why we need smp_wmb() here and smp_rmb() in > irq_work_pending(). The smp_<x>mb() in original perf_pending_xxx code is > not necessary too. Because smp_<x>mb is invoked in wake_up_process() and > __wait_event() already. The smp_wmb() wants to be before ->next = NULL; so that all writes are completed before we release the entry. To that same effect _sync() and _queue need the (r)mb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |