From: Linus Torvalds on 14 Jul 2010 12:30

On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>> I think you're vastly overestimating what is sane to do from an NMI
>> context. It is utterly and totally insane to assume vmalloc is
>> available in NMI.

I agree that NMI handlers shouldn't touch vmalloc space. But now that
percpu data is mapped through the VM, I do agree that other CPUs may
potentially need to touch that data, and an interrupt (including an
NMI) might be the first to create the mapping. And that's why the
faulting code needs to be interrupt-safe for the vmalloc area.

However, it does look like the current scheduler should make it safe to
access "current->mm->pgd" from regular interrupts, so the problem is
apparently only an NMI issue.

So exactly what are the circumstances that create and expose percpu
data on a CPU _without_ mapping it on that CPU?

IOW, I'm missing some background here. I agree that at least some basic
percpu data should generally be available to an NMI handler, but at the
same time I wonder how come that basic percpu data wasn't already
mapped. The traditional irq vmalloc race was something like:

 - one CPU does a "fork()", which copies the basic kernel mappings

 - in another thread a driver does a vmalloc(), which creates a _new_
   mapping that didn't get copied

 - later on a switch_to() switches to the newly forked process that
   missed the vmalloc initialization

 - we take an interrupt for the driver that needed the new vmalloc
   space, but now it doesn't have it, and we fill it in at run-time for
   the (rare) race

and I'm simply not seeing how fork() could ever race with percpu data
setup.

So please just document the sequence that actually needs the page table
setup for the NMI/percpu case.

This patch (1/2) doesn't look horrible per se. I have no problems with
it. I just want to understand why it is needed.

                 Linus
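[ To make the last step of that race concrete, here is a minimal sketch
of the lazy fix-up, modelled loosely on x86's vmalloc_fault(). The
function name and the pgd-only walk are simplifications for
illustration, not the actual kernel code; the real fix-up walks from
the CR3 value rather than ->current, for exactly the reason discussed
later in this thread: ]

/*
 * Minimal sketch of the lazy vmalloc fault fix-up (illustrative only;
 * the pgd-only walk and the use of current->active_mm are
 * simplifications of the real arch/x86/mm/fault.c code).
 */
static int vmalloc_fault_sketch(unsigned long address)
{
        pgd_t *pgd, *pgd_ref;

        if (address < VMALLOC_START || address >= VMALLOC_END)
                return -1;              /* not a vmalloc address: real fault */

        /* The reference (init_mm) page tables always have the mapping. */
        pgd_ref = pgd_offset_k(address);
        if (pgd_none(*pgd_ref))
                return -1;              /* mapped nowhere: real fault */

        /* This process forked before the vmalloc() and missed the entry. */
        pgd = pgd_offset(current->active_mm, address);
        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_ref); /* copy the shared kernel entry */

        return 0;                       /* fixed up; retry the access */
}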
From: Mathieu Desnoyers on 14 Jul 2010 13:10

* Linus Torvalds (torvalds(a)linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
> <mathieu.desnoyers(a)efficios.com> wrote:

(I was quoting Peter Anvin below) ;)

> >> I think you're vastly overestimating what is sane to do from an NMI
> >> context. It is utterly and totally insane to assume vmalloc is
> >> available in NMI.
>
> I agree that NMI handlers shouldn't touch vmalloc space. But now that
> percpu data is mapped through the VM, I do agree that other CPUs may
> potentially need to touch that data, and an interrupt (including an
> NMI) might be the first to create the mapping.
[...]
> So please just document the sequence that actually needs the page table
> setup for the NMI/percpu case.
>
> This patch (1/2) doesn't look horrible per se. I have no problems with
> it. I just want to understand why it is needed.

The problem originally addressed by this patch is the case where an NMI
handler tries to access vmalloc'd per-cpu data, which goes as follows:

- One CPU does a fork(), which copies the basic kernel mappings.
- Perf allocates percpu memory for buffer control data structures.
  This mapping does not get copied.
- Tracing is activated.
- switch_to() to the newly forked process which missed the new percpu
  allocation.
- We take an NMI, which touches the vmalloc'd percpu memory in the perf
  tracing handler, therefore leading to a page fault in NMI context.
  Here, we might be in the middle of switch_to(), where ->current might
  not be in sync with the current cr3 register.

The three choices I am aware of for handling this are:

1) Supporting page faults in NMI context, which implies removing the
   ->current dependency and supporting an iret-less return path.
2) Duplicating the percpu alloc API with a variant that maps to
   kmalloc.
3) Using vmalloc_sync_all() after creating the mapping; this only works
   on x86_64, not x86_32 (see the sketch after this message).

Choice 3 seems like a no-go on x86_32, and choice 2 seems like a last
resort (it involves API duplication and reserving a fixed amount of
per-cpu memory at boot). Hence the proposal of choice 1.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
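[ A minimal sketch of choice 3, assuming a hypothetical perf_buffers
structure as a stand-in for perf's real buffer control data;
alloc_percpu() and vmalloc_sync_all() are the real interfaces, but this
exact placement is illustrative only: ]

/* Hypothetical stand-in for perf's buffer control structure. */
struct perf_buffers {
        unsigned long   head;
        unsigned long   tail;
        void            *data;
};

static struct perf_buffers __percpu *alloc_perf_buffers(void)
{
        struct perf_buffers __percpu *buf;

        buf = alloc_percpu(struct perf_buffers);
        if (!buf)
                return NULL;

        /*
         * Choice 3: propagate the new vmalloc-backed percpu mapping into
         * every page table right away, so no pgd is missing it and an NMI
         * touching the buffer never faults. As noted above, this only
         * does the required work on x86_64, not x86_32.
         */
        vmalloc_sync_all();
        return buf;
}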
From: Linus Torvalds on 14 Jul 2010 14:20

On Wed, Jul 14, 2010 at 10:06 AM, Mathieu Desnoyers
<mathieu.desnoyers(a)efficios.com> wrote:
>> This patch (1/2) doesn't look horrible per se. I have no problems with
>> it. I just want to understand why it is needed.

[ And patch 2/2 is much more intrusive, and touches a critical path
too.. If it was just the 1/2 series, I don't think I would care. For
the 2/2, I think I'd want to explore all the alternative options ]

> The problem originally addressed by this patch is the case where an NMI
> handler tries to access vmalloc'd per-cpu data, which goes as follows:
>
> - One CPU does a fork(), which copies the basic kernel mappings.
> - Perf allocates percpu memory for buffer control data structures.
>   This mapping does not get copied.
> - Tracing is activated.
> - switch_to() to the newly forked process which missed the new percpu
>   allocation.
> - We take an NMI, which touches the vmalloc'd percpu memory in the perf
>   tracing handler, therefore leading to a page fault in NMI context.
>   Here, we might be in the middle of switch_to(), where ->current might
>   not be in sync with the current cr3 register.

Ok. I was wondering why anybody would allocate core percpu variables so
late that this would ever be an issue, but I guess perf is a reasonable
such case. And reasonable to do from NMI.

That said - grr. I really wish there was some other alternative than
adding yet more complexity to the exception return path. That "iret
re-enables NMIs unconditionally" thing annoys me.

In fact, I wonder if we couldn't just do a software NMI disable
instead? Have a per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just make
the NMI code itself do something like

 NMI entry:
 - load the percpu NMI stack frame pointer
 - if non-zero, we know we're nested, and should ignore this NMI:
    - we're returning to kernel mode, so return immediately by using
      "popf/ret", which also keeps NMIs disabled in the hardware until
      the "real" NMI iret happens
    - before the popf/ret, use the NMI stack pointer to make the NMI
      return stack be invalid and cause a fault
 - set the NMI stack pointer to the current stack pointer

 NMI exit (not the above "immediate exit because we nested"):
 - clear the percpu NMI stack pointer
 - just do the iret

Now, the thing is, the "iret" is atomic. If we had a nested NMI, we'll
take a fault, and that re-does our "delayed" NMI - and NMIs will stay
masked. And if we didn't have a nested NMI, that iret will now unmask
NMIs, and everything is happy.

Doesn't the above sound like a good solution? In other words, we solve
the whole problem by simply _fixing_ the crazy Intel "iret-vs-NMI"
semantics. And we don't need to change the hot path, and we'll just
_allow_ nested faults within NMIs.

Am I missing something? Maybe I'm not as clever as I think I am... but
I _feel_ clever.

                 Linus
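[ For concreteness, a rough C rendering of the scheme above; the real
thing would have to live in the arch entry assembly, and the helpers
marked as hypothetical do not exist in the kernel: ]

static DEFINE_PER_CPU(unsigned long, nmi_frame);        /* 0 = not in NMI */

void nmi_handler_sketch(struct pt_regs *regs)
{
        if (__this_cpu_read(nmi_frame)) {
                /*
                 * Nested NMI: make the outer NMI's saved return frame
                 * invalid so that its eventual iret faults and the fault
                 * handler re-does this NMI, then leave via "popf/ret",
                 * which keeps NMIs masked in hardware.
                 */
                invalidate_outer_nmi_frame();   /* hypothetical helper */
                return_via_popf_ret();          /* hypothetical helper */
        }

        /* Outermost NMI: record where our stack frame lives. */
        __this_cpu_write(nmi_frame, (unsigned long)regs);

        do_nmi_work(regs);                      /* hypothetical helper */

        /*
         * NMI exit: clear the marker, then just do the iret. If a nested
         * NMI arrived meanwhile, the iret faults and the NMI is re-done
         * with NMIs still masked; otherwise the iret unmasks NMIs.
         */
        __this_cpu_write(nmi_frame, 0);
}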
From: Ingo Molnar on 14 Jul 2010 14:50

* Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

> Ok. I was wondering why anybody would allocate core percpu variables so
> late that this would ever be an issue, but I guess perf is a reasonable
> such case. And reasonable to do from NMI.

Yeah. Frederic (re-)discovered this problem via very hard to debug
crashes when he extended perf call-graph tracing to have a somewhat
larger buffer and used percpu_alloc() for it (which is entirely
reasonable in itself).

> That said - grr. I really wish there was some other alternative than
> adding yet more complexity to the exception return path. That "iret
> re-enables NMIs unconditionally" thing annoys me.

Ok. We can solve it by allocating the space from the non-vmalloc percpu
area - 8K per CPU.

> In fact, I wonder if we couldn't just do a software NMI disable
> instead? Have a per-cpu variable (in the _core_ percpu areas that get
> allocated statically) that points to the NMI stack frame, and just make
> the NMI code itself do something like
>
>  NMI entry:

I think at this point [NMI re-entry] we've corrupted the top of the NMI
kernel stack already, due to entering via the IST stack mechanism, which
is non-nesting and which enters at the same point - right?

We could solve that by copying that small stack frame off before
entering the 'generic' NMI routine - but it all feels a bit pulled in by
the hair.

I feel uneasy about taking page faults from the NMI handler. Even if we
implemented it all correctly, who knows what CPU errata are waiting
there to be discovered, etc...

I think we should try to muddle through by preventing these situations
from happening (and adding a WARN_ONCE() to the vmalloc page-fault
handler would certainly help as well), and only go to more clever
schemes if no other option looks sane anymore.

Thanks,

        Ingo
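[ A minimal sketch of the WARN_ONCE() idea; in_nmi() and WARN_ONCE()
are real kernel interfaces, but the exact placement inside the vmalloc
fault path is illustrative: ]

#include <linux/kernel.h>
#include <linux/hardirq.h>

/* At the top of the arch's vmalloc fault fix-up: */
static int vmalloc_fault_checked_sketch(unsigned long address)
{
        /* Scream (once) if we ever get here from NMI context. */
        WARN_ONCE(in_nmi(), "vmalloc fault at %#lx in NMI context\n",
                  address);

        /* ... the normal lazy page-table fix-up follows ... */
        return 0;
}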
From: Linus Torvalds on 14 Jul 2010 15:20

On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >  NMI entry:
>
> I think at this point [NMI re-entry] we've corrupted the top of the NMI
> kernel stack already, due to entering via the IST stack mechanism, which
> is non-nesting and which enters at the same point - right?

Yeah, you're right, but we could easily fix that up. We know we don't
need any stack for the nested case, so all we would need to do is to
just subtract a small bit off %rsp, and copy the three words or so to
create a "new" stack for the non-nested case.

> We could solve that by copying that small stack frame off before
> entering the 'generic' NMI routine - but it all feels a bit pulled in
> by the hair.

Why? It's much cleaner than making the _real_ code paths much worse.

                 Linus
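[ A sketch of that fix-up, assuming the five-word x86_64 hardware frame
(Linus says "three words or so"; the exact count is beside the point).
The helper is illustrative; the real change would be in the NMI entry
assembly: ]

#define NMI_HW_FRAME_WORDS 5    /* rip, cs, rflags, rsp, ss on x86_64 */

/*
 * The IST mechanism always enters at the same stack top, so a nested
 * NMI would clobber the outer NMI's hardware-saved frame. Copy the
 * frame down to a private spot before doing any real work, then switch
 * %rsp to the copy.
 */
static unsigned long *relocate_nmi_frame_sketch(unsigned long *ist_top)
{
        unsigned long *copy = ist_top - NMI_HW_FRAME_WORDS;
        int i;

        for (i = 0; i < NMI_HW_FRAME_WORDS; i++)
                copy[i] = ist_top[i];

        return copy;    /* the "new" stack for the non-nested case */
}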