From: Mathieu Desnoyers on 16 Jul 2010 15:20

Hi Linus,

What I omitted in my original description paragraph is that I also test for
NMIs nested over NMI "regular code" with a "nesting" per-cpu flag, which
deals with the concerns you expressed in your reply about function calls
and traps.

I'm self-replying to keep track of Avi's comment about the need to
save/restore cr2 at the beginning/end of the NMI handler, so we don't end
up corrupting a VM CR2 if we have the following scenario: trap in VM, NMI,
trap in NMI. So I added cr2 awareness to the code snippet below; we should
be close to having something that starts to make sense (although I'm not
saying it's bug-free yet). ;)

Please note that I'll be off on vacation for 2 weeks starting this evening
(back on August 2) without Internet access, so my answers might be delayed.

Thanks!

Mathieu

Code originally written by Linus Torvalds, modified by Mathieu Desnoyers,
intending to handle the fake NMI entry gracefully given that NMIs are not
necessarily disabled at the entry point. It uses a "need fake NMI" flag
rather than playing games with CS and faults. When a fake NMI is needed, it
simply jumps back to the beginning of the regular NMI code.

NMI exit code and fake NMI entry are made reentrant with respect to NMI
handler interruption by testing, at the very beginning of the NMI handler,
whether an NMI is nested over the whole nmi_atomic .. nmi_atomic_end code
region. It also tests for nested NMIs by keeping a per-cpu "nmi nested"
flag; this deals with detection of nesting over the "regular nmi"
execution. This code assumes NMIs have a separate stack.

#
# Two per-cpu variables: an "are we nested" flag (one byte), and
# a "do we need to execute a fake NMI" flag (one byte).
# The %rsp at which the stack copy is saved is at a fixed address, which
# leaves enough room at the bottom of the NMI stack for the "real" NMI
# entry stack. This assumes we have a separate NMI stack.
# The NMI stack copy top of stack is at nmi_stack_copy.
# The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to
# nmi_stack_copy-40.
#
nmi:
	# Test if nested over atomic code.
	cmpq $nmi_atomic,0(%rsp)
	jae nmi_addr_is_ae
	# Test if nested over general NMI code.
	cmpb $0,%__percpu_seg:nmi_stack_nesting
	jne nmi_nested_set_fake_and_return
	# create new stack
is_unnested_nmi:
	# Save some space for nested NMI's. The exception itself
	# will never use more space, but it might use less (since
	# it will be a kernel-kernel transition).
	# Save %rax on top of the stack (need to temporarily use it)
	pushq %rax
	movq %rsp, %rax
	movq $nmi_stack_copy,%rsp
	# copy the five words of stack info. rip starts at 8+0(%rax).
	# cr2 is saved at nmi_stack_copy_rip+40
	pushq %cr2		# save cr2 to handle nesting over page faults
	pushq 8+32(%rax)	# ss
	pushq 8+24(%rax)	# rsp
	pushq 8+16(%rax)	# eflags
	pushq 8+8(%rax)		# cs
	pushq 8+0(%rax)		# rip
	movq 0(%rax),%rax	# restore %rax
set_nmi_nesting:
	# and set the nesting flags
	movb $0xff,%__percpu_seg:nmi_stack_nesting
regular_nmi_code:
	...
	# regular NMI code goes here, and can take faults,
	# because this sequence now has proper nested-nmi
	# handling
	...
nmi_atomic:
	# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region
	# will be handled specially. This includes the fake NMI entry point.
	cmpb $0,%__percpu_seg:need_fake_nmi
	jne fake_nmi
	movb $0,%__percpu_seg:nmi_stack_nesting
	# restore cr2
	movq nmi_stack_copy_rip+40,%cr2
	iret
	# This is the fake NMI entry point.
fake_nmi:
	movb $0x0,%__percpu_seg:need_fake_nmi
	jmp regular_nmi_code
nmi_atomic_end:

	# Make sure the address is in the nmi_atomic range and in CS segment.
nmi_addr_is_ae:
	cmpq $nmi_atomic_end,0(%rsp)
	jae is_unnested_nmi
	# The saved rip points to the final NMI iret. Check the CS segment to
	# make sure.
	cmpw $__KERNEL_CS,8(%rsp)
	jne is_unnested_nmi
	# This is the case when we hit just as we're supposed to do the
	# atomic code of a previous nmi. We run the NMI using the old return
	# address that is still on the stack, rather than copy the new one
	# that is bogus and points to where the nested NMI interrupted the
	# original NMI handler!
	# Easy: just set the stack pointer to point to the stack copy, clear
	# need_fake_nmi (because we are directly going to execute the
	# requested NMI) and jump to "nesting flag set" (which is followed by
	# regular nmi code execution).
	movq $nmi_stack_copy_rip,%rsp
	movb $0x0,%__percpu_seg:need_fake_nmi
	jmp set_nmi_nesting

	# This is the actual nested case. Make sure we branch to the fake NMI
	# handler after this handler is done.
nmi_nested_set_fake_and_return:
	movb $0xff,%__percpu_seg:need_fake_nmi
	popfq
	jmp *(%rsp)

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
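[Editorial note: the control flow of Mathieu's "need fake NMI" scheme can be
sketched in plain C as a thought experiment. This is not kernel code: the
flag names mirror the per-cpu variables in the assembly above, and the
nested NMI is simulated by an ordinary recursive call, which a real NMI
obviously is not. It only illustrates the deferral logic: a nested NMI sets
a flag and returns, and the atomic exit region re-runs the handler.]

```c
#include <assert.h>
#include <stdbool.h>

static bool nmi_stack_nesting;  /* "are we nested" flag */
static bool need_fake_nmi;      /* "do we need to execute a fake NMI" flag */
static int handled_nmis;

static void nmi_entry(void);

/* Stands in for regular_nmi_code; the first invocation simulates a
 * second NMI arriving while the handler is still running. */
static void regular_nmi_code(void)
{
	handled_nmis++;
	if (handled_nmis == 1)
		nmi_entry();            /* nested NMI hits mid-handler */
}

static void nmi_entry(void)
{
	if (nmi_stack_nesting) {        /* nested over the regular NMI code */
		need_fake_nmi = true;   /* defer: request a "fake" re-entry */
		return;
	}
	nmi_stack_nesting = true;       /* movb $0xff,...:nmi_stack_nesting */
	do {
		regular_nmi_code();
		/* nmi_atomic exit region: re-run if an NMI was deferred */
		if (need_fake_nmi) {
			need_fake_nmi = false;
			continue;       /* jmp regular_nmi_code */
		}
		break;
	} while (1);
	nmi_stack_nesting = false;      /* then iret */
}
```

With this simulation, one outer NMI plus one nested NMI results in the
handler body running twice and both flags ending up clear, which is the
invariant the assembly is trying to maintain.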
From: Avi Kivity on 18 Jul 2010 07:10

On 07/15/2010 04:23 AM, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <torvalds(a)linux-foundation.org> wrote:
>
>> I think the %rip check should be pretty simple - exactly because there
>> is only a single point where the race is open between that 'mov' and
>> the 'iret'. So it's simpler than the (similar) thing we do for
>> debug/nmi stack fixup for sysenter that has to check a range.
>>
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
>
> Totally untested, of course. But _something_ like this might work:
>
> 	#
> 	# Two per-cpu variables: an "are we nested" flag (one byte), and
> 	# a "if we're nested, what is the %rsp for the nested case".
> 	#
> 	# The reason for why we can't just clear the saved-rsp field and
> 	# use that as the flag is that we actually want to know the saved
> 	# rsp for the special case of having a nested NMI happen on the
> 	# final iret of the unnested case.
> 	#
> 	nmi:
> 		cmpb $0,%__percpu_seg:nmi_stack_nesting
> 		jne nmi_nested_corrupt_and_return
> 		cmpq $nmi_iret_address,0(%rsp)
> 		je nmi_might_be_nested
> 	# create new stack
> 	is_unnested_nmi:
> 		# Save some space for nested NMI's. The exception itself
> 		# will never use more space, but it might use less (since
> 		# it will be a kernel-kernel transition). But the nested
> 		# exception will want two save registers and a place to
> 		# save the original CS that it will corrupt
> 		subq $64,%rsp
>
> 		# copy the five words of stack info. 96 = 64 + stack
> 		# offset of ss.
> 		pushq 96(%rsp)		# ss
> 		pushq 96(%rsp)		# rsp
> 		pushq 96(%rsp)		# eflags
> 		pushq 96(%rsp)		# cs
> 		pushq 96(%rsp)		# rip
>
> 		# and set the nesting flags
> 		movq %rsp,%__percpu_seg:nmi_stack_ptr
> 		movb $0xff,%__percpu_seg:nmi_stack_nesting
>

By trading off some memory, we don't need this trickery. We can allocate
two NMI stacks, so the code becomes:

nmi:
	cmpb $0, %__percpu_seg:nmi_stack_nesting
	je unnested_nmi
	cmpq $nmi_iret,(%rsp)
	jne unnested_nmi
	cmpw $__KERNEL_CS,8(%rsp)
	jne unnested_nmi
	popf
	retfq
unnested_nmi:
	xorq $(nmi_stack_1 ^ nmi_stack_2),%__percpu_seg:tss_nmi_ist_entry
	movb $1, %__percpu_seg:nmi_stack_nesting
regular_nmi:
	...
regular_nmi_end:
	movb $0, %__percpu_seg:nmi_stack_nesting
nmi_iret:
	iretq

--
error compiling committee.c: too many arguments to function
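[Editorial note: the one-instruction stack switch in Avi's unnested_nmi
path relies on the identity that xor-ing a value with (a ^ b) maps a to b
and b to a. A minimal C sketch of just that toggle, with hypothetical stack
addresses standing in for the real per-cpu nmi_stack_1/nmi_stack_2:]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8kB-apart per-CPU NMI stack addresses (illustrative only). */
#define NMI_STACK_1 0xffff880000010000ull
#define NMI_STACK_2 0xffff880000012000ull

/* Stands in for the IST entry in the TSS that the CPU loads %rsp from
 * on NMI delivery. */
static uint64_t tss_nmi_ist_entry = NMI_STACK_1;

/* Mirrors "xorq $(nmi_stack_1 ^ nmi_stack_2),%__percpu_seg:tss_nmi_ist_entry":
 * each unnested NMI flips the IST slot to the other stack, so a nested NMI
 * is delivered onto its own frame instead of clobbering the live one. */
static void switch_nmi_stack(void)
{
	tss_nmi_ist_entry ^= (NMI_STACK_1 ^ NMI_STACK_2);
}
```

Because the xor constant is fixed, the toggle needs no branch and no
read-modify-write of a separate "which stack" variable; applying it twice
returns to the original stack.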
From: Linus Torvalds on 18 Jul 2010 13:40

On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity <avi(a)redhat.com> wrote:
>
> By trading off some memory, we don't need this trickery. We can allocate
> two nmi stacks, so the code becomes:

I really don't think you need even that. See earlier in the discussion
about how we could just test %rsp itself. Which makes all the %rip testing
totally unnecessary, because we don't even need any flags, and we have no
races because %rsp is atomically changed with taking the exception.

Lookie here, the %rsp comparison really isn't that hard:

	nmi:
		pushq %rax
		pushq %rdx
		movq %rsp,%rdx		# current stack top
		movq 40(%rsp),%rax	# old stack top
		xor %rax,%rdx		# same 8kB aligned area?
		shrq $13,%rdx		# ignore low 13 bits
		je it_is_a_nested_nmi	# looks nested..
	non_nested:
		...
		... ok, we're not nested, do normal NMI handling ...
		...
		popq %rdx
		popq %rax
		iret

	it_is_a_nested_nmi:
		# double-check that it really was a nested exception
		cmpw $0,48(%rsp)
		jne non_nested		# from user space or something..
		# this is the nested case
		# NOTE! NMI's are blocked, we don't take any exceptions etc etc
		addq $-160,%rax		# 128-byte redzone on the old stack + 4 words
		movq (%rsp),%rdx
		movq %rdx,(%rax)	# old %rdx
		movq 8(%rsp),%rdx
		movq %rdx,8(%rax)	# old %rax
		movq 32(%rsp),%rdx
		movq %rdx,16(%rax)	# old %rflags
		movq 16(%rsp),%rdx
		movq %rdx,24(%rax)	# old %rip
		movq %rax,%rsp
		popq %rdx
		popq %rax
		popf
		ret $128		# restore %rip and %rsp

doesn't that look pretty simple?

NOTE! OBVIOUSLY TOTALLY UNTESTED!

	Linus
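[Editorial note: the nesting test at the top of Linus's sketch is just an
alignment trick: two stack pointers lie in the same 8kB-aligned (2^13-byte)
region iff xor-ing them and shifting out the low 13 bits leaves zero. A
small C sketch of that predicate, assuming - as the thread does - that the
NMI stack is 8kB in size and 8kB aligned; the addresses in the test are
made up for illustration:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Equivalent of:
 *	xor %rax,%rdx		# same 8kB aligned area?
 *	shrq $13,%rdx		# ignore low 13 bits
 *	je it_is_a_nested_nmi
 * Any two addresses inside one 8kB-aligned block agree on all bits
 * above bit 12, so the xor has only low-13-bit differences. */
static bool same_8k_region(uint64_t rsp_now, uint64_t rsp_saved)
{
	return ((rsp_now ^ rsp_saved) >> 13) == 0;
}
```

This is also why Linus later notes the NMI stack must be aligned for the
"ignore low bits" check to work: without alignment, two addresses on the
same stack could straddle a 2^13 boundary and the xor test would misfire.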
From: Avi Kivity on 18 Jul 2010 14:10

On 07/18/2010 08:36 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity <avi(a)redhat.com> wrote:
>> By trading off some memory, we don't need this trickery. We can
>> allocate two nmi stacks, so the code becomes:
>
> I really don't think you need even that. See earlier in the discussion
> about how we could just test %rsp itself. Which makes all the %rip
> testing totally unnecessary, because we don't even need any flags, and
> we have no races because %rsp is atomically changed with taking the
> exception.
>
> Lookie here, the %rsp comparison really isn't that hard:
>
> 	nmi:
> 		pushq %rax
> 		pushq %rdx
> 		movq %rsp,%rdx		# current stack top
> 		movq 40(%rsp),%rax	# old stack top
> 		xor %rax,%rdx		# same 8kB aligned area?
> 		shrq $13,%rdx		# ignore low 13 bits
> 		je it_is_a_nested_nmi	# looks nested..
>
> ....
> doesn't that look pretty simple?

Too simple - an MCE will switch to its own stack, failing the test. Now
that we have correctable MCEs, that's not a good idea. So the plain
everyday sequence

	NMI
	#PF
	MCE (uncompleted)
	NMI

will fail.

Plus, even in the non-nested case, you have to copy the stack frame, or
the nested NMI will corrupt it. With stack switching, the nested NMI is
allocated its own frame.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.
From: Linus Torvalds on 18 Jul 2010 14:20
On Sun, Jul 18, 2010 at 10:36 AM, Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> Lookie here, the %rsp comparison really isn't that hard:

A few notes on that (still) untested code suggestion:

> 	nmi:
> 		pushq %rax
> 		pushq %rdx
> 		movq %rsp,%rdx		# current stack top
> 		movq 40(%rsp),%rax	# old stack top
> 		xor %rax,%rdx		# same 8kB aligned area?
> 		shrq $13,%rdx		# ignore low 13 bits
> 		je it_is_a_nested_nmi	# looks nested..
> 	non_nested:
> 		...
> 		... ok, we're not nested, do normal NMI handling ...
> 		...

The non_nested case still needs to start off with moving its stack frame
to a safe area that won't be overwritten by any nesting NMI's (note that
they cannot nest at this point, since we've done nothing that can fault).
So we'd still need that

	7* pushq 48(%rsp)

which copies the five words that got pushed by hardware, and the two
register-save locations that we used for the nesting check and special
return. After we've done those 7 pushes, we can then run code that may
take a fault. Because when the fault returns with an "iret" and re-enables
NMI's, our nesting code is ready.

So all told, we need a maximum of about 216 bytes of stack for the nested
NMI case: 56 bytes for the seven copied words, and the 160 bytes that we
build up _under_ the stack pointer for the nested case. And we need the
NMI stack itself to be aligned in order for that "ignore low bits" check
to work. Although we don't actually have to do that "xor+shr", we could do
the test equally well with a "sub + unsigned compare against stack size".

Other than that, I think the extra test that we're really nested might
better be done differently:

> 	it_is_a_nested_nmi:
> 		cmpw $0,48(%rsp)	# double-check that it really was a nested exception
> 		jne non_nested		# from user space or something..
> 		# this is the nested case

It might be safer to check the saved CS rather than the saved SS on the
stack to see that we really are in kernel mode. It's possible that
somebody could load a NULL SS in user mode and then just not use the
stack - and try to make it look like they are in kernel mode for when the
NMI happens. Now, I _think_ that loading a zero SS is supposed to trap,
but checking CS is still likely to be the better test for "were we in
kernel mode". That's where the CPL is really encoded, after all.

So that "cmpw $0,48(%rsp)" is probably ok, but it would likely be better
to do it as

	testl $3,24(%rsp)
	jne non_nested

instead. That's what entry_64.S does everywhere else.

Oh, and the non-nested case obviously needs all the regular "make the
kernel state look right" code. Like the swapgs stuff etc if required. My
example code was really meant to just document the nesting handling, not
the existing stuff we already need to do with save_paranoid etc.

And I really think it should work, but I'd again like to stress that it's
just a RFD code sequence with no testing what-so-ever etc.

	Linus
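[Editorial note: the "testl $3,24(%rsp)" Linus recommends works because the
low two bits of a segment selector are the RPL, and the saved CS of an
interrupted context therefore carries the privilege level the CPU was
running at. A minimal C sketch of the predicate; the selector values follow
the conventional x86-64 Linux layout (__KERNEL_CS = 0x10, __USER_CS = 0x33)
but are hard-coded here only for illustration:]

```c
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_CS 0x10	/* kernel code segment selector, RPL 0 */
#define USER_CS   0x33	/* user code segment selector, RPL 3 */

/* Equivalent of "testl $3,24(%rsp); jne non_nested": the interrupted
 * context was in kernel mode iff the RPL bits of its saved CS are zero. */
static bool came_from_kernel(uint16_t saved_cs)
{
	return (saved_cs & 3) == 0;
}
```

Checking CS rather than SS sidesteps the concern in the message above: SS
can legitimately be a NULL selector in 64-bit mode, so a zero saved SS is
not proof of kernel mode, while the CPL encoded in CS is authoritative.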