From: Vivek Goyal on 31 Mar 2010 12:00

On Wed, Mar 31, 2010 at 11:24:17AM -0400, Neil Horman wrote:
> Flush iommu during shutdown
>
> When using an iommu, it's possible, if a kdump kernel boot follows a
> primary kernel crash, that DMA operations might still be in flight from
> the previous kernel during the kdump kernel boot. This can lead to
> memory corruption, crashes, and other erroneous behavior. Specifically,
> I've seen it manifest during a kdump boot as endless iommu error log
> entries of the form:
>
>   AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
>           address=0x000000000245a0c0 flags=0x0070]
>
> followed by an inability to access hard drives and various other
> resources.
>
> I've written this fix for it. In short, it just forces a flush of the
> in-flight DMA operations on shutdown, so that the new kernel is certain
> not to have any in-flight DMAs trying to complete after we've reset all
> the iommu page tables, which would cause the above errors. I've tested
> it and it fixes the problem for me quite well.

CCing Eric also.

Neil, this is interesting. In the past we noticed similar issues,
especially on PPC. But I was told that we could not clear the iommu
mapping entries, as we had no control over in-flight DMA, and if a DMA
comes in later after clearing an entry and the entry is not present, it
is an error.

Hence one of the suggestions was not to clear the iommu mapping entries,
but to reserve some for the kdump operation and use those in the kdump
kernel.

So this call to amd_iommu_flush_all_devices() will be able to tell
devices not to do any more DMAs, and hence it will be safe to reprogram
the iommu mapping entries.

Thanks
Vivek

> Signed-off-by: Neil Horman <nhorman(a)tuxdriver.com>
>
>  amd_iommu_init.c |   25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/amd_iommu_init.c b/arch/x86/kernel/amd_iommu_init.c
> index 9dc91b4..8fbdf58 100644
> --- a/arch/x86/kernel/amd_iommu_init.c
> +++ b/arch/x86/kernel/amd_iommu_init.c
> @@ -265,8 +265,26 @@ static void iommu_enable(struct amd_iommu *iommu)
>  	iommu_feature_enable(iommu, CONTROL_IOMMU_EN);
>  }
>
> -static void iommu_disable(struct amd_iommu *iommu)
> +static void iommu_disable(struct amd_iommu *iommu, bool flush)
>  {
> +	/*
> +	 * This ensures that all in-flight DMAs for this iommu are
> +	 * complete prior to shutting it down.  It's a bit racy, but I
> +	 * think it's OK, given that if we're flushing, we're in a
> +	 * shutdown path (either a graceful shutdown or a crash leading
> +	 * to a kdump boot).  That means we're down to one cpu, and the
> +	 * other system hardware isn't going to issue subsequent dma
> +	 * operations.
> +	 *
> +	 * Also note that we gate the flushing on the flush boolean
> +	 * because the enable_iommus path uses this function, and we
> +	 * can't flush any data in that path until later, when the
> +	 * iommus are fully initialized.
> +	 */
> +	if (flush) {
> +		amd_iommu_flush_all_devices();
> +		amd_iommu_flush_all_domains();
> +	}
> +
>  	/* Disable command buffer */
>  	iommu_feature_disable(iommu, CONTROL_CMDBUF_EN);
>
> @@ -276,6 +294,7 @@ static void iommu_disable(struct amd_iommu *iommu)
>
>  	/* Disable IOMMU hardware itself */
>  	iommu_feature_disable(iommu, CONTROL_IOMMU_EN);
> +
>  }
>
>  /*
> @@ -1114,7 +1133,7 @@ static void enable_iommus(void)
>  	struct amd_iommu *iommu;
>
>  	for_each_iommu(iommu) {
> -		iommu_disable(iommu);
> +		iommu_disable(iommu, false);
>  		iommu_set_device_table(iommu);
>  		iommu_enable_command_buffer(iommu);
>  		iommu_enable_event_buffer(iommu);
> @@ -1129,7 +1148,7 @@ static void disable_iommus(void)
>  	struct amd_iommu *iommu;
>
>  	for_each_iommu(iommu)
> -		iommu_disable(iommu);
> +		iommu_disable(iommu, true);
>  }
>
>  /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Neil Horman on 31 Mar 2010 14:30

On Wed, Mar 31, 2010 at 11:54:30AM -0400, Vivek Goyal wrote:
> On Wed, Mar 31, 2010 at 11:24:17AM -0400, Neil Horman wrote:
> > Flush iommu during shutdown
> >
> > [original patch description snipped]
>
> CCing Eric also.
>
> Neil, this is interesting. In the past we noticed similar issues,
> especially on PPC. But I was told that we could not clear the iommu
> mapping entries, as we had no control over in-flight DMA, and if a DMA
> comes in later after clearing an entry and the entry is not present, it
> is an error.

Yes, the problem (as I understand it) is that the triggering of DMA
operations to/from a device isn't synchronized with the iommu itself.
I.e., to conduct a DMA you have to:

1) Map the in-memory buffer to a dma address using something like
   pci_map_single. In systems with an iommu, this results in page table
   space being allocated in the iommu for the translation.

2) Trigger the dma to/from the device by tickling whatever hardware the
   device has mapped.

3) Complete the dma by calling pci_unmap_single (or another function),
   which frees the page table space in the iommu.

The problem, exactly as you indicate, is that on a kdump panic we might
boot the new kernel and re-enable the iommu with these DMAs still in
flight. If we then start messing about with the iommu page tables, we
start getting all sorts of errors and various other failures.

> Hence one of the suggestions was not to clear the iommu mapping
> entries, but to reserve some for the kdump operation and use those in
> the kdump kernel.

Yeah, that's a solution, but it seems awfully complex to me. To do that,
we would need to teach every iommu we support about kdump, by telling it
how much space to reserve, and when to use it and when not to (i.e. we'd
have to tell it to use the kdump space vs. the normal space depending on
the status of the reset_devices flag, or something equally unpleasant).

Actually, thinking about it, I'm not sure that will even work, as IIRC
the iommu only has one page table base pointer. So we would either need
to rewrite that pointer to point into the kdump kernel's memory space
(invalidating the old table entries, which perpetuates this bug), or we
would need to further enhance the iommu code to be able to access the
old page tables via read_from_oldmem/write_to_oldmem when booting a
kdump kernel, wouldn't we?

Using this method, all we really do is ensure that, prior to disabling
the iommu, any pending DMAs are complete. That way, when we re-enable
the iommu in the kdump kernel, we can safely manipulate the new page
tables, knowing that no pending DMA is using them.

In fairness to this debate, my proposal does have a small race
condition. In the above sequence, because the cpu triggers a dma
independently of the setup of the mapping in the iommu, it is possible
that a dma might be triggered immediately after we flush the iotlb,
which may leave an in-flight dma pending while we boot the kdump kernel.
In practice, though, this will never happen. By the time we arrive at
this code, we've already executed native_machine_crash_shutdown, which:

1) halts all the other cpus in the system
2) disables local interrupts

Because of those two events, we're effectively on a path that we can't
be preempted from. So as long as we don't trigger any dma operations
between our return from iommu_shutdown and machine_kexec (which is the
next call), we're safe.

> So this call to amd_iommu_flush_all_devices() will be able to tell
> devices not to do any more DMAs, and hence it will be safe to
> reprogram the iommu mapping entries.

It blocks the cpu until any pending DMA operations are complete. Hmm, as
I think about it, there is still a small possibility that a device like
a NIC, which has several buffers pre-dma-mapped, could start a new dma
before we completely disable the iommu, although that window is small. I
never saw it in my testing, and hitting it would be fairly difficult I
think, since it's literally just a few hundred cycles between the flush
and the actual hardware disable operation.

According to this, though:
http://support.amd.com/us/Processor_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf
that window could be closed fairly easily, simply by disabling read and
write permissions for each device table entry prior to calling flush. If
we do that and then flush the device table, any subsequently started dma
operation would just get noted in the error log, which we could ignore,
since we're about to boot to the kdump kernel anyway.

Would you like me to respin w/ that modification?

Neil
From: Eric W. Biederman on 31 Mar 2010 14:50

Vivek Goyal <vgoyal(a)redhat.com> writes:
> On Wed, Mar 31, 2010 at 11:24:17AM -0400, Neil Horman wrote:
>> Flush iommu during shutdown
>>
>> [original patch description snipped]
>
> CCing Eric also.

Thanks.

> Neil, this is interesting. In the past we noticed similar issues,
> especially on PPC. But I was told that we could not clear the iommu
> mapping entries, as we had no control over in-flight DMA, and if a DMA
> comes in later after clearing an entry and the entry is not present,
> it is an error.

Which is exactly what the reported error looks like.

> Hence one of the suggestions was not to clear the iommu mapping
> entries, but to reserve some for the kdump operation and use those in
> the kdump kernel.
>
> So this call to amd_iommu_flush_all_devices() will be able to tell
> devices not to do any more DMAs, and hence it will be safe to
> reprogram the iommu mapping entries.

I took a quick look at our crash shutdown path, and I am very
disappointed with the way it has gone lately.

Regardless of the merits of flushing an iommu versus doing things with
an iommu, I don't see how we are in any better position in the crash
kernel than we are in the kdump kernel. So what are we doing touching it
in the kdump path? Likewise for the hpet.

We also seem to be at a point where, if we have a tsc, we don't need to
enable interrupts until we are ready to enable them in native mode. And
except for a few weird SMP 486s, tsc and apics came in at the same time.

So my grumpy code review says we should gut crash.c (like below) and fix
the initialization paths so they do the right thing.

---
 arch/x86/kernel/crash.c |   18 ------------------
 1 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index a4849c1..8e33c50 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -22,12 +22,10 @@
 #include <asm/nmi.h>
 #include <asm/hw_irq.h>
 #include <asm/apic.h>
-#include <asm/hpet.h>
 #include <linux/kdebug.h>
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
-#include <asm/x86_init.h>

 #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)

@@ -56,15 +54,11 @@ static void kdump_nmi_callback(int cpu, struct die_args *args)
 	 */
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();
-
-	disable_local_APIC();
 }

 static void kdump_nmi_shootdown_cpus(void)
 {
 	nmi_shootdown_cpus(kdump_nmi_callback);
-
-	disable_local_APIC();
 }

 #else

@@ -96,17 +90,5 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	cpu_emergency_vmxoff();
 	cpu_emergency_svm_disable();

-	lapic_shutdown();
-#if defined(CONFIG_X86_IO_APIC)
-	disable_IO_APIC();
-#endif
-#ifdef CONFIG_HPET_TIMER
-	hpet_disable();
-#endif
-
-#ifdef CONFIG_X86_64
-	x86_platform.iommu_shutdown();
-#endif
-
 	crash_save_cpu(regs, safe_smp_processor_id());
 }
--
1.6.5.2.143.g8cc62
From: Eric W. Biederman on 31 Mar 2010 15:00

Neil Horman <nhorman(a)tuxdriver.com> writes:
> On Wed, Mar 31, 2010 at 11:54:30AM -0400, Vivek Goyal wrote:
>> So this call to amd_iommu_flush_all_devices() will be able to tell
>> devices not to do any more DMAs, and hence it will be safe to
>> reprogram the iommu mapping entries.
>
> It blocks the cpu until any pending DMA operations are complete.
> [discussion of the remaining race window, and of disabling device
> table read/write permissions before the flush, snipped]
>
> Would you like me to respin w/ that modification?

Disabling permissions on all devices sounds good for the new
virtualization-capable iommus. I think older iommus will still be
challenged. I think on x86 we have simply been able to avoid using those
older iommus.

I like the direction you are going, but please let's put this in a
paranoid iommu enable routine.

Eric
From: Neil Horman on 31 Mar 2010 15:20

On Wed, Mar 31, 2010 at 11:57:46AM -0700, Eric W. Biederman wrote:
> Neil Horman <nhorman(a)tuxdriver.com> writes:
>> [earlier discussion snipped]
>
> Disabling permissions on all devices sounds good for the new
> virtualization-capable iommus. I think older iommus will still be
> challenged. I think on x86 we have simply been able to avoid using
> those older iommus.
>
> I like the direction you are going, but please let's put this in a
> paranoid iommu enable routine.

You mean like initializing the device table so that all devices are
disabled by default on boot, and then selectively enabling them (perhaps
during a device_attach)? I can give that a spin.

Neil