From: Vivek Goyal on
On Wed, Mar 31, 2010 at 11:24:17AM -0400, Neil Horman wrote:
> Flush iommu during shutdown
>
> When using an iommu, it's possible, if a kdump kernel boot follows a primary
> kernel crash, that dma operations might still be in flight from the previous
> kernel during the kdump kernel boot. This can lead to memory corruption,
> crashes, and other erroneous behavior; specifically, I've seen it manifest during
> a kdump boot as endless iommu error log entries of the form:
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
> address=0x000000000245a0c0 flags=0x0070]
>
> Followed by an inability to access hard drives, and various other resources.
>
> I've written this fix for it. In short, it just forces a flush of the in-flight
> dma operations on shutdown, so that the new kernel is certain not to have any
> in-flight dmas trying to complete after we've reset all the iommu page tables,
> causing the above errors. I've tested it and it fixes the problem for me quite
> well.

CCing Eric also.

Neil, this is interesting. In the past we noticed similar issues,
especially on PPC. But I was told that we could not clear the iommu
mapping entries, as we had no control over in-flight DMA, and if a DMA comes
along later after its entry has been cleared, it is an error.

Hence one of the suggestions was not to clear the iommu mapping entries, but to
reserve some for kdump operation and use those in the kdump kernel.

So this call, amd_iommu_flush_all_devices(), will be able to tell devices
not to do any more DMAs, and hence it will be safe to reprogram the iommu
mapping entries?

Thanks
Vivek

>
> Signed-off-by: Neil Horman <nhorman(a)tuxdriver.com>
>
>
> amd_iommu_init.c | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/amd_iommu_init.c b/arch/x86/kernel/amd_iommu_init.c
> index 9dc91b4..8fbdf58 100644
> --- a/arch/x86/kernel/amd_iommu_init.c
> +++ b/arch/x86/kernel/amd_iommu_init.c
> @@ -265,8 +265,26 @@ static void iommu_enable(struct amd_iommu *iommu)
> iommu_feature_enable(iommu, CONTROL_IOMMU_EN);
> }
>
> -static void iommu_disable(struct amd_iommu *iommu)
> +static void iommu_disable(struct amd_iommu *iommu, bool flush)
> {
> +
> + /*
> + * This ensures that all in-flight dmas for this iommu
> + * are complete prior to shutting it down.
> + * It's a bit racy, but I think it's ok, given that if we're flushing
> + * we're in a shutdown path (either a graceful shutdown or a
> + * crash leading to a kdump boot). That means we're down to one
> + * cpu, and the other system hardware isn't going to issue
> + * subsequent dma operations.
> + * Also note that we gate the flushing on the flush boolean because
> + * the enable_iommus path uses this function and we can't flush any
> + * data in that path until later, when the iommus are fully initialized.
> + */
> + if (flush) {
> + amd_iommu_flush_all_devices();
> + amd_iommu_flush_all_domains();
> + }
> +
> /* Disable command buffer */
> iommu_feature_disable(iommu, CONTROL_CMDBUF_EN);
>
> @@ -276,6 +294,7 @@ static void iommu_disable(struct amd_iommu *iommu)
>
> /* Disable IOMMU hardware itself */
> iommu_feature_disable(iommu, CONTROL_IOMMU_EN);
> +
> }
>
> /*
> @@ -1114,7 +1133,7 @@ static void enable_iommus(void)
> struct amd_iommu *iommu;
>
> for_each_iommu(iommu) {
> - iommu_disable(iommu);
> + iommu_disable(iommu, false);
> iommu_set_device_table(iommu);
> iommu_enable_command_buffer(iommu);
> iommu_enable_event_buffer(iommu);
> @@ -1129,7 +1148,7 @@ static void disable_iommus(void)
> struct amd_iommu *iommu;
>
> for_each_iommu(iommu)
> - iommu_disable(iommu);
> + iommu_disable(iommu, true);
> }
>
> /*
From: Neil Horman on
On Wed, Mar 31, 2010 at 11:54:30AM -0400, Vivek Goyal wrote:
> On Wed, Mar 31, 2010 at 11:24:17AM -0400, Neil Horman wrote:
> > Flush iommu during shutdown
> >
> > When using an iommu, it's possible, if a kdump kernel boot follows a primary
> > kernel crash, that dma operations might still be in flight from the previous
> > kernel during the kdump kernel boot. This can lead to memory corruption,
> > crashes, and other erroneous behavior; specifically, I've seen it manifest during
> > a kdump boot as endless iommu error log entries of the form:
> > AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
> > address=0x000000000245a0c0 flags=0x0070]
> >
> > Followed by an inability to access hard drives, and various other resources.
> >
> > I've written this fix for it. In short, it just forces a flush of the in-flight
> > dma operations on shutdown, so that the new kernel is certain not to have any
> > in-flight dmas trying to complete after we've reset all the iommu page tables,
> > causing the above errors. I've tested it and it fixes the problem for me quite
> > well.
>
> CCing Eric also.
>
> Neil, this is interesting. In the past we noticed similar issues,
> especially on PPC. But I was told that we could not clear the iommu
> mapping entries, as we had no control over in-flight DMA, and if a DMA comes
> along later after its entry has been cleared, it is an error.
>
Yes, the problem (as I understand it) is that the triggering of DMA
operations to/from a device isn't synchronized with the iommu itself.
I.e., to conduct a dma you have to:

1) map the in-memory buffer to a dma address using something like
pci_map_single. This results (in systems with an iommu) in page table
space getting allocated in the iommu for the translation.

2) trigger the dma to/from the device by tickling whatever hardware the
device has mapped.

3) complete the dma by calling pci_unmap_single (or another function), which
frees the page table space in the iommu (a rough sketch of this sequence is
below).
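
For illustration, a minimal sketch of that sequence for a made-up PCI driver
(pdev, buf, len and the RX_DESC_ADDR register are placeholders; only the
map/unmap calls are the real API):

#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/io.h>

#define RX_DESC_ADDR    0x10    /* made-up device register offset */

/* Illustrative only: nothing here is from a real driver except the
 * pci_map_single()/pci_unmap_single() calls themselves. */
static void example_rx_dma(struct pci_dev *pdev, void __iomem *ioaddr,
                           void *buf, size_t len)
{
        dma_addr_t handle;

        /* 1) map the buffer; on an iommu system this allocates
         *    translation entries in the iommu page tables */
        handle = pci_map_single(pdev, buf, len, PCI_DMA_FROMDEVICE);

        /* 2) kick the device; the dma is now in flight, with nothing
         *    synchronizing it against the iommu */
        writel(lower_32_bits(handle), ioaddr + RX_DESC_ADDR);

        /* ... the device writes into the buffer at some later point ... */

        /* 3) on completion, unmap; this frees the iommu page table space */
        pci_unmap_single(pdev, handle, len, PCI_DMA_FROMDEVICE);
}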

The problem, exactly as you indicate, is that on a kdump panic, we might boot the
new kernel and re-enable the iommu with these dmas still in flight. If we start
messing about with the iommu page tables then, we start getting all sorts of
errors and various other failures.

> Hence one of the suggestions was not to clear the iommu mapping entries, but to
> reserve some for kdump operation and use those in the kdump kernel.
>
Yeah, that's a solution, but it seems awfully complex to me. To do that, we need
to teach every iommu we support about kdump, by telling it how much space to
reserve, and when to use it and when not to (i.e. we'd have to tell it to use
the kdump space vs. the normal space depending on the status of the
reset_devices flag, or something equally unpleasant).

Actually, thinking about it, I'm not sure that will even work, as IIRC the iommu
only has one page table base pointer. So we would either need to re-write that
pointer to point into the kdump kernel's memory space (invalidating the old table
entries, which perpetuates this bug), or we would need to further enhance the
iommu code to be able to access the old page tables via
read_from_oldmem/write_to_oldmem when booting a kdump kernel, wouldn't we?

Using this method, all we really do is ensure that, prior to disabling
the iommu, any pending dmas are complete. That way, when we
re-enable the iommu in the kdump kernel, we can safely manipulate the new page
tables, knowing that no pending dma is using them.

In fairness to this debate, my proposal does have a small race condition. In
the above sequence, because the cpu triggers a dma independently of the setup of
the mapping in the iommu, it is possible that a dma might be triggered
immediately after we flush the iotlb, which may leave an in-flight dma pending
while we boot the kdump kernel. In practice though, this will never happen. By
the time we arrive at this code, we've already executed
native_machine_crash_shutdown which:

1) halts all the other cpus in the system
2) disables local interrupts

Because of those two events, we're effectively on a path that we can't be
preempted from. So as long as we don't trigger any dma operations between our
return from iommu_shutdown and machine_kexec (which is the next call), we're
safe.
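
For context, the ordering being relied on here looks roughly like this (a
paraphrase of arch/x86/kernel/crash.c with the CONFIG #ifdefs omitted, not
the literal source):

/* Paraphrased shape of native_machine_crash_shutdown(); #ifdefs omitted. */
void native_machine_crash_shutdown(struct pt_regs *regs)
{
        local_irq_disable();            /* no further local interrupts */
        kdump_nmi_shootdown_cpus();     /* halt all the other cpus */

        cpu_emergency_vmxoff();
        cpu_emergency_svm_disable();

        lapic_shutdown();
        disable_IO_APIC();
        hpet_disable();

        x86_platform.iommu_shutdown();  /* the flush + disable discussed here */

        crash_save_cpu(regs, safe_smp_processor_id());
        /* machine_kexec() is the next call after we return */
}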


> So this call, amd_iommu_flush_all_devices(), will be able to tell devices
> not to do any more DMAs, and hence it will be safe to reprogram the iommu
> mapping entries?
>
It blocks the cpu until any pending DMA operations are complete. Hmm, as I
think about it, there is still a small possibility that a device like a NIC,
which has several buffers pre-dma-mapped, could start a new dma before we
completely disable the iommu, although that window is small. I never saw that in my
testing, but hitting it would be fairly difficult, I think, since it's literally
just a few hundred cycles between the flush and the actual hardware disable
operation.

According to this though:
http://support.amd.com/us/Processor_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf
That window could be closed fairly easily, by simply disabling read and write
permissions for each device table entry prior to calling flush. If we do that,
then flush the device table, any subsequently started dma operation would just
get noted in the error log, which we could ignore, since we're about to boot to
the kdump kernel anyway.
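
Something like this is what I have in mind, as a rough sketch against
amd_iommu_init.c (clear_dev_entry_bit() is hypothetical, the inverse of the
existing set_dev_entry_bit(), and the DEV_ENTRY_IR/DEV_ENTRY_IW bit names are
assumed, not verified):

/*
 * Sketch only: revoke DMA read/write permission in every device table
 * entry, then flush, so any dma started after this point faults instead
 * of hitting memory.  clear_dev_entry_bit() and the DEV_ENTRY_* names
 * are assumptions, not existing code.
 */
static void amd_iommu_revoke_dma_permissions(void)
{
        u32 devid;

        for (devid = 0; devid <= amd_iommu_last_bdf; ++devid) {
                clear_dev_entry_bit(devid, DEV_ENTRY_IR);   /* no dma reads */
                clear_dev_entry_bit(devid, DEV_ENTRY_IW);   /* no dma writes */
        }

        /* push the updated device table out and wait for completion */
        amd_iommu_flush_all_devices();
        amd_iommu_flush_all_domains();
}

The shutdown-path iommu_disable() would then call that in place of the bare
flush in the patch above.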

Would you like me to respin w/ that modification?

Neil

>
From: Eric W. Biederman on
Vivek Goyal <vgoyal(a)redhat.com> writes:

> On Wed, Mar 31, 2010 at 11:24:17AM -0400, Neil Horman wrote:
>> Flush iommu during shutdown
>>
>> When using an iommu, it's possible, if a kdump kernel boot follows a primary
>> kernel crash, that dma operations might still be in flight from the previous
>> kernel during the kdump kernel boot. This can lead to memory corruption,
>> crashes, and other erroneous behavior; specifically, I've seen it manifest during
>> a kdump boot as endless iommu error log entries of the form:
>> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.1 domain=0x000d
>> address=0x000000000245a0c0 flags=0x0070]
>>
>> Followed by an inability to access hard drives, and various other resources.
>>
>> I've written this fix for it. In short, it just forces a flush of the in-flight
>> dma operations on shutdown, so that the new kernel is certain not to have any
>> in-flight dmas trying to complete after we've reset all the iommu page tables,
>> causing the above errors. I've tested it and it fixes the problem for me quite
>> well.
>
> CCing Eric also.

Thanks.

> Neil, this is interesting. In the past we noticed similar issues,
> especially on PPC. But I was told that we could not clear the iommu
> mapping entries, as we had no control over in-flight DMA, and if a DMA comes
> along later after its entry has been cleared, it is an error.

Which is exactly what the reported error looks like.

> Hence one of the suggestions was not to clear the iommu mapping entries, but to
> reserve some for kdump operation and use those in the kdump kernel.
>
> So this call, amd_iommu_flush_all_devices(), will be able to tell devices
> not to do any more DMAs, and hence it will be safe to reprogram the iommu
> mapping entries?

I took a quick look at our crash shutdown path and I am very disappointed
with the way it has gone lately.

Regardless of the merits of flushing an iommu versus doing things with an
iommu, I don't see how we are in any better position in the crashing kernel
than we are in the kdump kernel. So what are we doing touching it
in the kdump path?

Likewise for the hpet.

We also seem to be at a point where, if we have a tsc, we don't need to
enable interrupts until we are ready to enable them in native mode. And
except for a few weird SMP 486s, tsc and apics came in at the same time.

So my grumpy code review says we should gut crash.c (like below) and
fix the initialization paths so they do the right thing.

---
arch/x86/kernel/crash.c | 18 ------------------
1 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index a4849c1..8e33c50 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -22,12 +22,10 @@
#include <asm/nmi.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
-#include <asm/hpet.h>
#include <linux/kdebug.h>
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
-#include <asm/x86_init.h>

#if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)

@@ -56,15 +54,11 @@ static void kdump_nmi_callback(int cpu, struct die_args *args)
*/
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();
-
- disable_local_APIC();
}

static void kdump_nmi_shootdown_cpus(void)
{
nmi_shootdown_cpus(kdump_nmi_callback);
-
- disable_local_APIC();
}

#else
@@ -96,17 +90,5 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();

- lapic_shutdown();
-#if defined(CONFIG_X86_IO_APIC)
- disable_IO_APIC();
-#endif
-#ifdef CONFIG_HPET_TIMER
- hpet_disable();
-#endif
-
-#ifdef CONFIG_X86_64
- x86_platform.iommu_shutdown();
-#endif
-
crash_save_cpu(regs, safe_smp_processor_id());
}
--
1.6.5.2.143.g8cc62





From: Eric W. Biederman on
Neil Horman <nhorman(a)tuxdriver.com> writes:

> On Wed, Mar 31, 2010 at 11:54:30AM -0400, Vivek Goyal wrote:

>> So this call, amd_iommu_flush_all_devices(), will be able to tell devices
>> not to do any more DMAs, and hence it will be safe to reprogram the iommu
>> mapping entries?
>>
> It blocks the cpu until any pending DMA operations are complete. Hmm, as I
> think about it, there is still a small possibility that a device like a NIC,
> which has several buffers pre-dma-mapped, could start a new dma before we
> completely disable the iommu, although that window is small. I never saw that in my
> testing, but hitting it would be fairly difficult, I think, since it's literally
> just a few hundred cycles between the flush and the actual hardware disable
> operation.
>
> According to this though:
> http://support.amd.com/us/Processor_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf
> That window could be closed fairly easily, by simply disabling read and write
> permissions for each device table entry prior to calling flush. If we do that,
> then flush the device table, any subsequently started dma operation would just
> get noted in the error log, which we could ignore, since we're about to boot to
> the kdump kernel anyway.
>
> Would you like me to respin w/ that modification?

Disabling permissions on all devices sounds good for the new virtualization
capable iommus. I think older iommus will still be challenged. I think
on x86 we have simply been able to avoid using those older iommus.

I like the direction you are going but please let's put this in a
paranoid iommu enable routine.

Eric
From: Neil Horman on
On Wed, Mar 31, 2010 at 11:57:46AM -0700, Eric W. Biederman wrote:
> Neil Horman <nhorman(a)tuxdriver.com> writes:
>
> > On Wed, Mar 31, 2010 at 11:54:30AM -0400, Vivek Goyal wrote:
>
> >> So this call, amd_iommu_flush_all_devices(), will be able to tell devices
> >> not to do any more DMAs, and hence it will be safe to reprogram the iommu
> >> mapping entries?
> >>
> > It blocks the cpu until any pending DMA operations are complete. Hmm, as I
> > think about it, there is still a small possibility that a device like a NIC,
> > which has several buffers pre-dma-mapped, could start a new dma before we
> > completely disable the iommu, although that window is small. I never saw that in my
> > testing, but hitting it would be fairly difficult, I think, since it's literally
> > just a few hundred cycles between the flush and the actual hardware disable
> > operation.
> >
> > According to this though:
> > http://support.amd.com/us/Processor_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf
> > That window could be closed fairly easily, by simply disabling read and write
> > permissions for each device table entry prior to calling flush. If we do that,
> > then flush the device table, any subsequently started dma operation would just
> > get noted in the error log, which we could ignore, since we're about to boot to
> > the kdump kernel anyway.
> >
> > Would you like me to respin w/ that modification?
>
> Disabling permissions on all devices sounds good for the new virtualization
> capable iommus. I think older iommus will still be challenged. I think
> on x86 we have simply been able to avoid using those older iommus.
>
> I like the direction you are going but please let's put this in a
> paranoid iommu enable routine.
>
You mean like initializing the device table so that all devices are disabled
by default on boot, and then selectively enabling them (perhaps during a
device_attach)? I can give that a spin; something like the sketch below, maybe.
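
A rough, untested sketch of that idea, as it might sit in amd_iommu_init.c
(set_dev_entry_bit() and amd_iommu_last_bdf are the existing helpers there;
the exact DEV_ENTRY_* bit names and the attach-time grant are assumptions):

/*
 * Sketch: build the device table with translation enabled but with the
 * IR/IW (dma read/write permission) bits left clear, so nothing can dma
 * until we explicitly grant it, e.g. when a driver attaches the device.
 * Bit names follow amd_iommu_init.c conventions but this is not a tested
 * change.
 */
static void init_device_table_paranoid(void)
{
        u32 devid;

        for (devid = 0; devid <= amd_iommu_last_bdf; ++devid) {
                set_dev_entry_bit(devid, DEV_ENTRY_VALID);
                set_dev_entry_bit(devid, DEV_ENTRY_TRANSLATION);
                /* deliberately not setting DEV_ENTRY_IR / DEV_ENTRY_IW here;
                 * they would be granted later, from the device attach path */
        }
}

The grant would then happen wherever the driver binds the device to a
protection domain.
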
Neil

> Eric