sky2: receive dma mapping error handling [Kernel]

From: Michael Breuer on 28 Jan 2010 19:10

From: Michael Breuer on 30 Jan 2010 11:40

On 01/28/2010 06:36 PM, Stephen Hemminger wrote:
> Please try this patch (and only this patch), on 2.6.33-rc5[*];
> none of the other patches that did not make it upstream because that
> confuses things too much.
>
> The code that checks for DMA mapping errors on receive buffers would
> not handle errors correctly. I doubt you have these errors, but if you
> did then it would explain the problems. The code has to be a little
> tricky and build mapping for new rx buffer before releasing old one,
> that way if new mapping fails, the old one can be reused.
>
> If it works for you, I will resubmit with signed-off.
>
> -
>
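The ordering described in the quoted text (map the replacement buffer first, release the old one only if that succeeds) can be sketched in plain userspace C. This is an illustration of the pattern only, not the sky2 code: `struct rx_slot`, `rx_slot_swap()` and the two `map_always_*` hooks are hypothetical names, and `malloc()` stands in for skb allocation plus DMA mapping.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one receive ring slot; not the sky2 struct. */
struct rx_slot {
	void *buf;	/* buffer currently posted to the "hardware" */
	size_t len;
};

/* Hypothetical mapping hook: returns 0 on success, nonzero on failure. */
typedef int (*map_fn)(void *buf, size_t len);

static int map_always_ok(void *buf, size_t len)
{
	(void)buf; (void)len;
	return 0;
}

static int map_always_fail(void *buf, size_t len)
{
	(void)buf; (void)len;
	return -1;
}

/*
 * Try to replace the buffer in @slot. The new buffer is allocated and
 * mapped first; the old buffer is handed to the caller only after both
 * steps succeed. On any failure the slot is left untouched and NULL is
 * returned, so the caller can drop the packet and repost the old buffer.
 */
static void *rx_slot_swap(struct rx_slot *slot, map_fn map)
{
	void *fresh = malloc(slot->len);
	void *old;

	if (!fresh)
		return NULL;
	if (map(fresh, slot->len) != 0) {
		free(fresh);
		return NULL;
	}
	old = slot->buf;
	slot->buf = fresh;
	return old;
}
```

Calling `rx_slot_swap(&slot, map_always_fail)` returns NULL and leaves `slot.buf` unchanged, which is the "old one can be reused" case; with `map_always_ok` the old buffer is returned and the slot holds the new one.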
Nope - tx crash again. This time the system stayed up (but hosed) for a
few hours. When I tried to recover eth0 the system then crashed.

Brief summary of events (log extract below):

System start Jan 28 19:29
Everything seemed good (load and all) until 17:13:11 the following day
when I got rx errors:

Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010 length 1518
Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010 length 1518
Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x8180010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x7f40010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x6230010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x8180010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x6230010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x7f40010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x8180010 length 1518
Jan 29 17:13:14 mail kernel: sky2 eth0: rx error, status 0x5f60010 length 1518

The system continued running normally after this until this morning (Jan
30) at 05:44:55:
Jan 30 05:44:55 mail kernel: DRHD: handling fault status reg 2
Jan 30 05:44:55 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr ffc4331ff000
Jan 30 05:44:55 mail kernel: DMAR:[fault reason 06] PTE Read access is not set
Jan 30 05:44:55 mail kernel: net_ratelimit: 2 callbacks suppressed
Jan 30 05:44:55 mail kernel: sky2 0000:06:00.0: error interrupt status=0xc0000000
Jan 30 05:44:55 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
Jan 30 05:45:01 mail kernel: ------------[ cut here ]------------
Jan 30 05:45:01 mail kernel: WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xf3/0x161()
Jan 30 05:45:01 mail kernel: Hardware name: System Product Name
Jan 30 05:45:01 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
Jan 30 05:45:01 mail kernel: Modules linked in: iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat cpufreq_stats ip6table_filter ip6table_mangle ip6_tables bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_hda_intel snd_ens1371 gameport snd_hda_codec snd_rawmidi snd_ac97_codec gspca_spca505 ac97_bus gspca_main snd_hwdep videodev snd_seq snd_seq_device v4l1_compat snd_pcm v4l2_compat_ioctl32 snd_timer snd soundcore snd_page_alloc firewire_ohci pcspkr i2c_i801 firewire_core wmi asus_atk0110 crc_itu_t sky2 hwmon iTCO_wdt iTCO_vendor_support fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cf
Jan 30 05:45:01 mail kernel: bimgblt cfbfillrect [last unloaded: nf_nat]
Jan 30 05:45:01 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.33-rc5WITHMMAPNODMARFORKTIPSKY2DMAMAP-00283-gd4d37bd-dirty #1
Jan 30 05:45:01 mail kernel: Call Trace:
Jan 30 05:45:01 mail kernel: <IRQ> [<ffffffff8104a03d>] warn_slowpath_common+0x7c/0x94
Jan 30 05:45:01 mail kernel: [<ffffffff8104a0ac>] warn_slowpath_fmt+0x41/0x43
Jan 30 05:45:01 mail kernel: [<ffffffff813d2f43>] ? netif_tx_lock+0x44/0x6c
Jan 30 05:45:01 mail kernel: [<ffffffff813d30ab>] dev_watchdog+0xf3/0x161
Jan 30 05:45:01 mail kernel: [<ffffffff8106a31f>] ? sched_clock_cpu+0x44/0xce
Jan 30 05:45:01 mail kernel: [<ffffffff8105761a>] run_timer_softirq+0x1c3/0x26b
Jan 30 05:45:01 mail kernel: [<ffffffff8105060c>] __do_softirq+0xf8/0x1cd
Jan 30 05:45:01 mail kernel: [<ffffffff8107192b>] ? tick_program_event+0x2a/0x2c
Jan 30 05:45:01 mail kernel: [<ffffffff8100ab1c>] call_softirq+0x1c/0x30
Jan 30 05:45:01 mail kernel: [<ffffffff8100c2b3>] do_softirq+0x4b/0xa3
Jan 30 05:45:01 mail kernel: [<ffffffff810501f8>] irq_exit+0x4a/0x8c
Jan 30 05:45:01 mail kernel: [<ffffffff81461859>] smp_apic_timer_interrupt+0x86/0x94
Jan 30 05:45:01 mail kernel: [<ffffffff8100a5d3>] apic_timer_interrupt+0x13/0x20
Jan 30 05:45:01 mail kernel: <EOI> [<ffffffff812afbd4>] ? acpi_idle_enter_bm+0x256/0x28a
Jan 30 05:45:01 mail kernel: [<ffffffff812afbcd>] ? acpi_idle_enter_bm+0x24f/0x28a
Jan 30 05:45:01 mail kernel: [<ffffffff8139574c>] cpuidle_idle_call+0x9e/0xfa
Jan 30 05:45:01 mail kernel: [<ffffffff81008c05>] cpu_idle+0xb4/0xf6
Jan 30 05:45:01 mail kernel: [<ffffffff81455d48>] start_secondary+0x201/0x242
Jan 30 05:45:01 mail kernel: ---[ end trace 57f7151f6a5def07 ]---
Jan 30 05:45:01 mail kernel: sky2 eth0: tx timeout
Jan 30 05:45:01 mail kernel: sky2 eth0: transmit ring 14 .. 102 report=14 done=14
Jan 30 05:45:01 mail kernel: sky2 eth0: disabling interface
Jan 30 05:45:01 mail kernel: sky2 eth0: enabling interface

This down/up continued for several hours until I intervened at about 10:05.

I saw that there was no eth0 connectivity; eth1 was ok. It appeared that
eth0 was receiving traffic but unable to send. arpwatch was reporting
bogons, DHCP showed many DISCOVER/OFFER pairs, no REQUEST/ACK. Pings to
any system failed; arp showed incomplete for anything hanging off of
eth0. arping also failed.
I manually stopped and started eth0 (ifconfig) and reset iptables
(although eth0 has no filters).

As I started looking at logs, the system hung and rebooted. I'm up now
with dma debug enabled, however as with 2.6.32.4 num_entries is dropping
and I don't think that dma debug will remain enabled long enough to
catch a crash.
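For anyone wanting to watch the counter drop over time, a small reader along these lines might help. It is only a sketch: `read_counter()` is a hypothetical helper, and the path in the usage note assumes debugfs is mounted at /sys/kernel/debug and that the counter meant above is the dma-api `num_free_entries` file.

```c
#include <stdio.h>

/*
 * Read one decimal counter from a text file.
 * Returns 0 on success and stores the value in *out,
 * or -1 if the file cannot be opened or parsed.
 */
static int read_counter(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Under those assumptions, `read_counter("/sys/kernel/debug/dma-api/num_free_entries", &n)` could be polled from a loop or a cron job and logged next to the kernel messages.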

So, as I see things, there are two issues here: 1) the TX hang post DMAR
error and 2) the inability to recover the interface and subsequent
system instability.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jarek Poplawski on 30 Jan 2010 19:40

On Sat, Jan 30, 2010 at 11:31:48AM -0500, Michael Breuer wrote:
> On 01/28/2010 06:36 PM, Stephen Hemminger wrote:
> >Please try this patch (and only this patch), on 2.6.33-rc5[*];
> >none of the other patches that did not make it upstream because that
> >confuses things too much.
> >
> >The code that checks for DMA mapping errors on receive buffers would
> >not handle errors correctly. I doubt you have these errors, but if you
> >did then it would explain the problems. The code has to be a little
> >tricky and build mapping for new rx buffer before releasing old one,
> >that way if new mapping fails, the old one can be reused.
> >
> >If it works for you, I will resubmit with signed-off.
> >
> >-
> >
> Nope - tx crash again. This time the system stayed up (but hosed)
> for a few hours. When I tried to recover eth0 the system then
> crashed.
>
> Brief summary of events (log extract below):
>
> System start Jan 28 19:29
> Everything seemed good (load and all) until 17:13:11 the following
> day when I got rx errors:
>
> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010
> length 1518
> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010
> length 1518

These are length errors, but status shows more than 1518, e.g. 2036
here, unless I miss something. Please, don't use jumbo frames in your
network until we fully debug it for regular frames (Stephen admitted
sky2 jumbo might be broken).
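The arithmetic behind that reading can be sketched as follows. The field layout is an assumption modelled on the GMR_FS_* definitions in sky2.h (frame length in bits 30..16, "too long" in bit 4); the macro and function names here are made up for the example.

```c
#include <stdint.h>

/* Assumed layout, modelled on the GMR_FS_* definitions in sky2.h. */
#define RX_STATUS_LEN_SHIFT	16
#define RX_STATUS_LEN_MASK	0x7fffu
#define RX_STATUS_TOO_LONG	(1u << 4)

/* Frame length as reported by the MAC in the receive status word. */
static unsigned int rx_status_len(uint32_t status)
{
	return (status >> RX_STATUS_LEN_SHIFT) & RX_STATUS_LEN_MASK;
}

/* Nonzero if the (assumed) "too long packet" bit is set. */
static int rx_status_too_long(uint32_t status)
{
	return (status & RX_STATUS_TOO_LONG) != 0;
}
```

With this layout, 0x7f40010 decodes to a length of 0x7f4 = 2036 with the too-long bit set; the other status values in the log (0x6230010, 0x8180010, 0x5f60010) give 1571, 2072 and 1526, all above 1518.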

....
> As I started looking at logs, the system hung and rebooted. I'm up
> now with dma debug enabled, however as with 2.6.32.4 num_entries is
> dropping and I don't think that dma debug will remain enabled long
> enough to catch a crash.

Could you try the patch below to show maybe some other users of
dma-debug entries?

Jarek P.
---

lib/dma-debug.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 7d2f0b3..e2dcc9c 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -310,6 +310,53 @@ static void hash_bucket_del(struct dma_debug_entry *entry)
 	list_del(&entry->list);
 }
 
+struct dma_debug_dev {
+	struct device *dev;
+	unsigned int cnt;
+};
+
+#define DMA_DEBUG_DEVS 100
+static struct dma_debug_dev dma_debug_devs[DMA_DEBUG_DEVS];
+
+static void debug_dma_dump_devs(void)
+{
+	int idx, i;
+
+	memset(dma_debug_devs, 0, sizeof(struct dma_debug_dev) * DMA_DEBUG_DEVS);
+
+	for (idx = 0; idx < HASH_SIZE; idx++) {
+		struct hash_bucket *bucket = &dma_entry_hash[idx];
+		struct dma_debug_entry *entry;
+		unsigned long flags;
+
+		spin_lock_irqsave(&bucket->lock, flags);
+
+		list_for_each_entry(entry, &bucket->list, list) {
+			for (i = 0; i < DMA_DEBUG_DEVS; i++) {
+				struct device *dev = dma_debug_devs[i].dev;
+
+				if (!dev || dev == entry->dev) {
+					dma_debug_devs[i].dev = entry->dev;
+					dma_debug_devs[i].cnt++;
+					break;
+				}
+			}
+		}
+
+		spin_unlock_irqrestore(&bucket->lock, flags);
+	}
+
+	for (i = 0; i < DMA_DEBUG_DEVS; i++) {
+		struct device *dev = dma_debug_devs[i].dev;
+
+		if (!dev)
+			break;
+
+		pr_info("DMA-API: %s: entries: %d\n", dev_name(dev),
+			dma_debug_devs[i].cnt);
+	}
+}
+
 /*
  * Dump mapping entries for debugging purposes
  */
@@ -363,8 +410,11 @@ static struct dma_debug_entry *__dma_entry_alloc(void)
 	memset(entry, 0, sizeof(*entry));
 
 	num_free_entries -= 1;
-	if (num_free_entries < min_free_entries)
+	if (num_free_entries < min_free_entries) {
 		min_free_entries = num_free_entries;
+		if ((min_free_entries & 0xffff) == 0)
+			debug_dma_dump_devs();
+	}
 
 	return entry;
 }

From: Michael Breuer on 30 Jan 2010 23:20

On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
> On Sat, Jan 30, 2010 at 11:31:48AM -0500, Michael Breuer wrote:
>
>> On 01/28/2010 06:36 PM, Stephen Hemminger wrote:
>>
>>> Please try this patch (and only this patch), on 2.6.33-rc5[*];
>>> none of the other patches that did not make it upstream because that
>>> confuses things too much.
>>>
>>> The code that checks for DMA mapping errors on receive buffers would
>>> not handle errors correctly. I doubt you have these errors, but if you
>>> did then it would explain the problems. The code has to be a little
>>> tricky and build mapping for new rx buffer before releasing old one,
>>> that way if new mapping fails, the old one can be reused.
>>>
>>> If it works for you, I will resubmit with signed-off.
>>>
>>> -
>>>
>>>
>> Nope - tx crash again. This time the system stayed up (but hosed)
>> for a few hours. When I tried to recover eth0 the system then
>> crashed.
>>
>> Brief summary of events (log extract below):
>>
>> System start Jan 28 19:29
>> Everything seemed good (load and all) until 17:13:11 the following
>> day when I got rx errors:
>>
>> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010
>> length 1518
>> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010
>> length 1518
>>
> These are length errors, but status shows more than 1518, e.g. 2036
> here, unless I miss something. Please, don't use jumbo frames in your
> network until we fully debug it for regular frames (Stephen admitted
> sky2 jumbo might be broken).
>
MTU was 1500 - not using jumbo frames as they don't work.
> ...
>
>> As I started looking at logs, the system hung and rebooted. I'm up
>> now with dma debug enabled, however as with 2.6.32.4 num_entries is
>> dropping and I don't think that dma debug will remain enabled long
>> enough to catch a crash.
>>
> Could you try the patch below to show maybe some other users of
> dma-debug entries?
>
> Jarek P.
> ---
>
Will do. Note that I'm running with the dma debug filter set to sky2.
