From: Jarek Poplawski on 7 Jan 2010 15:30

On Thu, Jan 07, 2010 at 02:55:22PM -0500, Michael Breuer wrote:
> On 1/7/2010 2:36 PM, Jarek Poplawski wrote:
> >On Thu, Jan 07, 2010 at 07:50:40PM +0100, Jarek Poplawski wrote:
> >>>Going to rerun with these patches and with and without MMAP. Will
> >>>also retry both with jumbo frames if possible.
> >>If MMAP then some "alternative" too. But first no MMAP.
> >Another things IMHO worth to try: a sky2 module parameter
> >"disable_msi=1", and CONFIG_DMAR off.
> >
> >Jarek P.
> Ok - that'd be with or without MMAP enabled? (note that so-far,
> without MMAP I'm not seeing any errors - throughput is running about
> half what I was seeing with MMAP enabled (before crashing that is).
> CPU is also way busier (also to be expected). One other observation
> - I had been seeing lots of DNS errors - IPV6 related format errors
> I really didn't think much of it as they were mostly .ru and seemed
> spam-related, but now I don't see any. Haven't updated bind; doubt
> the world has changed - so perhaps this is related to the network
> issue.

MMAP enabled (with some "alternative" patch - to avoid known bugs)
should give as earlier the answer if these changes matter. But first
let's try longer (if possible) if this "no MMAP" could really heal
your hardware.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
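
When comparing builds like the ones discussed above, it can help to confirm which options a given kernel was actually built with. The following is a minimal Python sketch, not something from the thread: it assumes the kernel config is available as plain text (for example from a saved .config file), the helper name config_state is hypothetical, and the sample text is illustrative rather than taken from the reporter's machine. Only CONFIG_PACKET_MMAP and CONFIG_DMAR come from the discussion.

```python
import re

def config_state(config_text, options):
    """Return {option: value} for each requested option.

    Options that are commented out ("# CONFIG_X is not set") or absent
    are reported as "n". The function name is hypothetical.
    """
    state = {opt: "n" for opt in options}
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match and match.group(1) in state:
            state[match.group(1)] = match.group(2)
    return state

# Illustrative sample only; not the reporter's real config.
sample = """\
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_DMAR=y
"""

result = config_state(sample, ["CONFIG_PACKET_MMAP", "CONFIG_DMAR"])
print(result)
```

Recording this next to each test run makes it harder to mix up which build produced which result.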
From: Michael Breuer on 7 Jan 2010 18:20

On 1/7/2010 1:50 PM, Jarek Poplawski wrote:
> On Thu, Jan 07, 2010 at 01:43:08PM -0500, Michael Breuer wrote:
>
>> On 1/7/2010 1:35 PM, Jarek Poplawski wrote:
>>
>>> On Thu, Jan 07, 2010 at 01:19:41PM -0500, Michael Breuer wrote:
>>>
>>>> On 1/7/2010 1:01 PM, Jarek Poplawski wrote:
>>>>
>>>>> On Thu, Jan 07, 2010 at 10:05:37AM -0500, Michael Breuer wrote:
>>>>>
>>>>>> Bad news - crashed about an hour after I wrote this email - under
>>>>>> load - same crash as before. Network watchdog... lots of attempts to
>>>>>> reset the adapter... then hw watchdog rebooted the system.
>>>>>>
>>>>> It's a pity. Anyway, I'd be still interested in CONFIG_PACKET_MMAP off
>>>>> if you find time.
>>>>>
>>>>> Jarek P.
>>>>>
>>>> Ok - any particular patch set to try with? I'm going to start with a
>>>> clean tree using the latest 2.6.32 from git (tried 2.6.33-rc3, but
>>>> can't get a usable console... will look at that later.)
>>>>
>>> My "Berck E. Nash" and Stephen's "pskb_may_pull" sky2 patches. (BTW,
>>> could you remind if it worked any better with 2.6.31 or earlier?)
>>>
>>> Jarek P.
>>>
>> I'm not sure my crash-and-burn runs yesterday included the
>> pskb_may_pull patch :(
>>
>> Going to rerun with these patches and with and without MMAP. Will
>> also retry both with jumbo frames if possible.
>>
> If MMAP then some "alternative" too. But first no MMAP.
>
> Jarek P.
>

Results:

* no MMAP, mtu=1500, neither alternative patch loaded: adapter crashed:

Jan 7 15:44:23 mail kernel: DRHD: handling fault status reg 2
Jan 7 15:44:23 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr fffb7bffe000
Jan 7 15:44:23 mail kernel: DMAR:[fault reason 06] PTE Read access is not set
Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000
Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0] lib/util_sock.c:539(read_fd_with_timeout)
Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0] lib/util_sock.c:1491(get_peer_addr_internal)
Jan 7 15:44:24 mail smbd[6572]: getpeername failed. Error was Transport endpoint is not connected
Jan 7 15:44:24 mail smbd[6572]: read_fd_with_timeout: client 0.0.0.0 read error = Connection timed out.
Jan 7 15:44:44 mail kernel: ------------[ cut here ]------------
Jan 7 15:44:44 mail kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164()
Jan 7 15:44:44 mail kernel: Hardware name: System Product Name
Jan 7 15:44:44 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
Jan 7 15:44:44 mail kernel: Modules linked in: ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device snd_pcm gspca_spca505 gspca_main firewire_ohci videodev v4l1_compat firewire_core pcspkr v4l2_compat_ioctl32 snd_timer iTCO_wdt i2c_i801 crc_itu_t iTCO_vendor_support snd soundcore snd_page_alloc sky2 wmi asus_atk0110 hwmon fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil
Jan 7 15:44:44 mail kernel: lrect [last unloaded: microcode]
Jan 7 15:44:44 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32NOMMAP-00847-g50ebb93-dirty #4
Jan 7 15:44:44 mail kernel: Call Trace:
Jan 7 15:44:44 mail kernel: <IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94
Jan 7 15:44:44 mail kernel: [<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43
Jan 7 15:44:44 mail kernel: [<ffffffff813e2dcf>] ? netif_tx_lock+0x44/0x6c
Jan 7 15:44:44 mail kernel: [<ffffffff813e2f37>] dev_watchdog+0xf3/0x164
Jan 7 15:44:44 mail kernel: [<ffffffff8106e8a4>] ? __queue_work+0x3a/0x42
Jan 7 15:44:44 mail kernel: [<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270
Jan 7 15:44:44 mail kernel: [<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd
Jan 7 15:44:44 mail kernel: [<ffffffff8107ef33>] ? tick_program_event+0x2a/0x2c
Jan 7 15:44:44 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30
Jan 7 15:44:44 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6
Jan 7 15:44:44 mail kernel: [<ffffffff8105aa1b>] irq_exit+0x4a/0x8c
Jan 7 15:44:44 mail kernel: [<ffffffff8146e3e2>] smp_apic_timer_interrupt+0x86/0x94
Jan 7 15:44:44 mail kernel: [<ffffffff810127e3>] apic_timer_interrupt+0x13/0x20
Jan 7 15:44:44 mail kernel: <EOI> [<ffffffff812c678a>] ? acpi_idle_enter_bm+0x256/0x28a
Jan 7 15:44:44 mail kernel: [<ffffffff812c6783>] ? acpi_idle_enter_bm+0x24f/0x28a
Jan 7 15:44:44 mail kernel: [<ffffffff813a5ec8>] ? cpuidle_idle_call+0x9e/0xfa
Jan 7 15:44:44 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6
Jan 7 15:44:44 mail kernel: [<ffffffff814639c2>] ? start_secondary+0x201/0x242
Jan 7 15:44:44 mail kernel: ---[ end trace 57f7151f6a5def07 ]---
Jan 7 15:44:44 mail kernel: sky2 eth0: tx timeout
Jan 7 15:44:44 mail kernel: sky2 eth0: transmit ring 77 .. 36 report=77 done=77
Jan 7 15:44:44 mail kernel: sky2 eth0: disabling interface
Jan 7 15:44:44 mail kernel: sky2 eth0: enabling interface

--- adapter dead after this --- rebooted.

* no MMAP; alternative 1 patch, mtu=1500; no errors; sustained transfer
rates about 25% lower than what I saw with mmap enabled...(before MMAP
enabled crashed).

* no MMAP mtu=9000; ran ok at low transfer rates - when high rates
kicked in, got the sky2 interrupt error & things went south:

Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008

After this, remote connections broke and I rebooted... decided to rerun
w/o MMAP again before going back to MMAP and trying those other sky2
options...

* Retest of no MMAP + Alternative 1 - just to confirm consistency.
Worked - no errors. Only version so far that allows the win7 backup to
complete.

* MMAP + NO DMAR + disable_msi=1... also works w/o errors... leaving
this one running for a while - also completed a backup successfully.
Fastest of the lot... about 3x faster than any other version, working or
not.

I'm leaving this one running for now. Not retesting jumbo for now. Be
happy to help dig further.

Tentative recommendations:

1) The af alternative patch seems rather necessary. First alternative
seems to be working, I'd suggest that be submitted and backported to
2.6.32.
2) Steven's pskb_may_pull patch also ought to be included and backported.
3) Jumbo frame support for yukon2 should probably be disabled until/if
fixed.
4) When possible I'll test dmar and disable_msi, and no dmar and no
disable_msi. When I first hit issues, I was running without DMAR, but
also without the above patches. I suppose the non-working permutations
need to be either fixed or invalidated (or well documented).
5) It would be nice if someone with comparable hardware could reproduce
these issues. FWIW, I can only recreate the crash running windows backup
to a cifs share. Copying large files doesn't seem to do it. Could also
be some other interaction going on here that perhaps others aren't
running - would be happy to compare notes.

Notes:
This *could* be coincidental, but maybe not...
With MMAP+NO DMAR + disable_msi there are far fewer ... actually almost
no... bind error reports... and no bind format error messages. With
NOMMAP and alternative one there are a few more bind error messages and
one format error message during the several hours that version was up.
All other configurations going back perhaps for two weeks have
significantly more bind error reports - and all versions show increasing
frequency of bind format errors (IPV6 only) in the roughly 10-15 minutes
preceding the lockup/crash/interrupt error messages. There are none
immediately preceding any crash, but perhaps there is some correlation
between the network errors and bind ipv6 packets.
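
For anyone trying to reproduce this on comparable hardware, the log excerpts above suggest three signatures worth counting per test run: DMAR faults, sky2 error interrupts (by status word), and netdev watchdog TX timeouts. The sketch below is only an illustration in Python, assuming syslog lines in the same format as quoted in this message; the names PATTERNS and classify are hypothetical, and the sample lines are copied from the report.

```python
import re
from collections import Counter

# Regexes modelled on the log lines quoted in the report above.
PATTERNS = {
    "dmar_fault": re.compile(r"DMAR:\[fault reason (\d+)\]"),
    "sky2_error_irq": re.compile(
        r"sky2 \S+: error interrupt status=(0x[0-9a-fA-F]+)"
    ),
    "tx_watchdog": re.compile(
        r"NETDEV WATCHDOG: (\S+) \(\w+\): transmit queue \d+ timed out"
    ),
}

def classify(lines):
    """Count (event kind, detail) pairs found in the given log lines."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
    return counts

sample = [
    "Jan 7 15:44:23 mail kernel: DMAR:[fault reason 06] PTE Read access is not set",
    "Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: error interrupt status=0x80000000",
    "Jan 7 15:44:44 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out",
    "Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008",
    "Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008",
]

counts = classify(sample)
for (kind, detail), n in sorted(counts.items()):
    print(kind, detail, n)
```

Keeping the status word as part of the key separates the 0x80000000 case (mtu=1500, with the DMAR fault) from the 0x40000008 case seen with mtu=9000.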
From: Jarek Poplawski on 8 Jan 2010 02:50

On Thu, Jan 07, 2010 at 06:11:34PM -0500, Michael Breuer wrote:
> Results:
> * no MMAP, mtu=1500, neither alternative patch loaded: adapter crashed:
> Jan 7 15:44:23 mail kernel: DRHD: handling fault status reg 2
> Jan 7 15:44:23 mail kernel: DMAR:[DMA Read] Request device [06:00.0]
> fault addr fffb7bffe000
> Jan 7 15:44:23 mail kernel: DMAR:[fault reason 06] PTE Read access is
> not set
> Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: error interrupt
> status=0x80000000
> Jan 7 15:44:23 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
> Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0]
> lib/util_sock.c:539(read_fd_with_timeout)
> Jan 7 15:44:24 mail smbd[6572]: [2010/01/07 15:44:24, 0]
> lib/util_sock.c:1491(get_peer_addr_internal)
> Jan 7 15:44:24 mail smbd[6572]: getpeername failed. Error was
> Transport endpoint is not connected
> Jan 7 15:44:24 mail smbd[6572]: read_fd_with_timeout: client 0.0.0.0
> read error = Connection timed out.
> Jan 7 15:44:44 mail kernel: ------------[ cut here ]------------
> Jan 7 15:44:44 mail kernel: WARNING: at net/sched/sch_generic.c:261
> dev_watchdog+0xf3/0x164()
> Jan 7 15:44:44 mail kernel: Hardware name: System Product Name
> Jan 7 15:44:44 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit
> queue 0 timed out
> Jan 7 15:44:44 mail kernel: Modules linked in: ip6table_filter
> ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat nf_nat
> iptable_mangle iptable_raw bridge stp appletalk psnap llc nfsd lockd
> nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit
> tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp
> xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm
> snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec
> snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device
> snd_pcm gspca_spca505 gspca_main firewire_ohci videodev v4l1_compat
> firewire_core pcspkr v4l2_compat_ioctl32 snd_timer iTCO_wdt i2c_i801
> crc_itu_t iTCO_vendor_support snd soundcore snd_page_alloc sky2 wmi
> asus_atk0110 hwmon fbcon tileblit font bitblit softcursor raid456
> async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx
> raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm
> agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfil
> Jan 7 15:44:44 mail kernel: lrect [last unloaded: microcode]
> Jan 7 15:44:44 mail kernel: Pid: 0, comm: swapper Tainted: G W

BTW, was there any other oops saved before this one?

....

> --- adapter dead after this --- rebooted.
> * no MMAP; alternative 1 patch, mtu=1500; no errors; sustained transfer
> rates about 25% lower than what I saw with mmap enabled...(before MMAP
> enabled crashed).

?? Read below...

> * no MMAP mtu=9000; ran ok at low transfer rates - when high rates
> kicked in, got the sky2 interrupt error & things went south:
> Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt
> status=0x40000008
> Jan 7 15:09:28 mail kernel: sky2 0000:06:00.0: error interrupt
> status=0x40000008
> After this, remote connections broke and I rebooted... decided to rerun
> w/o MMAP again before going back to MMAP and trying those other sky2
> options...
> * Retest of no MMAP + Alternative 1 - just to confirm consistency.
> Worked - no errors. Only version so far that allows the win7 backup to
> complete.

???
Hmm... Alternative 1 or 2 doesn't even compile into when no MMAP, so it
definitely needs re-retesting ;-)

> * MMAP + NO DMAR + disable_msi=1... also works w/o errors... leaving
> this one running for a while - also completed a backup successfully.
> Fastest of the lot... about 3x faster than any other version, working or
> not.

Very interesting. It would be nice to give it a really long try, and
if still true, try MMAP + NO DMAR only.

>
> I'm leaving this one running for now. Not retesting jumbo for now. Be
> happy to help dig further.
>
> Tentative recommendations:
>
> 1) The af alternative patch seems rather necessary. First alternative
> seems to be working, I'd suggest that be submitted and backported to
> 2.6.32.
> 2) Steven's pskb_may_pull patch also ought to be included and backported.
> 3) Jumbo frame support for yukon2 should probably be disabled until/if
> fixed.
> 4) When possible I'll test dmar and disable_msi, and no dmar and no
> disable_msi. When I first hit issues, I was running without DMAR, but
> also without the above patches. I suppose the non-working permutations
> need to be either fixed or invalidated (or well documented).
> 5) It would be nice if someone with comparable hardware could reproduce
> these issues. FWIW, I can only recreate the crash running windows backup
> to a cifs share. Copying large files doesn't seem to do it. Could also
> be some other interaction going on here that perhaps others aren't
> running - would be happy to compare notes.
>
> Notes:
> This *could* be coincidental, but maybe not...
> With MMAP+NO DMAR + disable_msi there are far fewer ... actually almost
> no... bind error reports... and no bind format error messages. With
> NOMMAP and alternative one there are a few more bind error messages and
> one format error message during the several hours that version was up.
> All other configurations going back perhaps for two weeks have
> significantly more bind error reports - and all versions show increasing
> frequency of bind format errors (IPV6 only) in the roughly 10-15 minutes
> preceding the lockup/crash/interrupt error messages. There are none
> immediately preceding any crash, but perhaps there is some correlation
> between the network errors and bind ipv6 packets.

OK, for now let's make sure this MMAP + NO DMAR + disable_msi is really
really working.

Thanks,
Jarek P.
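
Since the discussion keeps returning to which combinations of MMAP, DMAR and disable_msi still need a run, one way to keep track is to enumerate them mechanically. This is a hedged Python sketch, not part of the thread: the names OPTIONS and permutations_to_test are hypothetical, and the only combination marked as tested is the one explicitly reported as working above (MMAP + no DMAR + disable_msi=1); the DMAR and disable_msi settings of the other runs are not spelled out, so they are left in the remaining set.

```python
from itertools import product

# The three on/off knobs discussed in the thread.
OPTIONS = ("mmap", "dmar", "disable_msi")

def permutations_to_test(already_tested):
    """Yield every on/off combination that is not in already_tested."""
    for values in product((True, False), repeat=len(OPTIONS)):
        combo = dict(zip(OPTIONS, values))
        if combo not in already_tested:
            yield combo

# Only the combination explicitly reported as working is listed here.
tested = [{"mmap": True, "dmar": False, "disable_msi": True}]

remaining = list(permutations_to_test(tested))
for combo in remaining:
    print(combo)
```

The next step suggested above, MMAP + no DMAR without disable_msi, shows up in the remaining list as mmap=True, dmar=False, disable_msi=False.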
From: Michael Breuer on 8 Jan 2010 16:50

On 1/8/2010 4:29 PM, Jarek Poplawski wrote:
> On Fri, Jan 08, 2010 at 11:40:25AM -0500, Michael Breuer wrote:
>
>> On 1/8/2010 2:45 AM, Jarek Poplawski wrote:
>>
>> ...
> Berck Nash reported oopses during sky2 TX timeout recovery, which are
> generally hardware/driver problems, and shouldn't be triggered by ip
> level bugs, so it should be queried as a separate bug report.
>
>
My thought was that his crash was secondary to the netdev watchdog &
subsequent reset that I saw.
>> Will try rerunning without disable_msi later (after I catch the dns
>> thing in the sniffer).
>>
>>>> I'm leaving this one running for now. Not retesting jumbo for now. Be
>>>> happy to help dig further.
>>>>
>>>> Tentative recommendations:
>>>>
>>>> 1) The af alternative patch seems rather necessary. First alternative
>>>> seems to be working, I'd suggest that be submitted and backported to
>>>> 2.6.32.
>>>>
> BTW, don't hurry with that yet, but in the next test, please try
> alternative 2 again (i.e. with MMAP + no DMAR + disable_msi).
>
>
Will do - still up from yesterday... no more dropped packets... none of
the dns errors either. To be expected I suppose as long as I'm trying to
sniff it. Assuming no immediate errors with alt2, no DMAR + disable_msi
I'll report back after it's been up for a while.
From: Jarek Poplawski on 8 Jan 2010 17:10

On Fri, Jan 08, 2010 at 04:48:11PM -0500, Michael Breuer wrote:
> On 1/8/2010 4:29 PM, Jarek Poplawski wrote:
> >On Fri, Jan 08, 2010 at 11:40:25AM -0500, Michael Breuer wrote:
> >>On 1/8/2010 2:45 AM, Jarek Poplawski wrote:
> >>...
> >Berck Nash reported oopses during sky2 TX timeout recovery, which are
> >generally hardware/driver problems, and shouldn't be triggered by ip
> >level bugs, so it should be queried as a separate bug report.
> >
> My thought was that his crash was secondary to the netdev watchdog &
> subsequent reset that I saw.

Yes, the netdev watchdog which triggers on TX timeout.

Jarek P.