From: Jarek Poplawski on 22 Jan 2010 18:50
On Fri, Jan 22, 2010 at 06:25:12PM -0500, Michael Breuer wrote:
> On 1/22/2010 6:06 PM, Jarek Poplawski wrote:
> >On Fri, Jan 22, 2010 at 05:14:58PM -0500, Michael Breuer wrote:
> >>Not sure I can do that. Note that based on the log messages, there
> >>were no errors/dropped packets involving dhcp. Moving the dhcp
> >>server off of the affected machine is not trivial. The dhcp
> >>correlation is based on logged messages preceding each crash. I
> >>cannot confirm that they're related, however it's really suspicious.
> >>If it helps, HP replaced my unmanaged switch with a managed one so I
> >>can see whether there were any switch events logged the next time I
> >>have a crash.
> >>
> >>At this point, it seems the following is required to trigger the crash:
> >>1) Uptime of 24-36 hours
> >>2) High RX load on server (cifs traffic is what I've triggered it with).
> >>3) Normal DHCP traffic.
> >Do you mean you got these crashes with the new switch too, and this
> >switch doesn't drop DHCP at all? (Otherwise, let's try this switch
> >first.)
> >
> >Jarek P.
> Nope - just got the new switch. Crash was old switch. That said, I
> don't think (based on the log messages) that the dhcpoffer packet
> drop was happening prior to the crash. I also can't fathom why a
> DHCPOFFER packet dropped after leaving the server would have any
> bearing on the issue.

You wrote earlier:
> [...] Also, there is always a dhcp exchange of some sort
> preceding the event.

So, I'm not sure there was "3) Normal DHCP traffic." if the switch
could drop DHCP packets in some buggy conditions. Anyway, let's try
the new one with really "3) Normal DHCP traffic.", I hope.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Breuer on 22 Jan 2010 19:00
On 1/22/2010 6:46 PM, Jarek Poplawski wrote:
> On Fri, Jan 22, 2010 at 06:25:12PM -0500, Michael Breuer wrote:
>
> You wrote earlier:
>> [...] Also, there is always a dhcp exchange of some sort
>> preceding the event.
>
> So, I'm not sure there was "3) Normal DHCP traffic." if the switch
> could drop DHCP packets in some buggy conditions. Anyway, let's try
> the new one with really "3) Normal DHCP traffic.", I hope.
>
> Jarek P.

When the packets were dropped, there was a different sequence in the
log - DISCOVER/OFFER repeated. The "normal" is that the sequence
appeared correct and complete - DISCOVER/OFFER/REQUEST/ACK - or
INFORM/ACK (vs. INFORM repeatedly sans ACK) as the case may be.
From: Jarek Poplawski on 23 Jan 2010 18:30
On Fri, Jan 22, 2010 at 06:50:21PM -0500, Michael Breuer wrote:
> When the packets were dropped, there was a different sequence in the
> log - DISCOVER/OFFER repeated. The "normal" is that the sequence
> appeared correct and complete - DISCOVER/OFFER/REQUEST/ACK - or
> INFORM/ACK (vs. INFORM repeatedly sans ACK) as the case may be.

Anyway, I'd be interested if the switch matters here.

Plus one more test: could you try to load sky2 with the parameter:
"copybreak=1" (the rest as in any recent test, which gave you dmar
errors; any switch).

Thanks,
Jarek P.
From: Michael Breuer on 23 Jan 2010 21:00
On 1/23/2010 6:21 PM, Jarek Poplawski wrote:
> On Fri, Jan 22, 2010 at 06:50:21PM -0500, Michael Breuer wrote:
>
>> When the packets were dropped, there was a different sequence in the
>> log - DISCOVER/OFFER repeated. The "normal" is that the sequence
>> appeared correct and complete - DISCOVER/OFFER/REQUEST/ACK - or
>> INFORM/ACK (vs. INFORM repeatedly sans ACK) as the case may be.
>>
> Anyway, I'd be interested if the switch matters here.
>
> Plus one more test: could you try to load sky2 with the parameter:
> "copybreak=1" (the rest as in any recent test, which gave you dmar
> errors; any switch).
>
> Thanks,
> Jarek P.

Ok - will try with and without the copybreak=1 after I'm done
bisecting 2.6.33 rc5 for something unrelated (it'll take a couple of
days for each unless a crash occurs in less than 48 hours).
From: Michael Breuer on 27 Jan 2010 10:40
On 01/23/2010 06:21 PM, Jarek Poplawski wrote:
> On Fri, Jan 22, 2010 at 06:50:21PM -0500, Michael Breuer wrote:
>
>> When the packets were dropped, there was a different sequence in the
>> log - DISCOVER/OFFER repeated. The "normal" is that the sequence
>> appeared correct and complete - DISCOVER/OFFER/REQUEST/ACK - or
>> INFORM/ACK (vs. INFORM repeatedly sans ACK) as the case may be.
>>
> Anyway, I'd be interested if the switch matters here.
>
> Plus one more test: could you try to load sky2 with the parameter:
> "copybreak=1" (the rest as in any recent test, which gave you dmar
> errors; any switch).
>
> Thanks,
> Jarek P.

Ok - now up 80+ hours with copybreak=1. I'm going to redo w/o
copybreak to confirm that I haven't inadvertently fixed something.

However, given that it might be copybreak-related, I looked at sky2.c
again and I'm wondering about the copybreak max size in sky2_rx_start:

	size = roundup(sky2->netdev->mtu + ETH_HLEN + VLAN_HLEN, 8);

	/* Stopping point for hardware truncation */
	thresh = (size - 8) / sizeof(u32);

	sky2->rx_nfrags = size >> PAGE_SHIFT;
	BUG_ON(sky2->rx_nfrags > ARRAY_SIZE(re->frag_addr));

	/* Compute residue after pages */
	size -= sky2->rx_nfrags << PAGE_SHIFT;

	/* Optimize to handle small packets and headers */
	if (size < copybreak)
		size = copybreak;
	if (size < ETH_HLEN)
		size = ETH_HLEN;

Why would increasing size to copybreak be valid here? Guessing a bit
as I'm not sure about rx_nfrags, but if I read this correctly, if size
is ever less than copybreak it's because there isn't enough space left
for anything larger. If so, wouldn't increasing size potentially
corrupt something? I'd further guess that the resulting condition
manifests sooner (or at least with a more visible effect) when using
DMAR.

In any event, why "copybreak" as the minimum buffer size? I'd suggest
that if it isn't possible to allocate at least MTU + overhead that
sky2_rx_start ought to be delayed until there is room.
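
For anyone following the arithmetic in the fragment quoted above, here is a
standalone userspace sketch of the same computation. It is not the driver
code: ETH_HLEN = 14, VLAN_HLEN = 4 and PAGE_SHIFT = 12 (4 KiB pages) are
assumed values, and struct rx_layout, roundup8() and rx_layout_for() are
hypothetical names made up for this illustration. It only reproduces the
numbers and says nothing about how the driver uses them afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the driver gets these from kernel headers. */
#define ETH_HLEN   14u  /* Ethernet header length */
#define VLAN_HLEN  4u   /* 802.1Q tag length */
#define PAGE_SHIFT 12u  /* assumes 4 KiB pages */

/* Hypothetical container for the three values the quoted fragment computes. */
struct rx_layout {
	unsigned int thresh; /* truncation threshold, in 32-bit words */
	unsigned int nfrags; /* number of whole pages (size >> PAGE_SHIFT) */
	unsigned int size;   /* residue after pages, after both clamps */
};

/* Round x up to the next multiple of 8, like roundup(x, 8). */
static unsigned int roundup8(unsigned int x)
{
	return (x + 7u) / 8u * 8u;
}

/* Mirror the arithmetic of the quoted fragment for a given MTU and copybreak. */
static struct rx_layout rx_layout_for(unsigned int mtu, unsigned int copybreak)
{
	struct rx_layout l;
	unsigned int size = roundup8(mtu + ETH_HLEN + VLAN_HLEN);

	l.thresh = (size - 8u) / (unsigned int)sizeof(uint32_t);
	l.nfrags = size >> PAGE_SHIFT;

	/* Residue after whole pages */
	size -= l.nfrags << PAGE_SHIFT;

	/* The two clamps from the fragment */
	if (size < copybreak)
		size = copybreak;
	if (size < ETH_HLEN)
		size = ETH_HLEN;

	l.size = size;
	return l;
}
```

Under these assumptions, an MTU of 1500 gives zero pages and a residue of
1520, so neither clamp fires whether copybreak is 1 or 128 (128 is just an
example value here). The copybreak clamp only changes the result when the
residue after whole pages is small: an MTU of 8180 leaves 8 bytes after two
pages, which becomes 128 with copybreak=128 and 14 (ETH_HLEN) with
copybreak=1.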