Prev: linux-next: Tree for November 27
Next: [PATCH 2/3] sched: make tunable scaling style configurable
From: Alan Stern on 1 Dec 2009 10:20 On Tue, 1 Dec 2009, Ondrej Zary wrote: > On Monday 30 November 2009, Alan Stern wrote: > > On Mon, 30 Nov 2009, Ondrej Zary wrote: > > > It does not make much sense to me but I think that it crashes iside this > > > list manipulation: > > > > > > prev = ehci->async; > > > while (prev->qh_next.qh != qh) > > > prev = prev->qh_next.qh; > > > > Yes, it's crashing in the "while" test because prev is NULL. This > > means the code is looking for qh in the async list but not finding it. > > That's supposed to be impossible. > > > > The assembly code is peculiar because it includes stuff that isn't in > > the source code! For example, right at this point (after the end of > > the loop) there's a test to see whether prev is NULL. Where could that > > have come from? Do you have any idea? > > I'm not sure, I might did something wrong and left it there from my previous > debugging attempt. Okay. But it was part of the reason you had difficulty matching up the ehci-hcd.s file with the crash code dump. > > > prev->hw_next = qh->hw_next; > > > prev->qh_next = qh->qh_next; > > > wmb (); > > > > These lines aren't reached. > > > > Does this happen every time you disconnect the Nexio? > > The crash happens almost always when disconnecting the touchscreen. > When booted without X, it often survives the first disconnect. Then it will be easy to recreate the problem. :-) > > You can try patching that loop. If prev is NULL then print an error > > message in the log, including the value of qh and the value of > > ehci->async, and jump past the following three statements. > > > > With that change the system shouldn't crash, although khubd might hang. > > But we still need to find out how this could have happened. Try > > collecting a usbmon trace while running the test; then let's compare > > the usbmon output with the error messages in the log. > > gcc version is: gcc (Debian 4.3.4-6) 4.3.4 > > Tried something like that before but it did not help at all. > The check is not triggered and it still oopses. Now it looks like this: > > qh->qh_state = QH_STATE_UNLINK; > ehci->reclaim = qh = qh_get (qh); > > prev = ehci->async; > if (!prev) { > printk("prev is NULL, qh=%p, ehci->async=%p\n", qh, ehci->async); > goto after; > } [*] > while (prev->qh_next.qh != qh) { > if (!prev) { > printk("prev is NULL, qh=%p, ehci->async=%p\n", qh, ehci->async); > goto after; > } > prev = prev->qh_next.qh; > } > > prev->hw_next = qh->hw_next; > prev->qh_next = qh->qh_next; > wmb (); > > after: No, that is wrong. The [*] line can still perform an invalid dereference. You need to move the inside test to the end of the loop, after the assignment statement. Or else do this: prev = ehci->async; for (;;) { if (!prev) { printk(KERN_ERR "prev is NULL, qh=%p, " "ehci->async=%p\n", qh, ehci->async); goto after; } if (prev->qh_next.qh == qh) break; prev = prev->qh_next.qh; } Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Ondrej Zary on 2 Dec 2009 04:00 On Tuesday 01 December 2009, Alan Stern wrote: > On Tue, 1 Dec 2009, Ondrej Zary wrote: > > On Monday 30 November 2009, Alan Stern wrote: > > > On Mon, 30 Nov 2009, Ondrej Zary wrote: > > > > It does not make much sense to me but I think that it crashes iside > > > > this list manipulation: > > > > > > > > prev = ehci->async; > > > > while (prev->qh_next.qh != qh) > > > > prev = prev->qh_next.qh; > > > > > > Yes, it's crashing in the "while" test because prev is NULL. This > > > means the code is looking for qh in the async list but not finding it. > > > That's supposed to be impossible. > > > > > > The assembly code is peculiar because it includes stuff that isn't in > > > the source code! For example, right at this point (after the end of > > > the loop) there's a test to see whether prev is NULL. Where could that > > > have come from? Do you have any idea? > > > > I'm not sure, I might did something wrong and left it there from my > > previous debugging attempt. > > Okay. But it was part of the reason you had difficulty matching up the > ehci-hcd.s file with the crash code dump. > > > > > prev->hw_next = qh->hw_next; > > > > prev->qh_next = qh->qh_next; > > > > wmb (); > > > > > > These lines aren't reached. > > > > > > Does this happen every time you disconnect the Nexio? > > > > The crash happens almost always when disconnecting the touchscreen. > > When booted without X, it often survives the first disconnect. > > Then it will be easy to recreate the problem. :-) Yes, seems so. But it's not that easy (see below). > > > You can try patching that loop. If prev is NULL then print an error > > > message in the log, including the value of qh and the value of > > > ehci->async, and jump past the following three statements. > > > > > > With that change the system shouldn't crash, although khubd might hang. > > > But we still need to find out how this could have happened. Try > > > collecting a usbmon trace while running the test; then let's compare > > > the usbmon output with the error messages in the log. > > > > gcc version is: gcc (Debian 4.3.4-6) 4.3.4 > > > > Tried something like that before but it did not help at all. > > The check is not triggered and it still oopses. Now it looks like this: > > > > qh->qh_state = QH_STATE_UNLINK; > > ehci->reclaim = qh = qh_get (qh); > > > > prev = ehci->async; > > if (!prev) { > > printk("prev is NULL, qh=%p, ehci->async=%p\n", qh, > > ehci->async); goto after; > > } > > [*] > while (prev->qh_next.qh != qh) { > > > if (!prev) { > > printk("prev is NULL, qh=%p, ehci->async=%p\n", > > qh, ehci->async); goto after; > > } > > prev = prev->qh_next.qh; > > } > > > > prev->hw_next = qh->hw_next; > > prev->qh_next = qh->qh_next; > > wmb (); > > > > after: > > No, that is wrong. The [*] line can still perform an invalid > dereference. You need to move the inside test to the end of the loop, > after the assignment statement. Or else do this: Oops, such a stupid mistake. Fixed now: prev = ehci->async; if (!prev) { printk("prev is NULL, qh=%p, ehci->async=%p\n", qh, ehci->async); goto after; } while (prev->qh_next.qh != qh) { prev = prev->qh_next.qh; if (!prev) { printk("prev is NULL, qh=%p, ehci->async=%p\n", qh, ehci->async); goto after; } } prev->hw_next = qh->hw_next; prev->qh_next = qh->qh_next; wmb (); after: It shows "prev is NULL, qh=f6581080, ehci->async=f6581000". The problem is that activating usbmon causes the problem to disappear. No errors in maybe 15 attempts. It appeared on 2nd attempt after unloading usbmon. -- Ondrej Zary -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Oliver Neukum on 2 Dec 2009 04:50 Am Mittwoch, 2. Dezember 2009 09:52:08 schrieb Ondrej Zary: > It shows "prev is NULL, qh=f6581080, ehci->async=f6581000". > > The problem is that activating usbmon causes the problem to disappear. No > errors in maybe 15 attempts. It appeared on 2nd attempt after unloading > usbmon. That suggests a use after free bug, which usbmon prevents holding an additional reference. Are you using SLAB_DEBUG? Regards Oliver -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Alan Stern on 2 Dec 2009 11:00 On Wed, 2 Dec 2009, Ondrej Zary wrote: > It shows "prev is NULL, qh=f6581080, ehci->async=f6581000". > > The problem is that activating usbmon causes the problem to disappear. No errors > in maybe 15 attempts. It appeared on 2nd attempt after unloading usbmon. Nevertheless, the usbmon output might be useful. Try going through a couple of cycles with usbmon installed, then unload usbmon and create a crash. Also, try adding some more debugging output (and let's hope it doesn't also make the problem disappear). In start_unlink_async(), just before your "after:" label, add ehci_info(ehci, "unlink qh %p %p\n", qh, qh->qh_next); In qh_link_async(), just after the wmb(), add ehci_info(ehci, "link qh %p %p\n", qh, qh->qh_next); In end_unlink_async(), just after the iaa_watchdog_done(ehci), add ehci_info(ehci, "end unlink qh %p %p\n", qh, qh->next); And in qh_make(), just before the end, add ehci_info(ehci, "create qh %p, dev %s, ep %x\n", qh, urb->dev->devpath, urb->ep->desc.bEndpointAddress); Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Ondrej Zary on 3 Dec 2009 04:40
On Wednesday 02 December 2009, Oliver Neukum wrote: > Am Mittwoch, 2. Dezember 2009 09:52:08 schrieb Ondrej Zary: > > It shows "prev is NULL, qh=f6581080, ehci->async=f6581000". > > > > The problem is that activating usbmon causes the problem to disappear. No > > errors in maybe 15 attempts. It appeared on 2nd attempt after unloading > > usbmon. > > That suggests a use after free bug, which usbmon prevents holding > an additional reference. > Are you using SLAB_DEBUG? Just tried 2.6.32-rc8-next-20091202 with SLAB_DEBUG enabled and got this: BUG: unable to handle kernel paging request at 6b6b6b6b IP: [<f7c649a3>] qh_completions+0x6c/0x400 Call Trace: lapic_next_event ehci_work tick_dev_program_event tick_program_event hrtimer_interrupt ehci_watchdog+0x0/0x37 ehci_watchdog+0x2b/0x37 run_timer_softirq __do_softirq irq_exit do_IRQ common_interrupt acpi_idle_enter_bm cpuidle_idle_call cpu_idle start_kernel Code: 00 00 00 00 3c 03 88 44 24 1f 0f 94 c0 81 ea ec 00 00 00 0f b6 c0 89 44 24 14 89 14 24 8b 46 0c 31 db bd 8d ff ff ff c6 46 28 00 <8b> 08 89 4c 24 10 e9 ec 02 00 00 83 e8 38 85 db 89 44 24 24 8b -- Ondrej Zary -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |