From: hobure66 on 2 Jul 2010 02:40
Hi all!
I am debugging a strange interrupt problem with multiport serial cards and I am stuck. Suggestions are welcome!
There are two mostly identical machines; let's call them Production and Testing. In Production there is a MOXA 8-port serial card sitting in a PCI Express slot. In Testing we have the same MOXA card and a second, 4-port card from digitus. I have no access to the Production machine, but I am free to experiment on the Testing machine. The Production kernel is 2.6.28.8-rt16. Both machines are 64-bit x86 with 2 cores, but the kernel is compiled with CONFIG_X86_32=y for whatever reason (I am not allowed to change this on the Production machine). Both cards have their own serial drivers, but they implement standard UART chips.
The original issue found on the Production machine is that the MOXA card locks up after days of use. When this happens, the interrupt counter in /proc/interrupts no longer increases when bytes are received on the line. I tried to reproduce this on the Testing machine, and after a day and a half of experimenting with different setups of seemingly irrelevant things I could reliably reproduce the same (or a very similar) problem with the digitus card. I believe that if I find the solution for this problem, the same solution would help on the remote Production machine as well.
I connected one of the ports of the digitus card to a port of the other card on the same machine, using 3 wires (gnd, rx/tx cross-connected). After boot, I load the kernel driver and set the line to 115200 baud, 8 bits, no parity. I start sending bytes as fast as possible in both directions on the cable, while also reading the input on both ports. The digitus card locks up after some time; how long it takes varies from boot to boot. The interrupt counter stops increasing. Unloading and reloading the card's kernel module doesn't help; only a reboot does.
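The symptom above (a counter in /proc/interrupts that stops moving while traffic is still flowing) can be watched for mechanically. The following is a minimal sketch, not the poster's tooling: it assumes the usual /proc/interrupts layout (a header row of CPU names, then one row per interrupt with a count per CPU), and the function names `parse_interrupts`, `read_snapshot` and `is_stalled` are hypothetical.

```python
def parse_interrupts(text):
    """Return {irq_label: count summed over all CPUs} for /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row looks like: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # reached the controller/device description
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def read_snapshot(path="/proc/interrupts"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_interrupts(handle.read())


def is_stalled(before, after, irq):
    """True if the counter for `irq` did not change between two snapshots."""
    return after.get(irq, 0) == before.get(irq, 0)
```

Taking one `read_snapshot()` before and one after a burst of traffic on the cable, and passing both to `is_stalled()` with the card's IRQ number as a string, would flag the lockup without having to compare the numbers by eye.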
I could not reproduce the same problem with the same kernel version/config without the real-time patch, but I am not sure whether that is just a coincidence.
I started debugging by looking at the UART registers during such a lockup. It seems the chip has an interrupt pending, but since no interrupt was delivered to the kernel module, the module doesn't poll the registers. My first idea was that an interrupt is lost by the driver and the UART is not polled, so it stays in the interrupt pending state. If an interrupt is already pending on the UART, a new interrupt won't be generated (right?), so the card will be waiting for the driver to poll the registers and clear the interrupt pending flag, while the driver will be waiting for an interrupt before it does anything. Fortunately, if I stop all reading and writing processes for a port, the kernel module cleans up the UART and there is no interrupt pending anymore. However, during the lockup I still don't get any new interrupt after this cleanup, even if the UART chip goes into the interrupt pending state again.
To see what happens with the interrupts, I added printk()s all along the path from do_IRQ and handle_fasteoi_irq down to the card's driver. Looking at the sequences, nothing special happens in the last few interrupts before the lockup.
I also wrote a little test script that uses direct port I/O to query and set UART ports. I switch the UART into loopback mode so I can make sure it really works, then I send a few bytes and read those bytes back. Meanwhile I get at least one transition from no interrupt pending to interrupt pending on the UART. During normal operation this increases the interrupt counter, but after a lockup there is no increase, with or without the digitus kernel module loaded.
My first question is whether the following conclusion makes sense: in such a lockup the UART chip generates an event that should cause an interrupt, but the interrupt is either not generated by the card, not delivered to the CPU, or not collected by the kernel.
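A test script along the lines described above could look roughly like the sketch below. It is not the script used in this report. It assumes the "standard UART chips" are 16550-compatible (IIR at offset 2, where bit 0 cleared means an interrupt is pending; MCR at offset 4, where bit 4 enables internal loopback), that the ports are reachable through I/O port space, and that /dev/port is available and opened as root. The helper names and the base address in the usage note are hypothetical.

```python
import os

# 16550 register offsets from the base I/O port (assumes a 16550-compatible chip)
IIR, MCR = 2, 4
MCR_LOOP = 0x10    # MCR bit 4: internal loopback
IIR_NO_INT = 0x01  # IIR bit 0: 1 = nothing pending, 0 = interrupt pending

# IIR bits 3..1 identify the highest-priority pending source
IIR_SOURCES = {
    0b011: "receiver line status",
    0b010: "received data available",
    0b110: "character timeout",
    0b001: "transmitter holding register empty",
    0b000: "modem status",
}


def decode_iir(value):
    """Return (pending, source_name) for a raw IIR byte."""
    if value & IIR_NO_INT:
        return (False, None)
    return (True, IIR_SOURCES.get((value >> 1) & 0x7, "unknown"))


def inb(port):
    """Read one byte from an I/O port via /dev/port (needs root)."""
    fd = os.open("/dev/port", os.O_RDONLY)
    try:
        return os.pread(fd, 1, port)[0]
    finally:
        os.close(fd)


def outb(port, value):
    """Write one byte to an I/O port via /dev/port (needs root)."""
    fd = os.open("/dev/port", os.O_WRONLY)
    try:
        os.pwrite(fd, bytes([value & 0xFF]), port)
    finally:
        os.close(fd)


def set_loopback(base, enable, read=inb, write=outb):
    """Set or clear the MCR loopback bit; return the MCR value written."""
    mcr = read(base + MCR)
    mcr = (mcr | MCR_LOOP) if enable else (mcr & ~MCR_LOOP & 0xFF)
    write(base + MCR, mcr)
    return mcr
```

For example, with a made-up base address, `set_loopback(0xD000, True)` followed by `decode_iir(inb(0xD000 + IIR))` would report whether the chip considers an interrupt pending and from which source. Note that reading IIR can itself clear a pending transmitter-empty interrupt on a 16550, so polling it changes what the driver would later see.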
Second question: any suggestions on how to go on debugging this? Is there a way to poll the APIC settings on the fly? Or to generate an artificial interrupt on the APIC somehow, to see if it gets to the kernel? Or is there an interrupt counter in the APIC I could look at?
TIA,
Tibor Palinkas