From: hobure66 on
Hi all!

I am debugging a strange interrupt problem with multiport serial cards
and I am stuck with it. Suggestions are welcome!

There are two mostly identical machines; let's call them Production and
Testing. In Production there is a MOXA 8 port serial card sitting in a
PCI express slot. In Testing we have the same MOXA card and a second, 4
port card from digitus. I have no access to the Production machine but I
am free to make experiments on the Testing machine. Production kernel is
2.6.28.8-rt16. Both machines are 64-bit x86 with 2 cores, but the kernel
is compiled with CONFIG_X86_32=y for whatever reason (I am not allowed
to change this on the Production machine). Both cards have their own serial
drivers, but they implement standard UART chips.

The original issue found on the Production machine is that the MOXA card
locks up after days of use. When this happens, the interrupt counter in
/proc/interrupts is not increased when bytes are received on the line.
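
A minimal sketch of how the counter check can be scripted instead of done
by eye, assuming the usual /proc/interrupts layout (a header row of CPU
names, then one row per IRQ with one count per CPU). The helper names are
made up for illustration, and the IRQ number is whatever line the card was
assigned on the machine in question.

```python
def irq_counts(text, irq):
    """Return the per-CPU counts for one IRQ row of /proc/interrupts text."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == str(irq):
            return [int(field) for field in rest.split()[:ncpus]]
    raise KeyError(irq)


def irq_delta(before, after, irq):
    """Interrupts delivered on `irq` between two snapshots, summed over CPUs."""
    return sum(irq_counts(after, irq)) - sum(irq_counts(before, irq))


def snapshot(path="/proc/interrupts"):
    """Read the current contents of /proc/interrupts."""
    with open(path) as f:
        return f.read()
```

Taking one snapshot(), pushing some bytes over the line, and taking a second
snapshot() gives a delta of 0 in the locked-up state and a positive number
otherwise.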

I tried to reproduce this on the Testing machine, and after a day and a
half of experimenting with different setups of seemingly irrelevant
things I could reliably reproduce the same (or a very similar) problem
with the digitus card. I believe if I find the solution for this
problem, the same solution would help on the remote Production machine
as well.

I connected one of the ports of the digitus card with a port of the
other card on the same machine, using 3 wires (gnd, rx/tx
cross-connected). After boot, I load the kernel driver, set baud rate to
115200, 8 bit, no parity. I start sending bytes as fast as possible in
both directions on the cable, also reading the input on both ports. The
digitus card locks up after some time; how long varies from boot to boot.
The interrupt counter is no longer increased. Unloading and reloading the
card's kernel module doesn't help, only a reboot helps. I could not reproduce the
same problem with the same kernel version/config without the real-time patch,
but I am not sure if it's just a coincidence.

I started debugging by looking at UART registers during such a lockup.
It seems the chip has an interrupt pending, but as no interrupt was
delivered to the kernel module, the module doesn't poll the registers.
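
For reference, a small sketch of how the interrupt pending state can be
read out of the Interrupt Identification Register, assuming the chips
follow the standard 16550 layout (bit 0 clear means an interrupt is
pending, bits 1-3 identify the source). The function and table names are
illustrative only and not taken from either card's driver.

```python
IIR_NO_INT = 0x01   # bit 0 set: no interrupt pending
IIR_ID_MASK = 0x0E  # bits 1-3: interrupt source

IIR_SOURCES = {
    0x00: "modem status",
    0x02: "transmitter holding register empty",
    0x04: "received data available",
    0x06: "receiver line status",
    0x0C: "character timeout",
}


def decode_iir(value):
    """Return (pending, source) for a raw IIR byte; source is None if idle."""
    if value & IIR_NO_INT:
        return (False, None)
    return (True, IIR_SOURCES.get(value & IIR_ID_MASK, "unknown"))
```

With FIFOs enabled the top two bits are set as well, so an idle UART
typically reads 0xC1 and one with received data waiting reads 0xC4.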

My first idea was that an interrupt is lost by the driver and the UART
is not polled so it stays in interrupt pending mode; if we
have interrupt pending already on the UART, a new interrupt won't be
generated (right?), so the card will be waiting for the driver to poll
the registers clearing the interrupt pending flag while the driver will
be waiting for an interrupt to do anything. Fortunately, if I stop all
reading and writing processes for a port, the kernel module cleans up
the UART and there's no interrupt pending anymore. However, during the
lockup I still don't get any new interrupts after this cleanup, even when
the UART chip goes into interrupt pending mode again.

To see what happens with the interrupts, I added printk()s all around in
do_IRQ and handle_fasteoi_irq, down to the card's driver. Looking at
the sequences, nothing special happens in the last few interrupts
before the lockup.

I also wrote a little test script that uses direct port I/O to query
and set UART registers. I switch the UART into loopback mode so I can ensure
it really works, then I send a few bytes and read those bytes back. Meanwhile
I get at least one transition from no interrupt pending to interrupt
pending on the UART. During normal operation this increases the interrupt
counter, but after a lockup there is no increase, with or without the digitus
kernel module loaded.
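
A rough sketch of such a loopback test, under several assumptions: the UART
is 16550-compatible, it is reachable through I/O ports (not memory-mapped),
DLAB in the LCR is clear, and /dev/port is available to root. The base
address, the class name and the function names are hypothetical; this is
not the actual script.

```python
import os

# 16550 register offsets from the UART base port (DLAB assumed clear)
RBR_THR, IER, IIR, LCR, MCR, LSR = 0, 1, 2, 3, 4, 5
IER_RDI = 0x01    # enable "received data available" interrupt
MCR_LOOP = 0x10   # internal loopback mode
LSR_DR = 0x01     # receive data ready
LSR_THRE = 0x20   # transmitter holding register empty


class DevPort:
    """Byte-wide port I/O through /dev/port (Linux, needs root)."""

    def __init__(self, path="/dev/port"):
        self.fd = os.open(path, os.O_RDWR)

    def inb(self, addr):
        os.lseek(self.fd, addr, os.SEEK_SET)
        return os.read(self.fd, 1)[0]

    def outb(self, addr, value):
        os.lseek(self.fd, addr, os.SEEK_SET)
        os.write(self.fd, bytes([value & 0xFF]))


def wait_for(io, addr, mask, spins=100000):
    """Busy-wait until any bit of `mask` is set in the register at `addr`."""
    for _ in range(spins):
        if io.inb(addr) & mask:
            return
    raise TimeoutError("bit 0x%02x never set at port 0x%x" % (mask, addr))


def loopback_check(io, base, payload=b"\x55\xaa"):
    """Send payload through internal loopback.

    Returns (echoed bytes, IIR value seen before each byte was read back).
    """
    old_mcr = io.inb(base + MCR)
    old_ier = io.inb(base + IER)
    io.outb(base + MCR, old_mcr | MCR_LOOP)
    io.outb(base + IER, IER_RDI)
    echoed, iirs = bytearray(), []
    try:
        for byte in payload:
            wait_for(io, base + LSR, LSR_THRE)
            io.outb(base + RBR_THR, byte)
            wait_for(io, base + LSR, LSR_DR)
            iirs.append(io.inb(base + IIR))   # bit 0 clear: pending
            echoed.append(io.inb(base + RBR_THR))
    finally:
        # put the UART back the way it was found
        io.outb(base + IER, old_ier)
        io.outb(base + MCR, old_mcr)
    return bytes(echoed), iirs
```

Calling loopback_check(DevPort(), base) with the port's real base address
and comparing the /proc/interrupts counter before and after shows whether
the pending state seen in the IIR ever turns into a delivered interrupt.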

My first question is whether the following conclusion makes sense: in
such a lockup the UART chip generates an event that should cause an
interrupt, but the interrupt is not generated by the card, or is not
delivered to the CPU, or is not collected by the kernel. Second question:
any suggestions on how to continue debugging this? Maybe a way to poll APIC
settings on the fly? Or to generate an artificial interrupt on the APIC
somehow to see if it gets to the kernel? Or is there an interrupt
counter in the APIC I could look at?



TIA,

Tibor Palinkas
