From: Andi Kleen on
On Thu, Mar 25, 2010 at 12:08:23AM +0100, Thomas Gleixner wrote:
> On Wed, 24 Mar 2010, Thomas Gleixner wrote:
>
> > On Wed, 24 Mar 2010, Andi Kleen wrote:
> >
> > > Prevent nested interrupts when the IRQ stack is near overflowing v2
> > >
> > > Interrupts can always nest when they don't run with IRQF_DISABLED.
> > >
> > > When a lot of interrupts hit the same vector on the same
> > > CPU nested interrupts can overflow the irq stack and cause hangs.
> That's utter nonsense. An interrupt storm on the same vector does not
> cause irq nesting. The irq code prevents reentering a handler and in

Sorry, I meant the same CPU, not the same vector. Yes, the reference
to the same vector was misleading.

"
Multiple vectors on a multi port NIC pointing to the same CPU,
all hitting the irq stack until it overflows.
"

> case of MSI-X it just disables the IRQ when it comes again while the
> first irq on that vector is still in progress. So the maximum nesting
> is two up to handle_edge_irq() where it disables the IRQ and returns
> right away.

The real maximum nesting is every IRQ that runs with interrupts enabled
and points to the same CPU. With enough busy IRQ sources you go boom.
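
To make the worst case concrete, here is a rough user-space sketch in C
(not kernel code). The struct, the field names, and the frame sizes are
hypothetical and only there for illustration. The one assumption taken
from the discussion is that a given handler is never re-entered, so each
source contributes at most one frame to the stack.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one IRQ source routed to this CPU. */
struct irq_source {
    size_t frame_bytes;   /* assumed stack use of one handler invocation */
    bool   irqs_enabled;  /* handler runs with interrupts enabled */
};

/*
 * Worst-case bytes on the IRQ stack for one CPU.
 *
 * Handlers that run with interrupts enabled can all be interrupted by
 * another source, so in the worst case every one of them is on the stack
 * at once. A handler that runs with interrupts disabled cannot be
 * interrupted, so at most one such handler sits on top of the chain.
 */
size_t worst_case_irq_stack(const struct irq_source *src, size_t n)
{
    size_t nested = 0;    /* sum over handlers that allow nesting */
    size_t max_leaf = 0;  /* largest handler that runs with irqs off */

    for (size_t i = 0; i < n; i++) {
        if (src[i].irqs_enabled)
            nested += src[i].frame_bytes;
        else if (src[i].frame_bytes > max_leaf)
            max_leaf = src[i].frame_bytes;
    }
    return nested + max_leaf;
}
```

With two nesting sources of 1024 and 2048 bytes plus one non-nesting
source of 512 bytes, the function returns 3584. Every additional busy
vector that runs with interrupts enabled adds its full frame to that sum.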

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on
> Pretty much the only 'core' driver today which enables IRQs in the irq
> handlers and needs it is the old IDE layer. There are also a couple of

Thanks.

I'm tempted to just ignore it in this case, but in theory it might
still have trouble if there are a lot of interrupts to the same CPU.

I've only had a report on a very large system with a very high interrupt
rate on a very fast NIC though, so presumably it's not too common.

Anyway, this patch will fix the problem for all drivers that do
not explicitly enable interrupts, which is the overwhelming majority.
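
The idea can be sketched roughly as follows. This is a simplified
user-space model, not the patch itself: the function name, the stack
size, and the safety margin are hypothetical values chosen only for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define IRQ_STACK_SIZE   16384u  /* assumed size, illustration only */
#define IRQ_STACK_MARGIN  4096u  /* assumed safety margin, illustration only */

/*
 * Decide whether a handler may run with interrupts enabled.
 *
 * stack_used:    bytes of IRQ stack already consumed on this CPU
 * wants_irqs_on: handler was registered without IRQF_DISABLED
 *
 * When the remaining stack drops to the margin or below, the handler is
 * run with interrupts disabled so that no further nesting can happen.
 */
bool may_enable_irqs(size_t stack_used, bool wants_irqs_on)
{
    if (!wants_irqs_on)
        return false;
    if (stack_used >= IRQ_STACK_SIZE)
        return false;  /* already at or past the end of the stack */
    return IRQ_STACK_SIZE - stack_used > IRQ_STACK_MARGIN;
}
```

A handler that re-enables interrupts on its own would bypass a check
like this, which is why the fix only covers drivers that do not
explicitly enable interrupts.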

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
From: Andi Kleen on
On Thu, Mar 25, 2010 at 12:16:17PM +0100, Thomas Gleixner wrote:
> > Pretty much the only 'core' driver today which enables IRQs in the irq
> > handlers and needs it is the old IDE layer. There are also a couple of
> > drivers which play games with disable/enable_irq in the IRQ paths for
> > other reasons (lack of irq threads when written and a hardware model that's
> > totally SMP unfriendly). 8390 is the obvious one here and it at least
> > would be far far saner using threaded IRQs and normal locking with IRQs
> > unmasked.
>
> Right, but that's not the problem here. We talk about a (hopefully)
> well written interrupt handler which runs for a very short
> time.

The NIC handlers can do quite a lot of work under high traffic,
even with interrupt mitigation and NAPI.

> What's the point of running it with interrupts enabled ?

Other interrupts.

> Nothing, we just run into stack overflow problems. So what's better:
> an unreliable and ugly hackaround

I don't think that's an accurate description of the patch at all.
Besides, I believe it's reliable in all cases that matter.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: Andi Kleen on
> > Well it's simply the current state of affairs today. I'm merely
> > attempting to make the current state slightly safer without breaking
> > anything in the process.
>
> Well, I'd agree if those stack overflows would be a massive reported
> problem.

At least the people who reported it to me thought it was a massive
problem @)

> Right now they happen with a weird test case which points out a
> trouble spot. Multi vector NICs under heavy load. So why not go there
> and change the handful of drivers to run their handlers with irqs
> disabled?

Ok, but afaik it's not that small a number: MSI-X support is getting
more and more widespread. It's pretty universal in the 10+GbitE space
and is starting to get deployed for block too.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: Andi Kleen on
> I think the patch as posted solves a real problem, but also perpetuates a bad
> situation.
>
> At minimum we should print a (one-time) warning that some badness occurred.
> That would push us either in the direction of improving drivers, or towards
> improving the generic code.

What should a driver do to prevent that? I don't see what it could do
short of castrating itself (like refusing to use multiple ports).
As Linus says, the driver doesn't know if setting IRQF_DISABLED is safe.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.