From: Benjamin Herrenschmidt
On Fri, 2010-05-07 at 07:30 +0200, Frederic Weisbecker wrote:
>
>
> I like the safeguard against the BKL, it indeed looks like something
> we should have in .34
>
> But I really don't like the timeout.

And I hate not having it :-)

> This is going to make things even worse if we have another cause
> of deadlock by hiding the worst part of the consequences without
> actually solving the problem.

Yes and no. There are two reasons why the timeout is good. One is the
safeguard part, and it's arguable whether that helps hide bugs or not,
but there are also very real cases that aren't bugs where I believe we
should get out and go to sleep.

I.e., if the mutex owner is running for a long time and nobody is
contending on your CPU, you're simply not going to hit the
need_resched() test. That means you will spin, which means you will suck
a lot more power as well, not counting the potentially bad effects on
rebalancing, load average, etc.

There's also the fact that 2 CPUs or more trying to obtain it at once
may all go into spinning, which can lead to interesting results in terms
of power consumption (and cpu_relax() doesn't help that much).

I really don't think it's a good idea to turn mutexes into potential
multi-jiffy spinning things like that.

> And since the induced latency or deadlock won't be easily visible
> anymore, we'll miss that there is a problem. So we are going to spin for
> two jiffies and only someone doing specific latency measurements will
> notice, if he's lucky enough to hit the bug.

Well, the thing is that it may not be a bug.

The thing is that the actual livelock should really only happen with the
BKL, since that's the only thing we have that allows for AB->BA semantics.
Anything else should hopefully be caught by lockdep.

So I don't think there's that much to fear about hidden bugs.

But I -also- don't see the point of spinning potentially for a very long
time instead of going to sleep and saving power. The adaptive spinning
goal is to have an opportunistic optimization based on the idea that the
mutex is likely to be held for a very short period of time by its owner
and nobody's waiting for it yet. Ending up doing multi-jiffy spins
just doesn't fit in that picture. In fact, I was tempted to make the
timeout a lot shorter, but decided against calling into clock sources
etc... and instead kept it simple with jiffies.
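
For the record, the shape of the check is roughly this. A simplified
sketch of the idea rather than the actual patch, with the lock internals
elided and owner_running() just a made-up placeholder for the owner check:

        /*
         * Opportunistically spin while the lock owner is running on
         * another CPU, but bail out when somebody wants this CPU or
         * once we have been spinning for a couple of jiffies.
         */
        unsigned long timeout = jiffies + 2;

        while (owner_running(lock)) {
                if (need_resched())
                        break;          /* somebody wants the CPU */
                if (time_after(jiffies, timeout))
                        break;          /* spun too long, go to sleep */
                cpu_relax();            /* be nice to SMT siblings */
        }
        /* fall through to the normal sleeping slow path */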

> Moreover, that adds some unnecessary (small) overhead to this path.

Uh? So? This is the contended path where we are... spinning :-) The
overhead of reading jiffies and comparing here is simply never going to
show up on any measurement, I bet you :-)

> Maybe we can have it as a debugging option, something that would
> be part of lockdep, which would require CONFIG_DEBUG_MUTEXES to
> support mutex adaptive spinning.

No. What I would potentially add as part of lockdep, however, is a way to
instrument how often we get out of the spin via the timeout. That might be
useful information for figuring out some of those runaway code paths, but
they may well happen for very legitimate reasons and aren't a bug per se.

> A debugging option that could just dump the held locks and the
> current one if we spin for an excessive timeslice.

Ben.


From: Mike Galbraith
On Fri, 2010-05-07 at 07:30 +0200, Frederic Weisbecker wrote:

> I like the safeguard against the BKL, it indeed looks like something
> we should have in .34
>
> But I really don't like the timeout.
>
> This is going to make things even worse if we have another cause of
> deadlock by hiding the worst part of the consequences without actually
> solving the problem.
> And since the induced latency or deadlock won't be easily visible anymore,
> we'll miss that there is a problem. So we are going to spin for two jiffies
> and only someone doing specific latency measurements will notice, if he's
> lucky enough to hit the bug.
>
> Moreover, that adds some unnecessary (small) overhead to this path.
>
> Maybe we can have it as a debugging option, something that would
> be part of lockdep, which would require CONFIG_DEBUG_MUTEXES to
> support mutex adaptive spinning.
>
> A debugging option that could just dump the held locks and the
> current one if we spin for an excessive timeslice.

FYI, there's a case where OWNER_SPIN eats astounding amounts of CPU.

Run virgin AIM7 V1.1 with many, many tasks and a nil workfile. When the
thing kicks off preparing for the test, it forks off all these tasks, which
then all try to phone home via a pipe. The result is horrible to behold.

There's a fix for AIM7's scalability problem, but the virgin code looks
like a decent OWNER_SPIN corner-case test. It turns a 32-core SMT box
into an expensive space heater at a sufficiently high task count :)

-Mike

From: Frederic Weisbecker
On Fri, May 07, 2010 at 04:01:56PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-05-07 at 07:30 +0200, Frederic Weisbecker wrote:
> >
> >
> > I like the safeguard against the BKL, it indeed looks like something
> > we should have in .34
> >
> > But I really don't like the timeout.
>
> And I hate not having it :-)
>
> > This is going to make things even worse if we have another cause
> > of deadlock by hiding the worst part of the consequences without
> > actually solving the problem.
>
> Yes and no. There are two reasons why the timeout is good. One is the
> safeguard part, and it's arguable whether that helps hide bugs or not,
> but there are also very real cases that aren't bugs where I believe we
> should get out and go to sleep.
>
> I.e., if the mutex owner is running for a long time and nobody is
> contending on your CPU, you're simply not going to hit the
> need_resched() test. That means you will spin, which means you will suck
> a lot more power as well, not counting the potentially bad effects on
> rebalancing, load average, etc.
>
> There's also the fact that 2 CPUs or more trying to obtain it at once
> may all go into spinning, which can lead to interesting results in terms
> of power consumption (and cpu_relax() doesn't help that much).
>
> I really don't think it's a good idea to turn mutexes into potential
> multi-jiffy spinning things like that.



Yeah, I was considering only the deadlock issue.

OK, it's a problem for power consumption. That said, I wonder if we have
that many places where a task can hold a mutex for two jiffies without
sleeping in between, since spinning stops once the owner goes to sleep.

If that's an issue for power saving, it means we have places to fix.
Also, I wonder if we should disable adaptive spinning for people who want
aggressive power saving. At least it would be worth testing to check the
result.

Also, to solve this problem where several CPUs may spin on the owner, I
wonder whether adaptive spinning is doing the right thing here. We should
have only one spinner, I think, and the rest should go to sleep, as the
time spent spinning on several subsequent owners would be better spent
doing something else (scheduling something else, or saving power).
In fact, that too could deserve some tests.


> > And since the induced latency or deadlock won't be easily visible
> > anymore, we'll miss that there is a problem. So we are going to spin for
> > two jiffies and only someone doing specific latency measurements will
> > notice, if he's lucky enough to hit the bug.
>
> Well, the thing is that it may not be a bug.
>
> The thing is that the actual livelock should really only happen with the
> BKL, since that's the only thing we have that allows for AB->BA semantics.
> Anything else should hopefully be caught by lockdep.
>
> So I don't think there's that much to fear about hidden bugs.
>
> But I -also- don't see the point of spinning potentially for a very long
> time instead of going to sleep and saving power. The adaptive spinning
> goal is to have an opportunistic optimization based on the idea that the
> mutex is likely to be held for a very short period of time by its owner
> and nobody's waiting for it yet. Ending up doing multi-jiffy spins
> just doesn't fit in that picture. In fact, I was tempted to make the
> timeout a lot shorter, but decided against calling into clock sources
> etc... and instead kept it simple with jiffies.


OK. But even from this power-saving angle, I think this timeout-based
heuristic is still only hiding the problem. Or rather, it mitigates it more
than it solves it, which is of course better than leaving things as they
are. But if the real source is places that hold mutexes for too long
without sleeping, I think we should rather identify those places and fix
the ugly locking there.

Plus maybe allowing only one spinner at a time.

But yeah, until we reach a good solution for that, maybe this timeout is
worth having.



> > Moreover, that adds some unnecessary (small) overhead to this path.
>
> Uh? So? This is the contended path where we are... spinning :-) The
> overhead of reading jiffies and comparing here is simply never going to
> show up on any measurement, I bet you :-)



Yeah, probably in fact :)



> > Maybe we can have it as a debugging option, something that would
> > be part of lockdep, which would require CONFIG_DEBUG_MUTEXES to
> > support mutex adaptive spinning.
>
> No. What I would potentially add as part of lockdep, however, is a way to
> instrument how often we get out of the spin via the timeout. That might be
> useful information for figuring out some of those runaway code paths, but
> they may well happen for very legitimate reasons and aren't a bug per se.



We could make a trace event for that.
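
Something along these lines, maybe. A completely untested sketch, with the
event name and fields made up just for illustration:

        TRACE_EVENT(mutex_spin_timeout,

                TP_PROTO(struct mutex *lock, unsigned long spun),

                TP_ARGS(lock, spun),

                TP_STRUCT__entry(
                        __field(void *,         lock)
                        __field(unsigned long,  spun)
                ),

                TP_fast_assign(
                        __entry->lock = lock;
                        __entry->spun = spun;
                ),

                TP_printk("lock=%p spun=%lu jiffies",
                          __entry->lock, __entry->spun)
        );

with the matching trace_mutex_spin_timeout() call placed at the point where
the spin gives up.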

Thanks.

From: Benjamin Herrenschmidt

> Yeah, I was considering only the deadlock issue.
>
> OK, it's a problem for power consumption. That said, I wonder if we have
> that many places where a task can hold a mutex for two jiffies without
> sleeping in between, since spinning stops once the owner goes to sleep.
>
> If that's an issue for power saving, it means we have places to fix.
> Also, I wonder if we should disable adaptive spinning for people who want
> aggressive power saving. At least it would be worth testing to check the
> result.

I don't think it's only about aggressive power saving. I'm not sure this
is directly related, but somebody recently reported a case where pretty
much every CPU in the machine ends up spinning on the one holding the
mutex; power consumption skyrockets and everything heats up at once. Not
nice even on servers, on workloads where otherwise everybody would have
gone to sleep.

My feeling is that spinning should last only a very small amount of
time, ideally using a better clock than jiffies, maybe a few dozen ms at
max. But I decided to compromise for code simplicity and less overhead by
using jiffies instead. Maybe a 1-jiffy delta would have been enough, not
two though (a delta of one means a random timeout of anywhere between 0
and 1 jiffy, depending on where in the current tick the spin starts).

> Also, to solve this problem where several CPUs may spin on the owner, I
> wonder whether adaptive spinning is doing the right thing here. We should
> have only one spinner, I think, and the rest should go to sleep, as the
> time spent spinning on several subsequent owners would be better spent
> doing something else (scheduling something else, or saving power).
> In fact, that too could deserve some tests.

Right, the problem is due to the fact that we skip spinning if there's
already a waiter, but we don't know whether there is already a spinner, so
we can end up with multiple spinners.

I don't see a non-invasive way to fix that... we could add a spinner
counter to the mutex, but that sucks a bit. Might still be worthwhile, not
sure. Peter, what do you reckon?
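
Just to illustrate what I mean, something like the below. Purely
hypothetical: 'spinners' does not exist in struct mutex today, this is not
a proposal:

        /*
         * Allow a single optimistic spinner per mutex, everybody else
         * goes straight to the sleeping slow path. 'spinners' would be
         * a new atomic_t added to struct mutex.
         */
        if (atomic_cmpxchg(&lock->spinners, 0, 1) != 0)
                goto slowpath;                  /* somebody already spins */

        /* ... optimistic spin loop goes here ... */

        atomic_set(&lock->spinners, 0);         /* done spinning */

That's one more atomic op on the contended path, hence the "sucks a bit".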

> OK. But even from this power-saving angle, I think this timeout-based
> heuristic is still only hiding the problem. Or rather, it mitigates it more
> than it solves it, which is of course better than leaving things as they
> are. But if the real source is places that hold mutexes for too long
> without sleeping, I think we should rather identify those places and fix
> the ugly locking there.

That's a job for lockdep or something similar :-) Doing so is orthogonal;
I don't see it conflicting with limiting the spinning time. Again, I tend
to have a different "view" of it: spinning is an opportunistic optimization
based on the idea that the owner will release the mutex soon. I don't see
the point of spinning for a long time :-)

Note that power management is one side of the story. Another one is
relinquishing the CPU to the hypervisor so some other partition gets some
time. Etc... There's more than one consequence to ending up with CPUs
spinning instead of sleeping, and they are not all trivial, so by letting
mutexes spin for potentially a "long time" we quite fundamentally change
the semantics of those mutexes, with impacts in other areas that we don't
necessarily see. This goes beyond the scope of a simple optimization and
becomes problematic.

Yes, we can try to find all of these areas and address them one by one,
or we can use my simpler approach, which is to keep the overall semantic
of mutexes, which is to sleep... and bound the optimisation to what it
should be: a short-term spin to speed up short-lived contention cases.

But yeah, the argument is very valid that maybe we should spin for even
less time, and that maybe we should use a finer-grained clock to limit the
spin to a few microseconds at most. The exact number remains to be
determined :-)

> Plus maybe allowing only one spinner at a time.
>
> But yeah, until we reach a good solution for that, maybe this timeout is
> worth having.

Cheers,
Ben.

From: Peter Zijlstra
On Sat, 2010-05-08 at 08:27 +1000, Benjamin Herrenschmidt wrote:
> > Also, to solve this problem where several CPUs may spin on the owner, I
> > wonder whether adaptive spinning is doing the right thing here. We should
> > have only one spinner, I think, and the rest should go to sleep, as the
> > time spent spinning on several subsequent owners would be better spent
> > doing something else (scheduling something else, or saving power).
> > In fact, that too could deserve some tests.
>
> Right, the problem is due to the fact that we skip spinning if there's
> already a waiter, but we don't know whether there is already a spinner, so
> we can end up with multiple spinners.
>
> I don't see a non-invasive way to fix that... we could add a spinner
> counter to the mutex, but that sucks a bit. Might still be worthwhile, not
> sure. Peter, what do you reckon?

If it's a large problem, the lock is overly contended and _that_ needs
fixing. I don't at all feel like adding atomic ops to the spin loop to
try and detect this.

As to the 2-jiffy spin timeout, I guess we should add a lockdep warning
for that, because anybody holding a mutex for longer than 2 jiffies without
sleeping does need fixing anyway.
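
Something like this at the timeout check, perhaps. Untested sketch; the
real thing should probably go through lockdep proper and dump the owner's
held locks:

        if (time_after(jiffies, timeout)) {
#ifdef CONFIG_DEBUG_MUTEXES
                /* whoever trips this held a mutex too long without sleeping */
                WARN_ONCE(1, "mutex spin aborted after 2 jiffies, owner never slept\n");
#endif
                break;          /* give up spinning, take the sleeping path */
        }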
