From: David Miller on 30 Nov 2006 01:40

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 07:17:58 +0100

> * David Miller <davem@davemloft.net> wrote:
>
> > We can make explicit preemption checks in the main loop of
> > tcp_recvmsg(), and release the socket and run the backlog if
> > need_resched() is TRUE.
> >
> > This is the simplest and most elegant solution to this problem.
>
> yeah, I like this one. If the problem is "too long locked section", then
> the most natural solution is to "break up the lock", not to "boost the
> priority of the lock-holding task" (which is what the proposed patch
> does).

Ingo, you've misread the problem :-)  The issue is that we actually
don't hold any locks that prevent preemption, so we can take preemption
points that the TCP code wasn't designed with in mind.

Normally, we control the sleep point very carefully in the TCP
sendmsg/recvmsg code, such that when we sleep we drop the socket lock
and process the backlog packets that accumulated while the socket was
locked.  With preemption we can't control that properly.

The problem is that we really do need to run the backlog any time we
give up the CPU in the sendmsg/recvmsg path, or things get really
erratic.  ACKs don't go out as early as we'd like them to, etc.

It isn't easy to do generically, perhaps, because we can only drop the
socket lock at certain points, and we need to do that to run the
backlog.

This is why my suggestion is to preempt_disable() as soon as we grab
the socket lock, and explicitly test need_resched() at places where it
is absolutely safe, like this:

    if (need_resched()) {
        /* Run packet backlog... */
        release_sock(sk);
        schedule();
        lock_sock(sk);
    }

The socket lock is just a by-hand binary semaphore, so it doesn't block
preemption.  We have to be able to sleep while holding it.
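[To make the proposed ordering concrete, here is a rough sketch of how
that check might be packaged. This is not actual kernel code: the
helper name tcp_recvmsg_check_resched() is invented, and the
preempt_enable()/preempt_disable() pairing around schedule() and
lock_sock() is an assumption about how the "preempt_disable() right
after lock_sock()" idea would have to be ordered so the task never
sleeps with preemption disabled.]

    #include <linux/preempt.h>
    #include <linux/sched.h>
    #include <net/sock.h>

    /* Hypothetical helper, invented for this sketch.  It assumes the
     * caller did lock_sock(sk) followed by preempt_disable(), as in
     * the scheme described above. */
    static void tcp_recvmsg_check_resched(struct sock *sk)
    {
        if (!need_resched())
            return;

        /* release_sock() processes the packets that piled up on the
         * socket backlog while we held the lock, so pending ACKs go
         * out before we yield the CPU. */
        release_sock(sk);
        preempt_enable();

        schedule();

        /* lock_sock() may sleep, so re-take the lock before
         * disabling preemption again. */
        lock_sock(sk);
        preempt_disable();
    }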
From: Ingo Molnar on 30 Nov 2006 02:00

* David Miller <davem@davemloft.net> wrote:

> This is why my suggestion is to preempt_disable() as soon as we grab
> the socket lock, [...]

independently of the issue at hand: in general, the explicit use of
preempt_disable() in non-infrastructure code is quite a heavy tool. Its
effects are global: it disables /all/ preemption (even on PREEMPT_RT).

Furthermore, when preempt_disable() is used for per-CPU data
structures, then [unlike with, for example, a spin-lock] the connection
between the 'data' and the 'lock' is not explicit - causing all kinds
of grief when trying to convert such code to a different preemption
model (such as PREEMPT_RT :-)

So my plan is to remove all "open-coded" use of preempt_disable() [and
raw use of local_irq_save/restore] from the kernel and replace it with
some facility that connects data and lock. (Note that this will not
result in any actual changes at the instruction level: internally,
every such facility still maps to preempt_disable() on non-PREEMPT_RT
kernels, so such code will remain the same as before there.)

        Ingo
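[As a hedged illustration of the two styles being contrasted - not
code from any actual patch - the sketch below uses the 2.6-era per-CPU
accessors (DEFINE_PER_CPU(), __get_cpu_var()); the struct pkt_stats
data, the percpu_lock type and the percpu_lock()/percpu_unlock()
helpers are all invented for this example. The point is only that in
the second variant the object protecting the per-CPU data is named in
the code, so a different preemption model can substitute a real lock
there without auditing callers, while on a non-PREEMPT_RT kernel it
still compiles down to the same preempt_disable()/preempt_enable()
pair.]

    #include <linux/percpu.h>
    #include <linux/preempt.h>

    struct pkt_stats {
        unsigned long rx_packets;
    };

    static DEFINE_PER_CPU(struct pkt_stats, pkt_stats);

    /* Open-coded style: preempt_disable() is what protects the per-CPU
     * counter, but nothing in the code says so. */
    static void stats_inc_opencoded(void)
    {
        preempt_disable();
        __get_cpu_var(pkt_stats).rx_packets++;
        preempt_enable();
    }

    /* Hypothetical "data + lock" facility (names invented here): on a
     * non-PREEMPT_RT kernel it is preempt_disable() in disguise, but
     * the lock object makes the data/lock relationship explicit. */
    struct percpu_lock { /* no members needed on non-RT kernels */ };

    #define percpu_lock(l)    do { (void)(l); preempt_disable(); } while (0)
    #define percpu_unlock(l)  do { (void)(l); preempt_enable(); } while (0)

    static struct percpu_lock pkt_stats_lock;

    static void stats_inc_locked(void)
    {
        percpu_lock(&pkt_stats_lock);
        __get_cpu_var(pkt_stats).rx_packets++;
        percpu_unlock(&pkt_stats_lock);
    }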
From: Ingo Molnar on 30 Nov 2006 02:00

* David Miller <davem@davemloft.net> wrote:

> > yeah, I like this one. If the problem is "too long locked section",
> > then the most natural solution is to "break up the lock", not to
> > "boost the priority of the lock-holding task" (which is what the
> > proposed patch does).
>
> Ingo, you've misread the problem :-)

yeah, the problem isn't a "too long locked section" but "too much time
spent holding a lock", and hence opening ourselves up to possible
negative side-effects of the scheduler's fairness algorithm when it
forces a preemption of that process context with the lock held (forcing
all subsequent packets to be backlogged).

but please read my last mail - I think I'm slowly starting to wake up
;-)  I don't think there is any real problem: a tweak to the scheduler
that in essence gives TCP-using tasks a preference changes the balance
of workloads. Such an explicit tweak is possible already.

furthermore, the tweak allows the shifting of processing from a
prioritized process context into a highest-priority softirq context.
(it's not proven that there is any significant /net win/ in
performance: all that was proven is that if we shift TCP processing
from process context into softirq context, then TCP throughput of that
otherwise penalized process context increases.)

        Ingo
From: David Miller on 30 Nov 2006 02:20

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 07:47:58 +0100

> furthermore, the tweak allows the shifting of processing from a
> prioritized process context into a highest-priority softirq context.
> (it's not proven that there is any significant /net win/ in
> performance: all that was proven is that if we shift TCP processing
> from process context into softirq context, then TCP throughput of that
> otherwise penalized process context increases.)

If we preempt with any packets in the backlog, we send no ACKs, the
sender cannot send, and thus the pipe empties.  That's the problem;
this has nothing to do with scheduler priorities or stuff like that,
IMHO.  The argument goes that if the reschedule is delayed long enough,
the ACKs will be delayed beyond the round trip time and trigger
retransmits, which will absolutely kill performance.

The only reason we block input packet processing while we hold this
lock is that we don't want the receive queue changing underneath us
while we're copying data to userspace.

Furthermore, once you preempt in this particular way, input packet
processing on that socket still doesn't occur, exacerbating the
situation.

Anyways, even if we somehow unlocked the socket and ran the backlog at
preemption points by hand: since we've deferred the whole work of
processing whatever is in the backlog until the preemption point, we've
already lost our quantum, so from a fairness perspective it's perhaps
not legitimate to do that deferred processing right at the point where
preemption is signalled.

It would be different if we really did the packet processing at the
original moment (when we had to queue to the socket backlog, in softirq
context, because the socket was locked), because then we'd return from
the softirq and hit the preemption point earlier, or whatever.

Therefore, perhaps the best approach would be to see whether there is a
way we can still allow input packet processing even while running the
majority of TCP's recvmsg().  It won't be easy :)
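[The socket-lock/backlog handoff being argued over here works roughly
as sketched below. This is a simplified illustration, not the actual
net/core/sock.c or net/ipv4/tcp_ipv4.c code: the real softirq path
takes sk_lock.slock via bh_lock_sock() around this decision and does
more bookkeeping, the sk_add_backlog() call shown uses the two-argument
form from the 2.6 kernels of that era, and the two function names are
invented for the sketch.]

    #include <linux/skbuff.h>
    #include <net/sock.h>

    /* Softirq side: if a process owns the socket (e.g. it is inside
     * tcp_recvmsg()), the packet cannot be processed now and is queued
     * on the backlog instead - so no ACK goes out yet. */
    static void softirq_input_sketch(struct sock *sk, struct sk_buff *skb)
    {
        if (sock_owned_by_user(sk))
            sk_add_backlog(sk, skb);
        else
            sk->sk_backlog_rcv(sk, skb);  /* tcp_v4_do_rcv() for TCP */
    }

    /* Process side: releasing the socket lock is what finally drains
     * the backlog through sk_backlog_rcv() - which is why every place
     * that gives up the CPU in recvmsg() wants a release_sock() first. */
    static void process_side_sketch(struct sock *sk)
    {
        release_sock(sk);
        lock_sock(sk);
    }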
From: Ingo Molnar on 30 Nov 2006 02:40

* David Miller <davem@davemloft.net> wrote:

> > furthermore, the tweak allows the shifting of processing from a
> > prioritized process context into a highest-priority softirq context.
> > (it's not proven that there is any significant /net win/ in
> > performance: all that was proven is that if we shift TCP processing
> > from process context into softirq context, then TCP throughput of
> > that otherwise penalized process context increases.)
>
> If we preempt with any packets in the backlog, we send no ACKs, the
> sender cannot send, and thus the pipe empties.  That's the problem;
> this has nothing to do with scheduler priorities or stuff like that,
> IMHO.  The argument goes that if the reschedule is delayed long
> enough, the ACKs will be delayed beyond the round trip time and
> trigger retransmits, which will absolutely kill performance.

yes, but I disagree a bit with the characterisation of the problem. The
question, in my opinion, is: how is TCP processing prioritized for this
particular socket, which is attached to the process context that was
preempted?

normally quite a bit of TCP processing happens in a softirq context (in
fact most of it happens there), and softirq contexts have no fairness
whatsoever - they preempt whatever processing is going on, regardless
of any priority preferences of the user!

what was observed here were the effects of completely throttling TCP
processing for a given socket. I think such throttling can in fact be
desirable: there is a /reason/ why the process context was preempted.
In that load scenario there was 10 times more processing requested from
the CPU than it could possibly service. It's a serious overload
situation, and it's the scheduler's task to prioritize between
workloads!

normally this kind of "throttling" of the TCP stack for a particular
socket does not happen. Note that no performance is lost: we don't do
TCP processing because there are /9 other tasks for this CPU to run/,
and the scheduler has a tough choice.

Now I agree that there are more intelligent and less intelligent ways
to throttle, but the notion of allowing a given workload to 'steal' CPU
time from other workloads by pushing its processing into a softirq is,
I think, unfair. (and this issue is partially addressed by my softirq
threading patches in -rt :-)

        Ingo