From: Mayan Moudgill on 5 Oct 2009 12:29

Morten Reistad wrote:
> I have a strong stomach feeling there is something going on
> regarding l2 cache hit rate.

Can you get your hands on the TLB miss rate somehow? It's a very poor proxy for L2 cache misses, but it can sometimes give you insight.
From: nmm1 on 5 Oct 2009 12:34

In article <gefpp6-9p2.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>The interrupt-coalescing code helps bring the interrupt rate
>down by an order of magnitude, so the interrupt rate is not
>a showstopper anymore.
>
>I have a strong stomach feeling there is something going on
>regarding l2 cache hit rate.

They are often related. The most common interrupt problem is that the cache or TLB pollution is sufficiently bad that a critical event doesn't get handled in time, though that can also happen without interrupts being involved. If that happens, the code very often switches from the fast path into a slow one, which can cause its response to be delayed by much more than its request. That can then cause the next level to miss an event, and so the phenomenon builds up.

Regards,
Nick Maclaren.
From: Terje Mathisen on 6 Oct 2009 01:38

Morten Reistad wrote:
> However, all of the applications just open general udp sockets,
> and read the mass of udp packets arriving. Which cpu that services
> any read request should be pretty random, but the packet handling
> needs to look up a few things, and will therefore hit cache locations
> that were, with odds of 12:1 or somesuch, last in use by another
> cpu.

If these lookup locations are read-only, then there's nothing stopping all 12 cores from having private copies in L1/L2. Except for the wasted space due to duplication, this should be OK.

OTOH, if you update _anything_ within that common block, then all bets are off.

> This may explain why we get such tremendous boosts with hyperchannel,
> or keeping all the cores on one socket. On the two-socket machine
> without hyperchannel that we tested, it proved essential for
> speed to put all the packet processing (rtp and linux kernel) on
> one cpu socket, and everything else on another.

Packet processing could well have false sharing, where independent streams of data cause updates to data structures co-located in the same cache lines. OTOH, this must be a problem on _any_ large smp cluster, so I would hope the os architects are working on making sure such kernel structures are separated by at least a cache line.

> This is why I want to see some cpu counters for cache misses.

The emon counters should be able to give you numbers for evictions/contention as well.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Anne & Lynn Wheeler on 5 Oct 2009 14:42

Morten Reistad <first(a)last.name> writes:
> The interrupt-coalescing code helps bring the interrupt rate
> down by an order of magnitude, so the interrupt rate is not
> a showstopper anymore.
>
> I have a strong stomach feeling there is something going on
> regarding l2 cache hit rate.

"811" (i.e. March 1978) architecture allowed for stacking ending status on a queue ... showed up with 370-xa in 3081s in the early 80s ... as well as placing outgoing requests on a queue. The scenario of immediately taking an interrupt had been so that the resource could be redriven with any pending requests, minimizing I/O resource idle time; "811" addressed both I/O interrupts trashing the cache hit ratio and the requirement for synchronous processor participation in the I/O "redrive". Part of "811" was a hardware interface for tracking busy & idle (for things like capacity planning) ... which had previously been done by software when the kernel was involved in interrupts & redrive.

start subchannel ("811" instruction for queuing i/o request)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/14.3.9?SHELF=DZ9ZBK03&DT=20040504121320
set channel monitor ("811" instruction for measurement & statistics)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/14.3.8?SHELF=DZ9ZBK03&DT=20040504121320
test pending interruption ("811" instruction for i/o completion w/o interrupt)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/14.3.8?SHELF=DZ9ZBK03&DT=20040504121320

before 811 ... early 70s, i had modified my resource manager (on plain vanilla 370s) to monitor the interrupt rate and dynamically switch from running enabled for i/o interrupts to disabled for i/o interrupts, doing only a periodic "batch" drain of interrupts ... attempting to preserve cache locality. a two-processor smp was offered for some models of 370 where only one of the processors had i/o capability.
I had done an internal version of SMP support that was deployed in some places ... where I actually got higher aggregate MIP thruput than two single processors. Normally for a two-processor 370 SMP, the clock was slowed by 10% to provide headroom for the processor caches to listen for cache invalidates (from the other cache) ... this resulted in a two-processor SMP having nominally 1.8 times the thruput of a single processor (handling of any cache-invalidate signals slowed it down further ... and any software SMP overhead slowed the two-processor SMP even further ... compared to two single-processor machines). In any case, with lots of tuning of SMP pathlengths ... and tweaks of how I/O and I/O interrupts were handled ... I got a two-processor SMP configuration up to better than twice the thruput of a single-processor machine (the rule-of-thumb at the time said it should have only 1.3-1.5 times the thruput) ... basically because of preserving the cache hit ratio.

another part of "811" architecture was to eliminate the overhead of passing thru the kernel for subsystem (demon) calls by applications. basically a hardware table was defined with address space pointer and privileges for subsystems. application calls to subsystems then became very much like a simple application call (for something in the application's address space). The api tended to be pointer passing ... so part of the interface was having alternate address space pointers ... and instructions where subsystems could directly access parameter values (indicated by a passed pointer) back in the application address space.

part of that 811 architecture description in current 64-bit
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/5.7?SHELF=DZ9ZBK03&DT=20040504121320
"program call" instruction
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/10.34?SHELF=DZ9ZBK03&DT=20040504121320

--
40+yrs virtualization experience (since Jan68), online at home since Mar1970
From: Morten Reistad on 5 Oct 2009 15:06
In article <had760$vhk$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <gefpp6-9p2.ln1(a)laptop.reistad.name>,
>Morten Reistad <first(a)last.name> wrote:
>>
>>The interrupt-coalescing code helps bring the interrupt rate
>>down by an order of magnitude, so the interrupt rate is not
>>a showstopper anymore.
>>
>>I have a strong stomach feeling there is something going on
>>regarding l2 cache hit rate.
>
>They are often related. The most common interrupt problem is that
>the cache or TLB pollution is sufficiently bad that a critical
>event doesn't get handled in time, though that can also happen
>without interrupts being involved. If that happens, the code very
>often switches from the fast path into a slow one, which can cause
>its response to be delayed by much more than its request. That can
>then cause the next level to miss an event, and so the phenomenon
>builds up.

We push this while watching packet loss, and there is not much packet loss at all. We measure a few streams closely, watch the loss and interpacket jitter, and generate a synthetic MOS value. As soon as this MOS value falls below 4.0 we stop adding more streams. We rarely have problems with packet loss at all in these tests; we just have to stop adding streams because the jitter goes too high. If we keep adding load anyway, we see loss of around 10% further out.

With the interrupt-coalescing code, losing interrupts on a steady stream is not a problem, as long as we transmit or receive every 25 interrupts or so.

This is why I want to see actual CPU interrupts and cache misses, separate from what the Linux drivers tell me.

--
mrr