From: Mayan Moudgill on 5 Oct 2009 12:29

Morten Reistad wrote:
> I have a strong stomach feeling there is something going on
> regarding l2 cache hit rate.

Can you get your hands on the TLB miss rate somehow? It's a very poor proxy for L2 cache misses, but it can sometimes give you insight.
From: nmm1 on 5 Oct 2009 12:34

In article <gefpp6-9p2.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>The interrupt-coalescing code helps bring the interrupt rate
>down by an order of magnitude, so the interrupt rate is not
>a showstopper anymore.
>
>I have a strong stomach feeling there is something going on
>regarding l2 cache hit rate.

They are often related. The most common interrupt problem is that the cache or TLB pollution is sufficiently bad that a critical event doesn't get handled in time, though that can also happen without interrupts being involved. If that happens, the code very often switches from the fast path into a slow one, which can cause its response to be delayed by much more than its request. That can then cause the next level to miss an event, and so the phenomenon builds up.

Regards,
Nick Maclaren.
From: Terje Mathisen on 6 Oct 2009 01:38

Morten Reistad wrote:
> However, all of the applications just open general udp sockets,
> and read the mass of udp packets arriving. Which cpu that services
> any read request should be pretty random, but the packet handling
> needs to look up a few things, and will therefore hit cache locations
> that were, with odds of 12:1 or somesuch, last in use by another
> cpu.

If these lookup locations are read-only, then there's nothing stopping all 12 cores from having private copies in L1/L2. Except for the wasted space due to duplication, this should be OK.

OTOH, if you update _anything_ within that common block, then all bets are off.

> This may explain why we get such tremendous boosts with hyperchannel,
> or keeping all the cores on one socket. On the two-socket machine
> without hyperchannel that we tested, it proved essential for
> speed to put all the packet processing (rtp and linux kernel) on
> one cpu socket, and everything else on another.

Packet processing could well have false sharing, where independent streams of data cause updates to data structures co-located in the same cache lines. OTOH, this must be a problem on _any_ large smp cluster, so I would hope the os architects are working on making sure such kernel structures are separated by at least a cache line.

> This is why I want to see some cpu counters for cache misses.

The emon counters should be able to give you numbers for evictions/contention as well.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Anne & Lynn Wheeler on 5 Oct 2009 14:42

Morten Reistad <first(a)last.name> writes:
> The interrupt-coalescing code helps bring the interrupt rate
> down by an order of magnitude, so the interrupt rate is not
> a showstopper anymore.
>
> I have a strong stomach feeling there is something going on
> regarding l2 cache hit rate.

"811" (i.e. March 1978) architecture allowed for stacking ending status on a queue ... showed up with 370-xa in 3081s in the early 80s ... as well as placing outgoing requests on a queue. The scenario of immediately taking an interrupt had been so that the resource could be redriven with any pending requests, minimizing I/O resource idle time; "811" addressed both I/O interrupts trashing the cache hit ratio and the requirement for synchronous processor participation in the I/O "redrive". Part of "811" was a hardware interface for tracking busy & idle (for things like capacity planning) ... which had previously been done by software when the kernel was involved in interrupts & redrive.

start subchannel ("811" instruction for queuing i/o request)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/14.3.9?SHELF=DZ9ZBK03&DT=20040504121320
set channel monitor ("811" instruction for measurement & statistics)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/14.3.8?SHELF=DZ9ZBK03&DT=20040504121320
test pending interruption ("811" instruction for i/o completion w/o interrupt)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/14.3.8?SHELF=DZ9ZBK03&DT=20040504121320

before 811 ... early 70s, i had modified my resource manager (on plain vanilla 370s) to monitor the interrupt rate and dynamically switch from running enabled for i/o interrupts to disabled for i/o interrupts, doing only a periodic "batch" drain of interrupts ... attempting to preserve cache locality. a two-processor smp was offered for some models of 370 where only one of the processors had i/o capability.
I had done an internal version of SMP support that was deployed in some places ... where I actually got higher aggregate MIP thruput than two single processors. Normally for a two-processor 370 SMP, the clock was slowed by 10% to provide headroom for the processor caches to listen for cache invalidates (from the other cache) ... this resulted in a two-processor SMP having nominally 1.8 times the thruput of a single processor (handling of any cache-invalidate signals slowed it down further ... and any software SMP overhead slowed the two-processor SMP even further ... compared to two single-processor machines). In any case, with lots of tuning of SMP pathlengths ... and tweaks of how I/O and I/O interrupts were handled ... I got a two-processor SMP configuration up to better than twice the thruput of a single-processor machine (the rule-of-thumb at the time said it should have only 1.3-1.5 times the thruput) ... basically because of preserving the cache hit ratio.

another part of "811" architecture was to eliminate the overhead of passing thru the kernel for subsystem (demon) calls by applications. basically a hardware table was defined with address space pointer and privileges for subsystems. application calls to subsystems then became very much like a simple application call (for something in the application's address space). The api tended to be pointer passing ... so part of the interface was having alternate address space pointers ... and instructions where subsystems could directly access parameter values (indicated by a passed pointer) back in the application address space.

part of that 811 architecture description in current 64-bit
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/5.7?SHELF=DZ9ZBK03&DT=20040504121320
"program call" instruction
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/10.34?SHELF=DZ9ZBK03&DT=20040504121320

--
40+yrs virtualization experience (since Jan68), online at home since Mar1970
From: Morten Reistad on 5 Oct 2009 15:06
In article <had760$vhk$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <gefpp6-9p2.ln1(a)laptop.reistad.name>,
>Morten Reistad <first(a)last.name> wrote:
>>
>>The interrupt-coalescing code helps bring the interrupt rate
>>down by an order of magnitude, so the interrupt rate is not
>>a showstopper anymore.
>>
>>I have a strong stomach feeling there is something going on
>>regarding l2 cache hit rate.
>
>They are often related. The most common interrupt problem is that
>the cache or TLB pollution is sufficiently bad that a critical
>event doesn't get handled in time, though that can also happen
>without interrupts being involved. If that happens, the code very
>often switches from the fast path into a slow one, which can cause
>its response to be delayed by much more than its request. That can
>then cause the next level to miss an event, and so the phenomenon
>builds up.

We push this while watching packet loss, and there is not much packet loss at all. We measure a few streams closely, watch the loss and interpacket jitter, and generate a synthetic MOS value. As soon as this MOS value falls below 4.0 we stop adding more streams. We rarely have problems with packet loss at all in these tests; we just have to stop adding streams because the jitter goes too high. If we keep adding load anyway, we see loss of around 10% further out.

With the interrupt-coalescing code, losing interrupts on a steady stream is not a problem, as long as we transmit or receive every 25 interrupts or so.

This is why I want to see actual CPU interrupts and cache misses, separate from what the Linux drivers tell me.

--
mrr