From: Morten Reistad on 4 Oct 2009 08:03

In article <ha9svv$bs0$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <5g6mp6-0p1.ln1(a)laptop.reistad.name>,
>Morten Reistad <first(a)last.name> wrote:
>>In article <7iknkoF31ud7sU1(a)mid.individual.net>,
>>
>>Speaking of which, are there any firm figures from putting Linux, the
>>BSDs, Irix, Solaris etc. through the hoops on a machine with a processor
>>count in the upper three digits, with independent load processes where
>>the OS has to do the MP handling, e.g. sockets, pipes and semaphores in
>>monolithic processes?
>
>To the best of my knowledge, no.  I am one of the few people with any
>experience of even 64+ core systems, and the maximum I have personal
>experience with is 264, but that was Hitachi OSF/1 on an SR2201 (which
>was distributed memory, and VERY unlike those systems you mention).
>Beyond that, it's 72 (Solaris on a SunFire) and 64 (IRIX on an Origin).
>There are very, very few large SMPs anywhere.

So far, we have tested lots of 16-way machines (mostly Xeons in HP3x0
packages, or IBM xSeries 34x).  The first bottleneck seems to be memory,
because the applications (media servers of all kinds) get such a tremendous
boost from HyperTransport and larger caches.  Cache coherency under Linux
seems like a brute-force endeavour as soon as we go past 4 processors or so.

We mostly tested Linux, but installed FreeBSD for a different view.
FreeBSD handles more interrupts than Linux, but the balancing results seem
odd.  It also seems to lock the kernel a bit more.  The memory footprint is
also slightly smaller.

Next on the performance list is I/O, and interrupt/DMA scheduling in
particular.  The interrupt coalescing fixes in 2.6.24 seem to help a lot.

>>I am looking for figures on OS performance itself, not user space.
>>
>>The reason I am asking is that I have been involved in a lot of testing
>>of application performance lately, and it seems to me we are measuring
>>the performance of BSD and Linux, not the application itself.
>
>That fails to surprise me.  It took me 3 weeks of 80-hour weeks to
>get the SGI Origin usable, because of an issue that was in the area you
>are talking about.  Having resolved that, it was fine.
>
>With that experience, the SunFire was a LOT easier - when we hit a
>performance issue of the form you describe, I knew exactly what to
>do to resolve it.  Sun were less convinced, but it was one of the
>things on their 'to try' list.

The things I am specifically looking out for are cache coherency and
interrupt handling.

-- mrr
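A minimal sketch of the kind of microbenchmark that isolates that OS-side
cost, assuming Linux and glibc; names and core numbers are illustrative.
Two processes ping-pong one byte over a pair of pipes, so nearly all of
the measured time is kernel IPC and scheduling rather than user code:

/* Pipe ping-pong between two pinned processes.  Nearly every cycle is
 * spent in the kernel: two system calls and a context switch per
 * direction. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

int main(void)
{
    int ab[2], ba[2];                 /* one pipe per direction */
    char byte = 0;
    const int iters = 100000;

    if (pipe(ab) != 0 || pipe(ba) != 0) {
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {                /* child: echo everything back */
        pin_to_cpu(1);
        for (int i = 0; i < iters; i++) {
            read(ab[0], &byte, 1);
            write(ba[1], &byte, 1);
        }
        _exit(0);
    }

    pin_to_cpu(0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        write(ab[1], &byte, 1);
        read(ba[0], &byte, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per round trip\n", ns / iters);
    return 0;
}

Pinning the two ends to cores that share a cache, versus cores that do
not, gives a first look at the coherency cost discussed above.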
From: nmm1 on 4 Oct 2009 08:23

In article <ugfmp6-vu.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>So far, we have tested lots of 16-way machines (mostly Xeons in HP3x0
>packages, or IBM xSeries 34x).  The first bottleneck seems to be memory,
>because the applications (media servers of all kinds) get such a
>tremendous boost from HyperTransport and larger caches.  Cache coherency
>under Linux seems like a brute-force endeavour as soon as we go past 4
>processors or so.

'Tain't just Linux.  That's generally the main bottleneck.

>We mostly tested Linux, but installed FreeBSD for a different view.
>FreeBSD handles more interrupts than Linux, but the balancing results
>seem odd.  It also seems to lock the kernel a bit more.  The memory
>footprint is also slightly smaller.

Interesting.

>Next on the performance list is I/O, and interrupt/DMA scheduling in
>particular.  The interrupt coalescing fixes in 2.6.24 seem to help a lot.

Gug.  Or, more precisely, gug, gug, gug :-(

That is NOT a nice area, not at all.  I have no direct experience of
Linux in that regard, but doubt that it is much different from IRIX
and Solaris.  The details will differ, of course.

>The things I am specifically looking out for are cache coherency and
>interrupt handling.

Do you want a consultant?  :-)

More seriously, those were precisely the areas that caused me such
problems.  At one stage, I locked the Origin up so badly that I
couldn't power cycle it from the control panel, and had to flip the
breakers on each rack.  One of the keys is to separate the interrupt
handling from the parallel applications - and I mean on separate cores.

Regards,
Nick Maclaren.
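A minimal sketch of the core separation Nick recommends, assuming Linux:
steer a device IRQ onto one core through /proc and keep the application
off that core.  The IRQ number (24) and the core assignments are
illustrative; the real numbers come from /proc/interrupts.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* 1. Steer IRQ 24 to CPU 0 only.  Needs root, and a running
     *    irqbalance daemon may undo the setting later. */
    FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
    if (f != NULL) {
        fputs("1\n", f);              /* hex CPU mask: CPU 0 only */
        fclose(f);
    } else {
        perror("smp_affinity");
    }

    /* 2. Confine this process - the parallel application - to
     *    CPUs 1-3, away from the interrupt core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 1; cpu <= 3; cpu++)
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* ... run the worker threads here ... */
    return 0;
}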
From: Morten Reistad on 4 Oct 2009 09:34

In article <haa434$ts8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <ugfmp6-vu.ln1(a)laptop.reistad.name>,
>Morten Reistad <first(a)last.name> wrote:
>>
>>So far, we have tested lots of 16-way machines (mostly Xeons in HP3x0
>>packages, or IBM xSeries 34x).  The first bottleneck seems to be memory,
>>because the applications (media servers of all kinds) get such a
>>tremendous boost from HyperTransport and larger caches.  Cache coherency
>>under Linux seems like a brute-force endeavour as soon as we go past 4
>>processors or so.
>
>'Tain't just Linux.  That's generally the main bottleneck.

I kinda knew that.  But I was still surprised at HOW big the differences
were.  An 8-way Xeon machine with 24 MB of L2 cache handles 2800 streams;
a 16-way machine with 64 MB of L2 cache handles 22000 streams.  Same
processor, same clock rate, same PCI busses; 4-way hyperchannel instead
of 1-way, and one extra south bridge.

Are the memory and cache access counters on the Xeons accessible from a
Linux environment?

>>We mostly tested Linux, but installed FreeBSD for a different view.
>>FreeBSD handles more interrupts than Linux, but the balancing results
>>seem odd.  It also seems to lock the kernel a bit more.  The memory
>>footprint is also slightly smaller.
>
>Interesting.
>
>>Next on the performance list is I/O, and interrupt/DMA scheduling in
>>particular.  The interrupt coalescing fixes in 2.6.24 seem to help a lot.
>
>Gug.  Or, more precisely, gug, gug, gug :-(
>
>That is NOT a nice area, not at all.  I have no direct experience of
>Linux in that regard, but doubt that it is much different from IRIX
>and Solaris.  The details will differ, of course.

The tuning of the Linux IRQ/DMA balancer is somewhere between witchcraft
and black magic.  You can nudge it so it performs well, but on the next
boot it misperforms.  It seems to need a few billion interrupts to
actually get a good picture of where the interrupts are likely to land.

>>The things I am specifically looking out for are cache coherency and
>>interrupt handling.
>
>Do you want a consultant?  :-)
>
>More seriously, those were precisely the areas that caused me such
>problems.  At one stage, I locked the Origin up so badly that I
>couldn't power cycle it from the control panel, and had to flip the
>breakers on each rack.  One of the keys is to separate the interrupt
>handling from the parallel applications - and I mean on separate cores.

This is one thing Linux does very well.  But we see that the actual
user-mode code takes a very small amount of CPU, unless we are
transcoding.

I have tested media with SER/rtpproxy, Asterisk, and various RTP code
written in-house.  It makes a huge difference to do simple RTP NAT/mixing
in a kernel driver, either as iptables programming or as custom code.

- mrr
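On the counter question: a minimal sketch, assuming a kernel new enough
to have perf_event_open (merged in 2.6.31, so slightly after the 2.6.24
kernels discussed here; in that era the usual routes were oprofile or
perfmon2).  It counts last-level cache misses around a dummy workload:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int perf_open(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 0;          /* kernel time matters here */
    /* no glibc wrapper exists for this syscall */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd = perf_open(PERF_COUNT_HW_CACHE_MISSES);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* dummy workload: touch one byte per cache line in a buffer
     * larger than any L2 */
    enum { N = 64 * 1024 * 1024 };
    static volatile char buf[N];

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < N; i += 64)
        buf[i]++;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %lld\n", misses);
    return 0;
}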
From: Terje Mathisen on 4 Oct 2009 13:16

Morten Reistad wrote:
> In article <haa434$ts8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>> 'Tain't just Linux.  That's generally the main bottleneck.
>
> I kinda knew that.  But I was still surprised at HOW big the differences
> were.  An 8-way Xeon machine with 24 MB of L2 cache handles 2800 streams;
> a 16-way machine with 64 MB of L2 cache handles 22000 streams.  Same
> processor, same clock rate, same PCI busses; 4-way hyperchannel instead
> of 1-way, and one extra south bridge.

That's almost an order of magnitude...

The small machine had 3 MB of L2 per core, while the big one had 4 MB
for each.

Did you stumble over the edge of a 3.5 MB working-set cliff?

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
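A minimal sketch of how one might probe for such a cliff, assuming Linux
timers; buffer sizes and stride are illustrative.  Time a fixed number of
strided accesses over buffers of growing size and look for the jump in
ns/access when the buffer no longer fits in L2:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const long accesses = 1L << 24;

    for (long size = 1L << 20; size <= 1L << 26; size <<= 1) { /* 1-64 MB */
        volatile char *buf = malloc(size);
        if (!buf) return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        long idx = 0;
        for (long i = 0; i < accesses; i++) {
            buf[idx]++;
            idx = (idx + 64) % size;   /* one access per cache line */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%3ld MB: %.2f ns/access\n", size >> 20, ns / accesses);
        free((void *)buf);
    }
    return 0;
}

Hardware prefetch softens the edge for a sequential stride like this; a
random pointer chase over the same buffer shows the cliff more sharply.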
From: Morten Reistad on 4 Oct 2009 15:45
In article <LaKdnRhs_a_zRVXXnZ2dnUVZ8lydnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>Morten Reistad wrote:
>> In article <haa434$ts8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>>> 'Tain't just Linux.  That's generally the main bottleneck.
>>
>> I kinda knew that.  But I was still surprised at HOW big the differences
>> were.  An 8-way Xeon machine with 24 MB of L2 cache handles 2800 streams;
>> a 16-way machine with 64 MB of L2 cache handles 22000 streams.  Same
>> processor, same clock rate, same PCI busses; 4-way hyperchannel instead
>> of 1-way, and one extra south bridge.
>
>That's almost an order of magnitude...
>
>The small machine had 3 MB of L2 per core, while the big one had 4 MB
>for each.
>
>Did you stumble over the edge of a 3.5 MB working-set cliff?

I just cannot see what it is that has such effects per CPU.  We have seen
similar effects (an order of magnitude) with and without hyperchannel
linkage between the caches.

This class of problem is one where we run large numbers of identical,
simple tasks, and the multiprogramming is done by the kernel and common
driver software.  There was obviously some edge there, but we are very
clearly measuring Linux, not the application, because we can swap the
application between SER+rtpproxy, Asterisk and the Yate proxy with very
little impact on the observed numbers.

The user-mode code uses around 4-6% of the CPU time; about twice that is
used for task switching, and twice that again in interrupt service mode.
Linux 2.6.24 made a huge difference in how much interrupt load an MP
setup can take, thanks to the interrupt coalescing code in the drivers.
That brought the interrupt overhead down considerably.

The applications are coded using all three main APIs: select(), poll()
and monster numbers of synchronous threads.  They behave equally well;
the differences are too small to be significant.

These tasks are just bridging RTP streams of 160 octets payload, 24
octets RTP, 8 octets UDP, 20 IP and 16 Ethernet (2 extra for alignment):
228-octet frames, 50 per second.  In the IP header the TOS, TTL, source
and destination addresses plus the header checksum are changed; UDP sees
ports and checksum change; and RTP sees sequence, timestamp and SSRC
change.

I am working on a kernel driver for this substitution, so I can put it
directly in the routing code and avoid all the excursions into user mode.
But I would like to see what the memory caches really are doing before I
start optimising.

-- mrr
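For concreteness, a minimal user-space sketch of the substitution
described above, following the post's frame layout (16 octets Ethernet
plus pad, 20 IP, 8 UDP, then RTP).  The session struct and the constants
are illustrative, not an existing API; a real kernel driver would hook
this into the forwarding path instead.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htons/htonl */

struct rtp_session {     /* hypothetical per-call state; addresses and
                          * ports are kept in network byte order */
    uint32_t new_src_ip, new_dst_ip;
    uint16_t new_src_port, new_dst_port;
    uint32_t new_ssrc;
    uint16_t seq_out;
    uint32_t ts_out;
};

static uint16_t ip_checksum(const uint8_t *hdr, int len)
{
    uint32_t sum = 0;
    for (int i = 0; i < len; i += 2)
        sum += (hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

void rewrite_frame(uint8_t *frame, struct rtp_session *s)
{
    uint8_t *ip  = frame + 16;       /* past Ethernet + 2 pad octets */
    uint8_t *udp = ip + 20;
    uint8_t *rtp = udp + 8;

    /* IP: TOS, TTL, addresses, then recompute the header checksum */
    ip[1] = 0xb8;                    /* TOS: DSCP EF, usual for media */
    ip[8] = 64;                      /* fresh TTL */
    memcpy(ip + 12, &s->new_src_ip, 4);
    memcpy(ip + 16, &s->new_dst_ip, 4);
    ip[10] = ip[11] = 0;
    uint16_t csum = ip_checksum(ip, 20);
    ip[10] = csum >> 8;
    ip[11] = csum & 0xff;

    /* UDP: ports; zero checksum = "not computed", legal for IPv4 */
    memcpy(udp + 0, &s->new_src_port, 2);
    memcpy(udp + 2, &s->new_dst_port, 2);
    udp[6] = udp[7] = 0;

    /* RTP: sequence (offset 2), timestamp (4), SSRC (8) */
    uint16_t seq = htons(s->seq_out++);
    uint32_t ts  = htonl(s->ts_out);
    s->ts_out += 160;                /* 160 samples = 20 ms of G.711 */
    memcpy(rtp + 2, &seq, 2);
    memcpy(rtp + 4, &ts, 4);
    memcpy(rtp + 8, &s->new_ssrc, 4);
}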