From: Benny Amorsen on 4 Oct 2009 16:16

Morten Reistad <first(a)last.name> writes:

> The applications are coded using all three main apis, select(),
> poll() and monster numbers of synchronous threads. They behave
> equally well, differences are too small to be significant.

I am a little bit surprised that they behave equally well. Asterisk (the
only one I have looked at) seems to make an extra system call per packet
according to strace, and I would have expected that to have an impact.

> I am working on a kernel driver for this substitution, so I
> can put it directly in the routing code, and avoid all the
> excursions into user mode.

It seems like the splice system call ought to be able to do this, but I
don't think it works for UDP, and it probably isn't good for small
payloads like this. Conceptually it seems like the right path...


/Benny
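For reference, the user-mode pattern under discussion looks roughly like the
sketch below: one select(), one recvfrom() and one sendto() per RTP packet,
so at least three kernel crossings per small frame. This is an illustrative
minimal relay, not code from Asterisk or any of the applications in the
thread; the ports and addresses are placeholders.

/* Minimal sketch (illustrative, not from the thread) of a user-mode RTP
 * relay: one select(), one recvfrom() and one sendto() per packet,
 * i.e. at least three kernel crossings per small RTP frame. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in local, peer_a, peer_b;
    char buf[512];                       /* RTP payload plus headers fit easily */

    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(10000);       /* placeholder media port */
    if (bind(s, (struct sockaddr *)&local, sizeof local) < 0) {
        perror("bind");
        return 1;
    }

    memset(&peer_b, 0, sizeof peer_b);
    peer_b.sin_family = AF_INET;
    peer_b.sin_port = htons(10002);      /* placeholder far-end port */
    inet_pton(AF_INET, "192.0.2.2", &peer_b.sin_addr);

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(s, &rfds);
        if (select(s + 1, &rfds, NULL, NULL, NULL) < 0)   /* syscall 1 */
            break;

        socklen_t alen = sizeof peer_a;
        ssize_t n = recvfrom(s, buf, sizeof buf, 0,
                             (struct sockaddr *)&peer_a, &alen); /* syscall 2 */
        if (n <= 0)
            continue;

        /* A real relay would rewrite RTP sequence/timestamp/SSRC here. */
        sendto(s, buf, n, 0,
               (struct sockaddr *)&peer_b, sizeof peer_b);       /* syscall 3 */
    }
    close(s);
    return 0;
}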
From: Morten Reistad on 4 Oct 2009 17:30

In article <m3r5ti3oft.fsf(a)ursa.amorsen.dk>,
Benny Amorsen <benny+usenet(a)amorsen.dk> wrote:
>Morten Reistad <first(a)last.name> writes:
>
>> The applications are coded using all three main apis, select(),
>> poll() and monster numbers of synchronous threads. They behave
>> equally well, differences are too small to be significant.
>
>I am a little bit surprised that they behave equally well. Asterisk (the
>only one I have looked at) seems to make an extra system call per packet
>according to strace, and I would have expected that to have an impact.

Asterisk actually performs best. But the user mode code represents
less than 1/20th of the cpu time expended, so user mode optimisations
will not have much impact.

As I said in an earlier posting, 1/20th is used in user mode, 1/10th
in task switching, 1/4th in interrupt code (800 megabit two-way in
small-packet mode) and the remaining 2/3 inside the Linux kernel.

>> I am working on a kernel driver for this substitution, so I
>> can put it directly in the routing code, and avoid all the
>> excursions into user mode.
>
>It seems like the splice system call ought to be able to do this, but I
>don't think it works for UDP, and it probably isn't good for small
>payloads like this. Conceptually it seems like the right path...

The bottleneck here isn't in user mode code at all. That was why we
tried FreeBSD as a test. It was not much different. Somewhat tighter
code and somewhat coarser locks, but not that big a difference.

-- mrr
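That kind of user/kernel/interrupt breakdown can be approximated on Linux by
sampling /proc/stat, whose aggregate "cpu" line splits jiffies into user,
nice, system, idle, iowait, irq and softirq buckets (field order as in
proc(5)). A minimal sketch, assuming a 10-second measurement window:

/* Sketch: sample the aggregate "cpu" line of /proc/stat twice and show
 * how the elapsed jiffies split between user, kernel, irq and softirq. */
#include <stdio.h>
#include <unistd.h>

struct cpu_sample {
    unsigned long long user, nice, sys, idle, iowait, irq, softirq;
};

static int read_cpu(struct cpu_sample *s)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &s->user, &s->nice, &s->sys, &s->idle,
                   &s->iowait, &s->irq, &s->softirq);
    fclose(f);
    return n == 7 ? 0 : -1;
}

int main(void)
{
    struct cpu_sample a, b;
    if (read_cpu(&a))
        return 1;
    sleep(10);                           /* measure a 10-second window */
    if (read_cpu(&b))
        return 1;

    unsigned long long user = b.user - a.user, sys = b.sys - a.sys;
    unsigned long long irq = b.irq - a.irq, sirq = b.softirq - a.softirq;
    unsigned long long idle = b.idle - a.idle, iow = b.iowait - a.iowait;
    unsigned long long total = user + (b.nice - a.nice) + sys
                             + idle + iow + irq + sirq;

    printf("user %.1f%%  kernel %.1f%%  irq %.1f%%  softirq %.1f%%\n",
           100.0 * user / total, 100.0 * sys / total,
           100.0 * irq / total, 100.0 * sirq / total);
    return 0;
}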
From: nmm1 on 4 Oct 2009 18:33

In article <qrkmp6-pbh.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>I kinda knew that. But I was still surprised at HOW big the results were.
>8-way Xeon machine with 24 MB L2 cache handles 2800 streams, 16-way machine
>with 64 MB L2 cache handles 22000 streams. Same processor, same clock rate.
>Same PCI busses, 4-way hyperchannel instead of 1-way; and one extra south
>bridge.

That fails to surprise me. My standard recommendation is that Intel
can handle 2 sockets but not 4, and AMD 4 but not 8.

>Are the memory and cache access counters on the Xeons accessible from a
>Linux environment?

As far as I know, they are still "work in progress" except for the
Itanium. Part of the problem is that Intel and AMD won't disclose
the interfaces.

>The Linux irq/dma balancer tuning is somewhere between witchcraft and
>black magic. You can nudge it so it performs well, but on the next boot
>it misperforms. It seems it needs a few billion interrupts to actually
>get a good picture of where the interrupts are likely to be.

It's not something I looked at, but that doesn't surprise me, either.

>>One of the keys is to separate the
>>interrupt handling from the parallel applications - and I mean on
>>separate cores.
>
>This is one thing Linux does very well. But we see that the actual
>user mode code takes very small amounts of cpu, unless we are transcoding.

Then what you want to do is to separate the kernel threads from the
interrupt handling, and I doubt that you can.

Interestingly, that is where I had the main problems with Solaris -
in the untuned state, a packet could take over a second to get from
the user code to the device. And that was on a 72-CPU system with
one (count it, one) user process running. God alone knows what
happened to it in between.

>I have tested media with SER/rtpproxy, Asterisk, and various RTP code
>written in-house.
>
>It makes a huge difference to do simple RTP NAT/mixing in a kernel
>driver, either as iptables programming or as custom code.

Yes. There was a time when you could transfer from SunOS to HP-UX
at 4 times the speed of the reverse direction (or the other way
round - I forget).


Regards,
Nick Maclaren.
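One way to take the irq balancer out of the equation and keep interrupt
handling and the application on separate cores is to pin them by hand:
steer the NIC's interrupt via /proc/irq/<N>/smp_affinity and restrict the
media process with sched_setaffinity(). A minimal sketch follows; the IRQ
number and CPU masks are placeholders, and writing smp_affinity needs root.

/* Sketch: pin a NIC interrupt to core 0 and the calling process to
 * cores 2-3, so interrupt handling and the application never share a
 * core.  IRQ number and masks are placeholders for illustration. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_irq(int irq, unsigned int cpu_mask)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", cpu_mask);        /* hex bitmap of allowed CPUs */
    fclose(f);
    return 0;
}

int main(void)
{
    /* 1. Steer IRQ 24 (placeholder for the NIC's IRQ) to CPU 0 only. */
    if (pin_irq(24, 0x1) < 0)
        perror("smp_affinity (needs root)");

    /* 2. Keep this process off CPU 0; allow CPUs 2 and 3 only. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof set, &set) < 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the media-handling loop here ... */
    return 0;
}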
From: Kim Enkovaara on 5 Oct 2009 01:51

nmm1(a)cam.ac.uk wrote:
> Morten Reistad <first(a)last.name> wrote:
>> Are the memory and cache access counters on the Xeons accessible from a
>> Linux environment?
>
> As far as I know, they are still "work in progress" except for the
> Itanium. Part of the problem is that Intel and AMD won't disclose
> the interfaces.

My understanding is that they are quite well supported. For example,
see the "event type" section in the oprofile documentation
(http://oprofile.sourceforge.net/docs/).

There is also a new tool called perf for Linux, but I have not tried
that yet.

--Kim
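The perf tool is built on the perf_event_open() system call, which a program
can also use directly to read the counters for its own code. A minimal
sketch, assuming a kernel recent enough to expose the syscall (there is no
glibc wrapper, so it goes through syscall(2)); the cache-miss event is just
an illustrative choice:

/* Sketch: count cache misses for a section of code via the
 * perf_event_open() syscall that the perf tool is built on. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* illustrative event */
    attr.disabled = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the packet-processing code to be measured goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof misses) == sizeof misses)
        printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}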
From: Terje Mathisen on 5 Oct 2009 02:13
Morten Reistad wrote:
> In article <LaKdnRhs_a_zRVXXnZ2dnUVZ8lydnZ2d(a)lyse.net>,
> Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>> Did you stumble over the edge of a 3.5 MB working set cliff?
>
> I just cannot see what it is that has such effects per cpu.
> We have seen similar effects (order of magnitude) with and
> without hyperchannel linkage between the caches.
>
> This class of problem is one where the problem is running
> large numbers of identical, simple tasks where the multiprogramming
> is done by the kernel and common driver software.
>
> There was obviously some edge there, but we are very clearly
> measuring Linux, not the application; because we can swap the

OK

[snip]

> The applications are coded using all three main apis, select(),
> poll() and monster numbers of synchronous threads. They behave
> equally well, differences are too small to be significant.
>
> These tasks are just bridging RTP streams of 160 octets payload,
> 24 octets RTP, 8 octets UDP, 20 IP and 16 Ethernet (2 extra for
> alignment); 228 octets frames, 50 per second.
>
> In the IP frame the TOS, TTL, source and destination addresses
> plus header sum are changed, UDP sees ports and checksum change,
> and RTP sees sequence, timestamp and ssid change.

OK

> I am working on a kernel driver for this substitution, so I
> can put it directly in the routing code, and avoid all the
> excursions into user mode.

That will be interesting...

> But I would like to see what the memory caches really are
> doing before I start optimising.

Afaik, there is at least one portable (Linux) library that gives you
access to the performance monitoring counters on several cpu
architectures.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
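One portable library of the kind Terje mentions is PAPI, which wraps the
per-architecture counter interfaces behind preset events. A minimal sketch
using PAPI's classic high-level API; the chosen events are illustrative, and
whether they are available depends on the CPU and on kernel counter support:

/* Sketch: read L2 cache misses and retired instructions around a code
 * section with PAPI's high-level API.  Event availability depends on
 * the CPU and on kernel support for the counters. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_L2_TCM, PAPI_TOT_INS };
    long long values[2];

    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }

    /* ... packet-processing code under test ... */

    if (PAPI_stop_counters(values, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters failed\n");
        return 1;
    }

    printf("L2 misses: %lld  instructions: %lld\n", values[0], values[1]);
    return 0;
}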