From: jacko on 12 Jul 2010 06:08

> What really baffles me is that I know there is a market for LESS
> performance - provided that it comes in a lot less power hungry.
> Not a small market, either. And it could be done.

Well, here's an idea. I need 800 MHz to play DivX, socket 462 of course. I currently do this by underclocking an Athlon which will not run at its rated speed. It needs no fan, just the heatsink. I know it would be possible to have a plug-in replacement that was as powerful or more, but ran cool enough to not even need the heatsink. I'd pay, say, £30 when I need a replacement.
From: Terje Mathisen "terje.mathisen at tmsw.no" on 12 Jul 2010 10:39

Andy Glew wrote:
> I think that Nick may have meant a 5% relative improvement in branch
> prediction, going, e.g., from 94%: improving 5% of (100% - 94% = 6%) is
> 0.30%, for a net branch prediction rate of 94.30%.
>
> But in case anyone cares, even a 0.30% increase in performance is the
> sort of thing Intel and AMD would kill for now.
>
> When I started as a computer architect, my boss said "don't waste
> time on anything that doesn't get a 2X improvement". Then 20%. Now, we
> are scraping the bottom of the barrel looking for 1%ers.

A somewhat related story about barrel scraping:

A few years ago (3?) I was asked (by Rad Game Tools) to optimize a public domain Ogg Vorbis decoder, with the goal of beating the fastest available closed-source implementation.

Since 60-70% of the decoding time was spent in the IMDCT (Inverse Modified Discrete Cosine Transform), the initial suggestion from Jeff Roberts was to simply write a SIMD version of the existing IMDCT (which was believed to be optimal, but written in plain C).

I did that, and made that function about 3X faster, which is a perfectly OK result for a 4-wide vector implementation, since it also carries more overhead in order to gather the input data (swizzles and the like). The problem was that this wasn't enough: I was still 50% slower than the target. :-(

At this point I started to go through every single part of the algorithm looking for both SIMD opportunities (I did find a few) and regular algorithmic optimizations, with the result that after a week or two I got down to 5-10% slower than what I needed to match. One of the big items was a complete rewrite of the Huffman decoder, making it completely table-driven even for longer tokens (multi-level nested tables).

Barrel scraping time: at this point I was nearly desperate. I reconsidered the IMDCT multiple times without finding any problems there, so I started to simply try _everything_: any form of optimization that I had ever tried myself or heard about had to be tested, and the result was that at least 90% of them turned out to make the code slower instead of faster. :-(

Anyway, I kept at it for a few more weeks and finally got to the point where my code had near parity in 32-bit mode (2% slower for stereo and 5% faster for mono), while in 64-bit mode it was about 10-15% faster for all sound samples.

Terje

PS. After I had turned in that barely-faster implementation (it did have a few other good points, mainly that a single binary would run on all x86 SSE versions, and the source code used macros and compiler intrinsics so it could be compiled for both 32- and 64-bit platforms, on any CPU which supported 4-wide SIMD operations), Jeff and Sean Barrett (who had written the public domain code I started from) discovered that the IMDCT was far from optimal after all: it was possible to get rid of almost half the outer loop operations, so after another round of SIMD vectorization my decoder was suddenly 25-30% faster than anything else. :-)

> As y'all know, I think there are big improvements possible in single
> thread CPU performance - 10%ers at least, and NXers in some areas.

The real key imho is that someone has to really believe that improvement is possible, even if it will take a lot of effort to get there.

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
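(A sketch of the multi-level table-driven Huffman decoding Terje mentions, for readers who have not seen the technique. This is only an illustration in C, not the actual decoder he wrote; the two-level split, the table widths, and all names are assumptions, and table construction and bit-reservoir refill are omitted.)

    #include <stdint.h>

    #define PRIMARY_BITS 8          /* width of the first-level table (assumed) */

    typedef struct {
        int16_t symbol;             /* decoded symbol, or sub-table index when len == 0 */
        uint8_t len;                /* total code length in bits; 0 marks a long code   */
    } huff_entry;

    typedef struct {
        huff_entry  primary[1 << PRIMARY_BITS]; /* short codes; the table builder (not
                                                   shown) replicates entries for codes
                                                   shorter than PRIMARY_BITS            */
        huff_entry *subtables;      /* nested tables for codes longer than PRIMARY_BITS */
        uint8_t     sub_bits;       /* extra index bits used by each nested table       */
    } huff_table;

    typedef struct {
        uint32_t bits;              /* upcoming input bits, LSB first                   */
        int      avail;             /* how many of them are valid; refill not shown     */
    } bit_reader;

    static int huff_decode(const huff_table *t, bit_reader *br)
    {
        uint32_t peek = br->bits & ((1u << PRIMARY_BITS) - 1);
        huff_entry e  = t->primary[peek];

        if (e.len == 0) {
            /* Long code: the primary entry holds a sub-table index instead of a
             * symbol, and the bits beyond PRIMARY_BITS select the entry within
             * that sub-table. Sub-table entries store the *total* code length. */
            uint32_t rest = (br->bits >> PRIMARY_BITS) & ((1u << t->sub_bits) - 1);
            e = t->subtables[(uint32_t)e.symbol * (1u << t->sub_bits) + rest];
        }
        br->bits  >>= e.len;        /* consume exactly one code */
        br->avail  -= e.len;
        return e.symbol;
    }

The point of the nesting is that common short codes resolve with a single table lookup and no bit-by-bit tree walk, while rare long codes pay for just one extra lookup in a nested table.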
From: MitchAlsup on 12 Jul 2010 11:36

On Jul 11, 11:26 pm, Andy Glew <gigan...(a)andy.glew.ca> wrote:

I remember back in the RISC 1st generation, we attempted to drop basically anything that did not give 10%.

> But in case anyone cares, even a 0.30% increase in performance is the
> sort of thing Intel and AMD would kill for now.

And this, gentlemen, is why x86 won: a 1% microarchitectural gain every 3 months for a decade, on top of the performance one gets by dropping $1B into FAB technology every quarter/year (depending), and 4 generations of new microarchitecture on top of it all.

Mitch
From: James Van Buskirk on 12 Jul 2010 16:59

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message news:lc1sg7-5id1.ln1(a)ntp.tmsw.no...

> OTOH, afaik it should definitely be possible to plug length=256 and
> vector=4 into the FFTW synthesizer and get a very big, completely
> unrolled, minimum-operation count, piece of code out of it.

FFTW is in no way capable of producing minimum operation count code. I beat it every time. The only way that their code generator can catch up to my algorithms is if they look at my code and incorporate its new tricks into the set of transformations that their code generator tries.

Surprising that they took the problem to only one coder, since my understanding of the situation with SSE2 is that there must be many coders out there, each of whom knows a trick or two that the others don't that can increase performance by a percent or so. Of course each coder would probably want to be paid for revealing their secrets and it could end up costing a lot for a fairly small gain in performance.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
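(For context on the quoted suggestion: FFTW's code generator emits unrolled C "codelets" when the library is built; an application never invokes the generator itself, it just asks the planner for the desired size and lets it search over the codelets it was compiled with. Below is a minimal sketch of that user-side view using the standard FFTW3 C API; it assumes FFTW3 is installed and the program is linked with -lfftw3.)

    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 256;

        /* fftw_malloc guarantees the alignment the SIMD codelets expect. */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Plan first: with FFTW_PATIENT the planner actually times candidate
         * codelet combinations, and may overwrite the arrays while doing so,
         * which is why the input is filled in only afterwards. */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_PATIENT);

        for (int i = 0; i < n; i++) {
            in[i][0] = (double)i;   /* real part      */
            in[i][1] = 0.0;         /* imaginary part */
        }

        fftw_execute(p);
        printf("bin 0: %g %+gi\n", out[0][0], out[0][1]);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }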
From: Robert Myers on 12 Jul 2010 17:42
On Jul 12, 4:59 pm, "James Van Buskirk" <not_va...(a)comcast.net> wrote:
> "Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
> news:lc1sg7-5id1.ln1(a)ntp.tmsw.no...
>
> > OTOH, afaik it should definitely be possible to plug length=256 and
> > vector=4 into the FFTW synthesizer and get a very big, completely
> > unrolled, minimum-operation count, piece of code out of it.
>
> FFTW is in no way capable of producing minimum operation count
> code. I beat it every time. The only way that their code
> generator can catch up to my algorithms is if they look at my code
> and incorporate its new tricks into the set of transformations
> that their code generator tries.
>
> Surprising that they took the problem to only one coder since my
> understanding of the situation with SSE2 is that there must be many
> coders out there, each of whom knows a trick or two that the others
> don't that can increase performance by a percent or so. Of course
> each coder would probably want to be paid for revealing their
> secrets and it could end up costing a lot for a fairly small gain
> in performance.

I've only glanced at FFTW, but what you say doesn't surprise me.

I didn't mean to imply to Terje that what I had been taught about what was at the time referred to as a VFFT was directly applicable to his problem. I only meant to make the unsurprising comment that there always seems to be one more trick around the next corner, even when you're starting with something as slick, well-known, and thoroughly worked-over as the FFT.

I suspect the same would be true of the IMDCT.

Robert.