From: Terje Mathisen "terje.mathisen at on 12 Jul 2010 17:28

James Van Buskirk wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
> news:lc1sg7-5id1.ln1(a)ntp.tmsw.no...
>
>> OTOH, afaik it should definitely be possible to plug length=256 and
>> vector=4 into the FFTW synthesizer and get a very big, completely
>> unrolled, minimum-operation count, piece of code out of it.
>
> FFTW is in no way capable of producing minimum operation count
> code. I beat it every time. The only way that their code

Nice!

> generator can catch up to my algorithms is if they look at my code
> and incorporate its new tricks into the set of transformations
> that their code generator tries.

OK, that's good.

By how many percent would your code beat them for the 256 and 2048 element IMDCT?

>
> Surprising that they took the problem to only one coder since my
> understanding of the situation with SSE2 is that there must be many
> coders out there, each of whom knows a trick or two that the others
> don't that can increase performance by a percent or so. Of course
> each coder would probably want to be paid for revealing their
> secrets and it could end up costing a lot for a fairly small gain
> in performance.

I have the great advantage that I don't make a living from my optimization work, so I don't need to keep any secrets. :-)

My current daytime job is to be chief architect for the complete swap operation of the largest Norwegian cell network, i.e. pretty far from SIMD optimization.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: Terje Mathisen "terje.mathisen at on 13 Jul 2010 07:36

James Van Buskirk wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
> news:l1esg7-34e1.ln1(a)ntp.tmsw.no...
>
>> By how many percent would your code beat them for the 256 and 2048 element
>> IMDCT?
>
> That would require my writing code for that transform in the first

OK.

IMDCT is particularly interesting these days because every single audio codec has been built around it.

> place. In operation counts, perhaps identical because I haven't
> come up with any new algorithm since the one that I published that
> still holds the minimum for power of 2 FFTs (unless someone else
> has beaten me subsequently), and since that algorithm can be
> considered to be built on DCTs...
>
> Looking at their timing numbers for smaller power of 2 FFTs it seems
> to me that FFTW doesn't utilize the 2 way SIMD capabilities of at
> least a core 2 duo effectively. I am pretty much ignorant of the
> style of project you are working on: is it single-precision, double-

Audio codecs do very well with single precision and 4-way SIMD processing.

Almost all of Vorbis is defined to be fp, but it is of course possible to write a fixed-point implementation. The main problem is that at one particular point you must, by codec definition, be prepared to handle 64 bits of dynamic range...

> precision or integer data? I'm really not very interested in this
> stuff any more; I'm trying to make progress in completely different
> projects.
>

Anything I could help out on?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
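
For readers unfamiliar with the transform being discussed, here is a minimal reference sketch of the IMDCT in its direct O(N^2) form. It is not code from either poster and has nothing to do with the optimized, unrolled SIMD versions the thread is about. The function name `imdct_naive` is hypothetical, and the sketch assumes the common textbook convention: N coefficients in, 2N time samples out, no scaling factor (codecs differ on scaling and windowing).

```python
import math

def imdct_naive(X):
    """Direct O(N^2) IMDCT sketch: N coefficients -> 2N unscaled samples.

    Assumed convention (not taken from the thread):
        y[n] = sum_k X[k] * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))
    """
    N = len(X)
    n0 = 0.5 + N / 2.0  # phase offset of the assumed convention
    return [
        sum(X[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
            for k in range(N))
        for n in range(2 * N)
    ]

# Tiny example: 4 coefficients give 8 output samples.
y = imdct_naive([1.0, 2.0, 3.0, 4.0])
```

Under this convention the first half of the output is antisymmetric and the second half symmetric, so only N of the 2N outputs are independent. That redundancy is one reason fast implementations can map the transform onto a smaller FFT or DCT instead of evaluating the double loop.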