From: jacko on 12 Jul 2010 06:08

> What really baffles me is that I know there is a market for LESS
> performance - provided that it comes in a lot less power hungry.
> Not a small market, either. And it could be done.

Well, here's an idea. I need 800 MHz to play DivX, socket 462 of course. I currently do this by underclocking an Athlon which will not run at its rated speed. It needs no fan, just the heatsink. I know it would be possible to have a plug-in replacement that was as powerful or more, but ran cool enough to not even need the heatsink. I'd pay, say, £30 when I need a replacement.
From: Terje Mathisen "terje.mathisen at tmsw.no" on 12 Jul 2010 10:39

Andy Glew wrote:
> I think that Nick may have meant a 5% relative improvement in branch
> prediction, going, e.g., from 94%: improving 5% of (100% - 94% = 6%) is
> 0.30%, for a net branch prediction rate of 94.30%.
>
> But in case anyone cares, even a 0.30% increase in performance is the
> sort of thing Intel and AMD would kill for now.
>
> When I started as a computer architect, my boss said "don't waste
> time on anything that doesn't get a 2X improvement". Then 20%. Now, we
> are scraping the bottom of the barrel looking for 1%ers.

A somewhat related story about barrel scraping:

A few years ago (3?) I was asked (by Rad Game Tools) to optimize a public domain Ogg Vorbis decoder, with the goal of beating the fastest available closed-source implementation.

Since 60-70% of the decoding time was spent in the IMDCT (Inverse Modified Discrete Cosine Transform), the initial suggestion from Jeff Roberts was to simply write a SIMD version of the existing IMDCT (which was believed to be optimal, but written in plain C).

I did that, and made that function about 3X faster, which is a perfectly OK result for a 4-wide vector implementation, since it also carries more overhead in order to gather the input data (swizzles and the like). The problem was that this wasn't enough: I was still 50% slower than the target. :-(

At this point I started to go through every single part of the algorithm looking for both SIMD opportunities (I did find a few) and regular algorithmic optimizations, with the result that after a week or two I got down to 5-10% slower than what I needed to match. One of the big items was a complete rewrite of the Huffman decoder, making it completely table-driven even for longer tokens (multi-level nested tables).

Barrel scraping time: at this point I was nearly desperate. I reconsidered the IMDCT multiple times without finding any problems there, so I started to simply try _everything_: any form of optimization that I had ever tried myself or heard about had to be tested, and the result was that at least 90% of them turned out to make the code slower instead of faster. :-(

Anyway, I kept at it for a few more weeks and finally got to the point where my code had near parity in 32-bit mode (2% slower for stereo and 5% faster for mono), while in 64-bit mode it was about 10-15% faster for all sound samples.

Terje

PS. After I had turned in that barely-faster implementation (it did have a few other good points, mainly that a single binary would run on all x86 SSE versions, and the source code used macros and compiler intrinsics so it could be compiled for both 32- and 64-bit platforms, on any CPU which supported 4-wide SIMD operations), Jeff and Sean Barrett (who had written the public domain code I started from) discovered that the IMDCT was far from optimal after all: it was possible to get rid of almost half the outer loop operations, so after another round of SIMD vectorization my decoder was suddenly 25-30% faster than anything else. :-)

> As y'all know, I think there are big improvements possible in single
> thread CPU performance - 10%ers at least, and NXers in some areas.

The real key imho is that someone has to really believe that improvement is possible, even if it will take a lot of effort to get there.

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
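(A sketch of the multi-level table-driven Huffman decoding Terje mentions, for readers who have not seen the technique. This is only an illustration in C, not the actual decoder he wrote; the two-level split, the table widths, and all names are assumptions, and table construction and bit-reservoir refill are omitted.)

    #include <stdint.h>

    #define PRIMARY_BITS 8          /* width of the first-level table (assumed) */

    typedef struct {
        int16_t symbol;             /* decoded symbol, or sub-table index when len == 0 */
        uint8_t len;                /* total code length in bits; 0 marks a long code   */
    } huff_entry;

    typedef struct {
        huff_entry  primary[1 << PRIMARY_BITS]; /* short codes; the table builder (not
                                                   shown) replicates entries for codes
                                                   shorter than PRIMARY_BITS            */
        huff_entry *subtables;      /* nested tables for codes longer than PRIMARY_BITS */
        uint8_t     sub_bits;       /* extra index bits used by each nested table       */
    } huff_table;

    typedef struct {
        uint32_t bits;              /* upcoming input bits, LSB first                   */
        int      avail;             /* how many of them are valid; refill not shown     */
    } bit_reader;

    static int huff_decode(const huff_table *t, bit_reader *br)
    {
        uint32_t peek = br->bits & ((1u << PRIMARY_BITS) - 1);
        huff_entry e  = t->primary[peek];

        if (e.len == 0) {
            /* Long code: the primary entry holds a sub-table index instead of a
             * symbol, and the bits beyond PRIMARY_BITS select the entry within
             * that sub-table. Sub-table entries store the *total* code length. */
            uint32_t rest = (br->bits >> PRIMARY_BITS) & ((1u << t->sub_bits) - 1);
            e = t->subtables[(uint32_t)e.symbol * (1u << t->sub_bits) + rest];
        }
        br->bits  >>= e.len;        /* consume exactly one code */
        br->avail  -= e.len;
        return e.symbol;
    }

The point of the nesting is that common short codes resolve with a single table lookup and no bit-by-bit tree walk, while rare long codes pay for just one extra lookup in a nested table.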
From: MitchAlsup on 12 Jul 2010 11:36

On Jul 11, 11:26 pm, Andy Glew <gigan...(a)andy.glew.ca> wrote:

I remember back in the RISC 1st generation, we attempted to drop basically anything that did not give 10%.

> But in case anyone cares, even a 0.30% increase in performance is the
> sort of thing Intel and AMD would kill for now.

And this, gentlemen, is why x86 won: a 1% microarchitectural gain every 3 months for a decade, on top of the performance one gets by dropping $1B into FAB technology every quarter/year (depending), and 4 generations of new microarchitecture on top of it all.

Mitch
From: James Van Buskirk on 12 Jul 2010 16:59

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message news:lc1sg7-5id1.ln1(a)ntp.tmsw.no...

> OTOH, afaik it should definitely be possible to plug length=256 and
> vector=4 into the FFTW synthesizer and get a very big, completely
> unrolled, minimum-operation count, piece of code out of it.

FFTW is in no way capable of producing minimum operation count code. I beat it every time. The only way that their code generator can catch up to my algorithms is if they look at my code and incorporate its new tricks into the set of transformations that their code generator tries.

Surprising that they took the problem to only one coder, since my understanding of the situation with SSE2 is that there must be many coders out there, each of whom knows a trick or two that the others don't that can increase performance by a percent or so. Of course each coder would probably want to be paid for revealing their secrets and it could end up costing a lot for a fairly small gain in performance.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
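(For context on the quoted suggestion: FFTW's code generator emits unrolled C "codelets" when the library is built; an application never invokes the generator itself, it just asks the planner for the desired size and lets it search over the codelets it was compiled with. Below is a minimal sketch of that user-side view using the standard FFTW3 C API; it assumes FFTW3 is installed and the program is linked with -lfftw3.)

    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 256;

        /* fftw_malloc guarantees the alignment the SIMD codelets expect. */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Plan first: with FFTW_PATIENT the planner actually times candidate
         * codelet combinations, and may overwrite the arrays while doing so,
         * which is why the input is filled in only afterwards. */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_PATIENT);

        for (int i = 0; i < n; i++) {
            in[i][0] = (double)i;   /* real part      */
            in[i][1] = 0.0;         /* imaginary part */
        }

        fftw_execute(p);
        printf("bin 0: %g %+gi\n", out[0][0], out[0][1]);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }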
From: Robert Myers on 12 Jul 2010 17:42
On Jul 12, 4:59 pm, "James Van Buskirk" <not_va...(a)comcast.net> wrote:
> "Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
> news:lc1sg7-5id1.ln1(a)ntp.tmsw.no...
>
> > OTOH, afaik it should definitely be possible to plug length=256 and
> > vector=4 into the FFTW synthesizer and get a very big, completely
> > unrolled, minimum-operation count, piece of code out of it.
>
> FFTW is in no way capable of producing minimum operation count
> code. I beat it every time. The only way that their code
> generator can catch up to my algorithms is if they look at my code
> and incorporate its new tricks into the set of transformations
> that their code generator tries.
>
> Surprising that they took the problem to only one coder since my
> understanding of the situation with SSE2 is that there must be many
> coders out there, each of whom knows a trick or two that the others
> don't that can increase performance by a percent or so. Of course
> each coder would probably want to be paid for revealing their
> secrets and it could end up costing a lot for a fairly small gain
> in performance.

I've only glanced at FFTW, but what you say doesn't surprise me.

I didn't mean to imply to Terje that what I had been taught about what was at the time referred to as a VFFT was directly applicable to his problem. I only meant to make the unsurprising comment that there always seems to be one more trick around the next corner, even when you're starting with something as slick, well-known, and thoroughly worked-over as the FFT.

I suspect the same would be true of the IMDCT.

Robert.