From: "Andy "Krazy" Glew" on 22 Oct 2009 23:36

Terje Mathisen wrote:
> Andy "Krazy" Glew wrote:
>
> Andy, you really owe it to yourself to take a hard look at h264 and
> CABAC: In approximately the same timeframe as DES was replaced with AES,
> with a stated requirement of being easy to make fast/efficient on a
> PentiumPro cpu, the MPEG working groups decided that a "Context Adaptive
> Binary Arithmetic Coder" was the best choice for a video codec.
>
> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
> last of these branches depends on the value of that decoded bit.
>
> Until you've made that branch you don't even know which context to apply
> when decoding the next bit!
>
> (I have figured out workarounds (either branchless code or making them
> predictable) for most of those inline branches in the bit decoder, but
> that last context branch is unavoidable.)
>
> The only possible skip ahead is a really big one: You have to locate the
> next key frame and start another core/thread, but this approach is of
> extremely limited value if you are in a realtime situation, i.e. video
> conferencing.
>
> Terje

I have looked at CABAC, and you are right: it seems to be the branch equivalent of chasing down a hash chain.

It is also the equivalent of good online compression (well, duh) and encryption, where every bit depends on all previous bits (and possibly some or all future bits).

And if you can't skip to the next independent chunk of work - if there is no independent work to skip to - you are screwed. You have to make the dependent stuff run faster. Or do nothing at all. You make the dependent stuff run faster by architectures that make sequential code run faster - by having faster ALUs or, if it is important enough, by having dedicated hardware. Is CABAC important enough?

E.g. Terje, you're known to be a Larrabee fan. Can you vectorize CABAC?

I'm not opposed to making sequentially dependent stuff run faster.
I'm just observing that, if device limitations get in the way, there are lots of workloads that are not dominated by sequentially dependent stuff (at the fine grain).

As for CABAC, I must admit that I have some hope in algorithmic techniques similar to those you were recently discussing for parallelizing encryption.

For example: divide the image up into subblocks, and run CABAC on each subblock in parallel. To obtain similar compression ratios you would have to have keyframes less frequently. Bursts at keyframes possibly could be avoided by skewing them.

Moreover, I remain a fan of model based encoding. Although that requires significantly more computation, it is parallel.
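[The per-bit serial dependency being discussed can be seen in a toy context-adaptive binary arithmetic coder. This is only an illustrative sketch, not real CABAC: H.264 uses renormalized integer range coding and hundreds of contexts, whereas this sketch uses exact rational arithmetic and a single previous-bit context for clarity. Note that the decode-side branch both selects the symbol and determines the context for the next bit - the unavoidable dependency Terje describes.]

```python
from fractions import Fraction

def encode(bits):
    """Toy context-adaptive binary arithmetic encoder (exact rationals)."""
    counts = {0: [1, 1], 1: [1, 1]}   # per-context counts of [zeros, ones]
    low, high = Fraction(0), Fraction(1)
    ctx = 0
    for b in bits:
        c0, c1 = counts[ctx]
        mid = low + (high - low) * Fraction(c0, c0 + c1)
        if b == 0:
            high = mid
        else:
            low = mid
        counts[ctx][b] += 1           # adapt the model for this context
        ctx = b                       # next context is the bit just coded
    return (low + high) / 2           # any value inside the final interval

def decode(code, n):
    """Decoder mirrors the encoder; note the per-bit branch structure."""
    counts = {0: [1, 1], 1: [1, 1]}
    low, high = Fraction(0), Fraction(1)
    ctx = 0
    out = []
    for _ in range(n):
        c0, c1 = counts[ctx]
        mid = low + (high - low) * Fraction(c0, c0 + c1)
        if code < mid:                # branch on the decoded symbol...
            b = 0
            high = mid
        else:
            b = 1
            low = mid
        out.append(b)
        counts[ctx][b] += 1
        ctx = b                       # ...which also picks the next context
    return out
```

Until the symbol branch resolves, the decoder cannot even select the probability model for the following bit, so each iteration is serially dependent on the last.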
From: "Andy "Krazy" Glew" on 23 Oct 2009 00:12

Mayan Moudgill wrote:
> Andy "Krazy" Glew wrote:
>
>> (3) Recall that I am a fan of skip-ahead, speculative multithreading
>> architectures such as Haitham Akkary's DMT. If you can't predict a
>> branch, skip ahead to the next loop iteration or function return, and
>> execute code that you know will be executed with high probability.
>
> I was wondering - how much of the DMT performance improvement is
> because of all the speculative execution, and how much of it is
> because it's acting as an I-cache prefetch engine? IIRC, the performance
> numbers for some of the non-linear I-prefetch schemes seem to track the
> performance improvements reported by DMT.

It's about half and half. Every SpMT / DMT simulator that I have seen has the option of turning off speculation, and just using skipahead as an instruction prefetcher. And possibly a data prefetcher. In fact, good proposals don't bother to store the speculative results for instructions that can easily be recomputed - it's easier to (re)compute than it is to look up in a large store.

One might then reasonably ask "Why not take the hardware that is needed for SpMT, and use it to make your predictor tables larger?" Which is totally valid, and explains much of the last 10 years in CPU microarchitecture: we might call this the era of predictors. Aided and abetted by the fact that applications got *SIMPLER* in the last decade, as simplistic multimedia codes with simple access patterns became especially important.

However, most "predictors" are history based. They predict the last value seen, or a linear stride added to the last few values seen, or some other extrapolation. Or they rely on something like a Markov model for state transitions. I suppose that you can add more curve fitting to your predictor, but the easiest way to see what complicated non-linear data access patterns may be occurring is to actually execute the code.
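[The history-based predictors described above are simple to state in code. A minimal stride-predictor sketch - a hypothetical helper for illustration, not any particular hardware design - shows how little such a predictor "understands" about the code generating the addresses:]

```python
def stride_predictions(addresses, lookahead=1):
    """Predict the next address(es) by extrapolating the last observed
    stride - the kind of simple history-based predictor described above.
    It sees only the address history, never the code that produced it."""
    if len(addresses) < 2:
        return []                     # not enough history to infer a stride
    stride = addresses[-1] - addresses[-2]
    return [addresses[-1] + stride * k for k in range(1, lookahead + 1)]
```

Any access pattern that is not a simple extrapolation of recent history (pointer chasing, hash probing, data-dependent indexing) defeats this outright - which is the argument for executing the code instead.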
Use the real values if possible, or the best predictions for values that you can't obtain, and execute the intervening code. "Code based prefetch."

I once worked with a prefetcher guy who really, really, wanted access to the instruction stream. And the TLBs. What he was doing was executing chunks of code - by no means the whole code, but just the parts that he needed - and using that in his prefetcher / address predictor.

There's a spectrum, ranging from
(a) speculatively execute nothing, predict and prefetch everything,
through
(z) speculatively execute everything, remembering all speculative results,
with intermediate points such as
(m) speculatively execute everything, remembering cache miss results only, re-execute and verify,
(n) remember cache misses plus computations that a simple timing model shows would be on the critical path,
(p) don't remember cache misses - bias the cache replacement policy,
and (g) on the other side, speculatively execute using data value predictors and/or whatever stale data you have in the cache.
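[The "code based prefetch" idea - executing only the address-generating slice of the program to feed the prefetcher - might be sketched like this. The three-operand mini-ISA here is entirely made up for illustration; a real implementation would extract the slice from the actual instruction stream:]

```python
def prefetch_addresses(slice_ops, regs):
    """Execute only the address-generating slice of the code, using
    whatever register values are available, and emit the addresses the
    skipped loads would touch as prefetch hints.

    slice_ops: list of (op, dst, src, imm) tuples in a hypothetical
    mini-ISA; regs: dict of register name -> current (possibly stale)
    value."""
    hints = []
    for op, dst, src, imm in slice_ops:
        if op == "add":
            regs[dst] = regs[src] + imm        # recompute the index/pointer
        elif op == "load_addr":
            hints.append(regs[src] + imm)      # record the load's address
    return hints
```

The point is that the slice is tiny compared to the whole program, yet it reproduces exactly the non-linear address sequence that a history-based predictor cannot extrapolate.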
From: "Andy "Krazy" Glew" on 23 Oct 2009 00:23

Robert Myers wrote:
> On Oct 21, 11:54 pm, Joe Pfeiffer <pfeif...(a)cs.nmsu.edu> wrote:
>> We were sure supposed to take it seriously -- didn't Merced actually
>> have a i386 core on it when delivered?
>
> It had something or other, but PIII had to be in the works (Andy would
> know) and it would have stomped anything that came before.

I am not aware of an Itanium shipped or proposed that had an "x86 core on the side". There were proposals to have some special purpose hardware, like some x86 instruction decoders that packed into VLIW instructions.

> That is to say, I find it hard to believe that anyone took Itanium
> seriously as an x86 competitor.

I can assure you that it was sold that way to Intel senior management.
From: Terje Mathisen on 23 Oct 2009 02:28

Andy "Krazy" Glew wrote:
> Terje Mathisen wrote:
>> Andy "Krazy" Glew wrote:
>>
>> Andy, you really owe it to yourself to take a hard look at h264 and
>> CABAC: In approximately the same timeframe as DES was replaced with
>> AES, with a stated requirement of being easy to make fast/efficient on
>> a PentiumPro cpu, the MPEG working groups decided that a "Context
>> Adaptive Binary Arithmetic Coder" was the best choice for a video codec.
>>
>> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
>> last of these branches depends on the value of that decoded bit.
>>
>> Until you've made that branch you don't even know which context to
>> apply when decoding the next bit!
>>
>> (I have figured out workarounds (either branchless code or making them
>> predictable) for most of those inline branches in the bit decoder, but
>> that last context branch is unavoidable.)
>>
>> The only possible skip ahead is a really big one: You have to locate
>> the next key frame and start another core/thread, but this approach is
>> of extremely limited value if you are in a realtime situation, i.e.
>> video conferencing.
>>
>> Terje
>
> I have looked at CABAC, and you are right, it seems to be the branch
> equivalent of chasing down a hash chain.
>
> It is also the equivalent of good online compression (well, duh) and
> encryption, where every bit depends on all previous bits (and possibly
> some or all future bits).
>
> And if you can't skip to the next independent chunk of work - if there
> is no independent work to skip to - you are screwed. You have to make
> the dependent stuff run faster. Or do nothing at all. You make the
> dependent stuff run faster by architectures that make sequential code
> run faster - by having faster ALUs or, if it is important enough, by
> having dedicated hardware. Is CABAC important enough?
It is almost certainly important enough that anything remotely power-sensitive will need dedicated hw to handle at least the CABAC part.

> E.g. Terje, you're known to be a Larrabee fan. Can you vectorize CABAC?

Not at all, afaik.

> I'm not opposed to making sequentially dependent stuff run faster. I'm
> just observing that, if device limitations get in the way, there are
> lots of workloads that are not dominated by sequentially dependent
> stuff (at the fine grain).
>
> As for CABAC, I must admit that I have some hope in algorithmic
> techniques similar to those you were recently discussing for
> parallelizing encryption.
>
> For example: divide the image up into subblocks, and run CABAC on each
> subblock in parallel. To obtain similar compression ratios you would

This is the only silver lining: Possibly due to the fact that they were working on PS3 at the time, Sony specified that Bluray frames are all split into 4 independent quadrants, which means that they could trivially split the job across four of the 7 or 8 Cell cores. This also reduced the size of each subframe, in 1080i, to 256 K pixels. :-)

> have to have keyframes less frequently. Bursts at keyframes possibly
> could be avoided by skewing them.
>
> Moreover, I remain a fan of model based encoding. Although that requires
> significantly more computation, it is parallel.

OK.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
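[The quadrant/subblock split described above is what makes the decode embarrassingly parallel across independent bitstreams. A structural sketch - the per-subblock decoder here is a placeholder transform, not a real entropy decoder, and in CPython a ThreadPoolExecutor only overlaps I/O-bound work, so a real decoder would use processes or native threads:]

```python
from concurrent.futures import ThreadPoolExecutor

def decode_subblock(payload):
    # Stand-in for an independent entropy decode of one subblock.
    # The key property: each subblock carries its own bitstream and
    # starts with fresh contexts, so no state crosses this boundary.
    return [b ^ 1 for b in payload]   # placeholder transform

def decode_frame(subblocks):
    # The serial dependency chain is confined within each subblock,
    # so the subblocks (e.g. the four Bluray quadrants) can be
    # handed to separate workers.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode_subblock, subblocks))
```

The compression cost Andy mentions comes from resetting the adaptive contexts at each subblock boundary, which is exactly what buys the independence.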
From: nmm1 on 23 Oct 2009 04:13
In article <DcudnTf8XIfmmXzXnZ2dnUVZ_u2dnZ2d(a)metrocastcablevision.com>,
Bill Todd <billtodd(a)metrocast.net> wrote:
>
> The fact that Itanic came so close to world domination *despite* its
> abject failure to deliver on the promises that had seemed to make that
> domination inevitable tends to prove that the attempt to bluff its way
> to success was a daring and risky move but hardly an insane one. ...

Not really. It was a lot further from that than the hype indicated. It made practical headway in two areas, so let's consider them.

HPC was its most successful area, and something like two sites tried it and rejected it for every one that delivered a service using it. The SGI Altix was the main success, though I have heard that Bull made headway and suspect that Hitachi may have made some, too. And, when I say rejected, I mean that it was often made a condition on future tenders - i.e. don't tender IA64, as it will not be short-listed.

[ Aside: EU procurement law makes bias by public purchasers illegal, but Those Of Us With Clue had no difficulty in finding technical and financial reasons to veto IA64. Like, for example, just WHERE can you find staff capable of tracking down code-generation bugs in compilers for parallel IA64 codes? If anyone says "the vendor", then he clearly doesn't understand HPC. ]

The other was Mission Critical computers for Big Business. I met people from several of those, and they had all taken the position that they were going to run it in parallel with their existing systems for a year or more before making a decision. Asking how it was going got a very po-faced non-response.

My point here is that, if the Itanic had started to be pushed much harder, the real heavyweights would have joined the opposition. It never had an earthly of doing what it was originally hyped to do (i.e. entirely replace x86).

Regards,
Nick Maclaren.