From: Bernd Paysan on 22 Oct 2009 08:15

Andy "Krazy" Glew wrote:
> Some of them think that the problem with VLIW was the instruction set.

The problem with Itanic was its designed-by-committee ISA.  Too many "good"
features together aren't good anymore.

The scaling concept of Itanic was wrong, too.  Look at current GPGPUs and how
they scale (e.g. ATI/AMD's): through a relatively simple VLIW instruction set,
through SIMD, through multi-threading, through multi-core.  Multi-everything.
The ISA itself seems to be stabilizing, but the interface usually is through
source-like stuff (when you use OpenCL, you don't ship precompiled binaries;
neither do you when you use DirectX shader programs or the corresponding
OpenGL stuff).

Itanic scaling was at first mainly through added ILP, i.e. scaling VLIW.  I
didn't believe that you could scale VLIW beyond some sweet spot, e.g. the four
integer operations per cycle of my 4stack.  If you go further, the returns
diminish.  The same is true for OOO - you can extract some parallelism out of
a program, but not too much.  If you use up your transistor resources mostly
for supporting logic, and not for the actual work itself, you are going in the
wrong direction.

OOO and VLIW are there to maximize the output of a single core.  Given that a
lot of software is just written for that, it makes sense.  But only up to the
inherent parallelism that software has.

> Some of them think that binary translation will solve all those
> problems.  Me, I think that binary translation is a great idea - but
> only if the target microarchitecture makes sense in the absence of
> binary translation, if we lived in an open source world and x86
> compatibility was not an issue.

We live in an open source world, and x86 compatibility isn't an issue - if
you ignore the Windows-dominated desktop ;-).  Binary translation matters for
closed source; for open source, it's a non-issue.  Below the desktop, in
mobile internet devices, open source is already quite dominant (even when the
whole offering is proprietary, like Apple's iPhone, most parts are open
source); above the desktop, on servers, the same is true.  Smaller devices
have custom applications, i.e. even though they often have proprietary
licenses or simply are trade secrets, the programmer has the sources readily
available.

However, this is 10 years later.  And for higher levels of parallelism, even
open source doesn't help.  If you want to make GCC use a GPGPU, you'd better
rewrite it from scratch (actually, that's my suggestion anyway: rewrite GCC
from scratch, it stinks of rotten source ;-).  The same goes for many other
programs.  We discussed Excel and its suboptimal algorithms - why not redesign
the spreadsheet around high-performance computing?  A column is a vector, a
sheet is a matrix?  Create data dependencies for recalculation?

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
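
A minimal sketch in C of what "create data dependencies for recalculation"
could look like: each cell records its dependents, an edit marks only the
affected subgraph dirty, and recalculation touches only dirty cells.  The
names (Cell, set_value, recalc) and the structure are invented for
illustration; a real engine would also need cycle detection, a topological
recalculation order over all dirty cells, and vectorized evaluation of whole
columns.

#include <stdio.h>

#define MAX_DEPS 8

typedef struct Cell Cell;
struct Cell {
    double value;
    double (*formula)(void);        /* NULL for plain input cells */
    Cell  *dependents[MAX_DEPS];    /* cells whose formulas read this one */
    int    ndependents;
    int    dirty;
};

/* Mark everything downstream of a cell as needing recalculation. */
static void mark_dirty(Cell *c)
{
    for (int i = 0; i < c->ndependents; i++) {
        if (!c->dependents[i]->dirty) {
            c->dependents[i]->dirty = 1;
            mark_dirty(c->dependents[i]);
        }
    }
}

/* Change an input cell; only its dependents are invalidated. */
static void set_value(Cell *c, double v)
{
    c->value = v;
    mark_dirty(c);
}

/* Recompute a cell only if it is dirty; clean cells keep cached values. */
static void recalc(Cell *c)
{
    if (c->dirty && c->formula) {
        c->value = c->formula();
        c->dirty = 0;
    }
}

/* Example sheet: C1 = A1 + B1 */
static Cell A1, B1, C1;
static double c1_formula(void) { return A1.value + B1.value; }

int main(void)
{
    C1.formula = c1_formula;
    A1.dependents[A1.ndependents++] = &C1;
    B1.dependents[B1.ndependents++] = &C1;

    set_value(&A1, 2);
    set_value(&B1, 3);
    recalc(&C1);
    printf("C1 = %g\n", C1.value);   /* prints 5 */

    set_value(&B1, 40);              /* one edit: only C1 is dirtied */
    recalc(&C1);
    printf("C1 = %g\n", C1.value);   /* prints 42 */
    return 0;
}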
From: Bernd Paysan on 22 Oct 2009 08:37

Terje Mathisen wrote:
> Andy, you really owe it to yourself to take a hard look at h264 and
> CABAC: In approximately the same timeframe as DES was replaced with AES,
> with a stated requirement of being easy to make fast/efficient on a
> PentiumPro cpu, the MPEG working groups decided that a "Context Adaptive
> Binary Arithmetic Coder" was the best choice for a video codec.
>
> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
> last of these branches depends on the value of that decoded bit.
>
> Until you've made that branch you don't even know which context to apply
> when decoding the next bit!

The solution, of course, is that CABAC goes to specific CABAC-decoding
hardware ;-).  CABAC decoders in an FPGA can decode 1 symbol/cycle with just
1300 LEs (you can only fit two b16s in that space).  There's absolutely no
point in doing this with a CPU.  Perhaps people should start putting an FPGA
onto the CPU die for this sort of stateful bit-manipulation task.

The other solution is to get working on next-generation codecs and not repeat
the same mistake.  The IMHO right way to do it is to sort the coefficients by
their context (i.e. first do a wavelet transformation), and then encode with
a standard dictionary-based entropy encoder like LZMA, which is compact and
fast to decompress (for mobile encoders, you probably need an option to go
with LZ77 for faster compression but less dense results, or simply restrict
the size of the dictionary).  Wavelet transformations also make it easier to
deliver different qualities from the same raw data (e.g. stream SD/HD/4k at
the same time, where the SD and HD clients only extract the scaled-down
versions, and the HD/4k clients use the SD stream as a base plus additional
streams for higher resolution) - also helpful for video editing (preview the
SD stream fast, render on the 4k stream).

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
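
To make Terje's point concrete, here is a schematic binary arithmetic decoder
loop in C.  The probability and transition tables are placeholders, not the
H.264 CABAC tables; only the control structure is the point: one branch picks
MPS vs. LPS, one decides whether to flip the MPS, the renormalization loop
adds more, and the decoded bit usually selects the context for the next bit,
so none of this can be resolved ahead of time on a wide core.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t state;   /* probability state index */
    uint8_t mps;     /* current "most probable symbol" */
} Context;

/* Placeholder tables (invented for illustration, not from the spec). */
static const uint16_t lps_range[8]      = { 120, 100, 80, 64, 48, 32, 16, 8 };
static const uint8_t  next_state_mps[8] = { 1, 2, 3, 4, 5, 6, 7, 7 };
static const uint8_t  next_state_lps[8] = { 0, 0, 1, 2, 3, 4, 5, 6 };

static uint32_t range = 510, offset = 0;

static int read_bitstream_bit(void)   /* stub: real code reads the stream */
{
    return 0;
}

static int decode_bit(Context *ctx)
{
    uint32_t rlps = lps_range[ctx->state];
    int bit;

    range -= rlps;
    if (offset >= range) {                 /* branch 1: LPS or MPS path? */
        bit = !ctx->mps;
        offset -= range;
        range = rlps;
        if (ctx->state == 0)               /* branch 2: flip the MPS? */
            ctx->mps = !ctx->mps;
        ctx->state = next_state_lps[ctx->state];
    } else {
        bit = ctx->mps;
        ctx->state = next_state_mps[ctx->state];
    }
    while (range < 256) {                  /* branch 3+: renormalize bit by bit */
        range <<= 1;
        offset = (offset << 1) | read_bitstream_bit();
    }
    return bit;   /* the *next* context typically depends on this value */
}

int main(void)
{
    Context ctx = { 3, 0 };
    for (int i = 0; i < 4; i++)
        printf("bit %d: %d\n", i, decode_bit(&ctx));
    return 0;
}

A dedicated state machine (such as the FPGA block mentioned above) collapses
all of these serial decisions into table lookups within one cycle, which is
why it wins so decisively over a general-purpose CPU here.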
From: Bernd Paysan on 22 Oct 2009 08:49

Andy "Krazy" Glew wrote:
> OK, OK, OK.  This is not my area.  But I would love to understand WHY
> something like this cannot work.

The problem with CMOS is that all the transistors have embedded diodes that
need to be reverse biased to make them operable.  A transistor really is a
four-terminal device (source, drain, gate, *and* bulk).

This sort of low-power reversible computation stuff is more for nano-scale
electronics (using carbon nanotubes and whatever science-fiction-like things
you can imagine ;-) than for microelectronics.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Anton Ertl on 22 Oct 2009 10:14

Stephen Fuld <SFuld(a)alumni.cmu.edu.invalid> writes:
>Paul Wallich wrote:
>> I would bet that a huge chunk of the time isn't in doing the actual
>> calculations but in verifying that the calculations can be done.
>> Spreadsheets are pretty much the ultimate in mutably-typed interactive
>> code, and there's very little to prevent a recalculation from requiring
>> a near-universal reparse.
>
>Wow, I hadn't thought of that.  But if you are say running multiple
>simulation runs, or something else where the only thing changing is the
>value of some parameters, not the "structure" of the spreadsheet, does
>Excel understand that it can skip at least most of the reparse?

Probably not, because for most users Excel is fast enough even with slow
algorithms.  And those for whom it isn't have probably invested so much in
Excel that most of them would not switch to a spreadsheet program with better
algorithms even if one were available.  So there is no incentive for other
spreadsheet programs to improve their algorithms, and therefore also no
incentive for Excel.

Concerning the structure of the spreadsheet, it changes only cell by cell, so
any parsing should only have to deal with one cell at a time.  Or, if you
have operations that deal with many cells (say, copying a column or loading a
spreadsheet), it's reasonable that the time taken is proportional to the size
of the change; and these operations are not the most frequent, so it's not so
bad if they take a little time on huge spreadsheets.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
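
As a rough illustration of the "one cell at a time" point, a toy sketch in C
in which each cell caches its parsed formula, so an edit reparses only the
edited cell and a bulk operation costs time proportional to the cells it
touches.  The Cell structure and the stand-in parser are invented for
illustration; real spreadsheet parsing is of course far more involved.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char   source[64];   /* what the user typed, e.g. "=A1+B1"       */
    int    parsed;       /* stands in for a compiled expression tree */
    long   parse_count;  /* how often this cell has been parsed      */
} Cell;

/* Toy "parser": a real spreadsheet builds an expression tree here. */
static int parse_formula(const char *src)
{
    return (int)strlen(src);
}

static void edit_cell(Cell *c, const char *new_source)
{
    snprintf(c->source, sizeof c->source, "%s", new_source);
    c->parsed = parse_formula(c->source);  /* only this cell is reparsed */
    c->parse_count++;
}

int main(void)
{
    enum { N = 100000 };
    Cell *sheet = calloc(N, sizeof *sheet);
    if (!sheet) return 1;

    for (int i = 0; i < N; i++)
        edit_cell(&sheet[i], "=A1+B1");    /* initial load: N parses */

    edit_cell(&sheet[42], "=A1*B1");       /* one edit: one reparse */

    long total = 0;
    for (int i = 0; i < N; i++)
        total += sheet[i].parse_count;
    printf("parses after load + 1 edit: %ld (not 2*%d)\n", total, N);

    free(sheet);
    return 0;
}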
From: Anton Ertl on 22 Oct 2009 10:42
Robert Myers <rbmyersusa(a)gmail.com> writes:
>I've fiddled a little with the off-the-shelf Itanium compilers, but I
>always assumed that none of those compilers were even remotely good
>enough that you could expect just to run old software through them and
>get anything like hoped-for performance.  John Dallman has had a bit
>to say on the subject here.
>
>When I talked about rewriting code, I meant just that, not merely
>recompiling it.  I wasn't all that interested in the standard task:
>how do you feed bad code to an Itanium compiler and get acceptable
>performance, because I was pretty sure that the answer was: you
>don't.

Bad code?  For most software, performance is not that much of an issue, and
the developers have left much more performance on the table than what
switching between IA-64 and other architectures, or between different
compilers for IA-64, or coding things in an IA-64-friendly manner would have
bought.

To get an idea of how the Itanium II performs compared to other CPUs on code
that's not tuned for it (at least not more than for any other architecture),
take a look at the first graph (slide 4) of
http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-slides.pdf

This is performance per cycle, and the benchmarks are pretty CPU-bound.  The
compilers used here are various gcc versions (the fastest code produced by
the gcc versions available on the test machines is shown).  The only Gforth
version that does not treat IA-64 as a generic architecture is 0.7.0, and the
only thing that's IA-64-specific there is that it knows how to flush the
I-cache.

The performance per cycle of the Itanium II is not particularly good, but
also not particularly bad.  The only ones that are significantly faster on
Gforth 0.7.0 are the IA-32 and AMD64 implementations, and that's because they
have implemented at least a BTB for indirect branch prediction (see the
dispatch sketch after this post).  Interestingly, the 21264B, which has a
similar mechanism for branch prediction, is barely faster per clock on this
code than the Itanium II.

On Gforth 0.5.0 the Itanium II does OK.  On Gforth 0.6.x the comparison is a
little unfair, because some machines have the "dynamic superinstruction"
optimization, while others don't have it.  The Itanium II performs best among
those that don't have it, but is much slower than those that have it.

OK, this is just one benchmark; you can see another one at
<http://www.complang.tuwien.ac.at/franz/latex-bench>.  Just some lines from
there:

Machine                                                           seconds
- UP1500 21264B 800MHz 8MB L2 cache, RedHat 7.1 (b1)                3.28
- Intel Atom N330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit   2.323
- Athlon (Thunderbird) 900, Win2K, MikTeX 1.11d                     2.306
- Athlon 64 X2 5600+, 2800MHz, 1MB L2, Debian Etch (64-bit)         0.624
- Xeon 5450, 3000MHz, (2*2*)6MB L2, Debian Etch (64-bit)            0.460
- iBook G4 12", 1066MHz 7447A, 512KB L2, Debian Sarge GNU/Linux     2.62
- PowerMac G5, 2000MHz PPC970, Gentoo Linux PPC64                   1.47
- Sun Blade 1000, UltraSPARC-IIIi 900Mhz Solaris 8                  3.09
- HP workstation 900MHz Itanium II, Debian Linux                    3.528

Again, not great performance, but not extremely bad, either.  If the others
had not beaten it on clock rate, it would have been competitive even on such
applications compiled with gcc.

>RedHat Enterprise still supports Itanium, so far as I know.  Open
>source depends on gcc, perhaps the cruftiest bit of code on the
>planet.  Yes, gcc will run on Itanium, but with what level of
>performance?

See above.
>Could the open source community, essentially founded on
>x86, turn on a dime and compete with Microsoft running away with
>Itanium?  Maybe with IBM's muscle behind Linux, open source would have
>stood a chance, but I'm not so sure.  After all, IBM would always have
>preferred an Itanium-free world.  Had I been at Microsoft, I might
>have seen a Wintanium future as really attractive.

Microsoft obviously did not see it that way, because they eventually decided
against IA-64 and for AMD64.  I don't know why they decided that way, but I
see two flaws in your scenario: To run away with IA-64, Windows software
would have had to run on IA-64 at all.  Most of it is not controlled by
Microsoft, and even the software controlled by Microsoft does not appear to
be that portable (looking at the reported dearth of applications, including
applications from Microsoft, for Windows NT on Alpha).  In contrast, free
software tends to be much more portable, so the situation on IA-64 would have
been: Windows with mostly emulated applications against Linux with native
applications.

And would Microsoft have produced a better compiler for IA-64 than SGI?

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
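
The dispatch sketch referenced above: a schematic threaded-code inner
interpreter in GNU C, using the labels-as-values extension the way Gforth
does.  The "goto **ip++" at the end of each primitive is the indirect branch
whose prediction (BTB or none) dominates the per-cycle numbers; dynamic
superinstructions reduce its cost by copying the primitives' code back to
back so most of those branches disappear.  This is a simplified illustration,
not Gforth's actual engine.

#include <stdio.h>

int main(void)
{
    static void *program[8];        /* threaded code: addresses of labels */
    long stack[16], *sp = stack;
    void **ip;

    /* Threaded code for: push 2, push 3, add, print, halt.
       Literal operands are stored inline after the primitive address. */
    program[0] = &&lit;   program[1] = (void *)2;
    program[2] = &&lit;   program[3] = (void *)3;
    program[4] = &&add;
    program[5] = &&print;
    program[6] = &&halt;

    ip = program;
    goto **ip++;                    /* start the interpreter */

lit:                                /* push the inline literal */
    *sp++ = (long)*ip++;
    goto **ip++;                    /* <- the indirect branch in question */
add:
    sp[-2] += sp[-1];
    sp--;
    goto **ip++;
print:
    printf("%ld\n", sp[-1]);        /* prints 5 for the program above */
    goto **ip++;
halt:
    return 0;
}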