From: ChrisQ on 14 Sep 2009 16:21

nmm1(a)cam.ac.uk wrote:
> Actually, things are getting worse. The problem is that floating-point
> is increasingly being interpreted as IEEE 754, including every frob,
> gizmo and brass knob. And the new version now specifies decimal; if
> that takes off, there will be pressure to provide that, often as well
> as binary - and there are two variants of decimal, too!
>
> IBM say that it adds only 5% to the amount of logic they need, but they
> have a huge floating-point unit in the POWER series. In small chips,
> designed for embedding, it's a massive overhead (perhaps a factor of
> two for binary and three for decimal?) I should appreciate references
> to any hard, detailed information on this.
>
> What is needed is a simplified IEEE 754 binary floating-point, which
> would need less logic, be faster and have better RAS properties. It
> wouldn't even be hard to do - it's been done, many times :-(

The last thing I need cluttering up an embedded cpu is floating point
capability. Any math is done fixed point here and then translated to and
from the external world. It's the only way to be confident about the
accuracy. I don't really trust the standard C lib anyway and am even less
likely to trust the float lib, where the sources, even when available, are
probably untidy, uncommented, cryptic and thus opaque :-)...

Regards,

Chris
From: ChrisQ on 14 Sep 2009 16:48

Bernd Paysan wrote:
> Indeed, e.g. LyX, a very friendly front end. Rendering a full book still
> takes a bit of time, however mostly because book authors nowadays put so
> many tricks into LaTeX that it sometimes requires 6 or 7 runs to get it
> all sorted out ;-).

"Computer Modern Typefaces" was the book. If you don't have a copy, it's
worth paying for just to marvel at the attention to detail, apart from
being interesting in its own right...

Regards,

Chris
From: "Andy "Krazy" Glew" on 19 Sep 2009 21:10

>> I believe you could indeed make a
>> 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>> dedicated hw, but you would have to make sure that _nobody_ except the
>> compiler writers ever needed to be exposed to it.

Trouble is, you need about 3x the instruction fetch/decode/scheduling
bandwidth. Since that is comparable to the actual instruction execution in
terms of power, depending on your machine, it is by no means a clear win.

You would need to be working on a code that allowed nearly all of the FP
"primitive operations" to be optimized away for it to be a win on scalar
code. On vector code, if the "FP primitive operations" are distributed
over a larger vector, then the amortization of instruction fetch overheads
may win.

Anyway, this is nothing new. I investigated this with a mind to exposing
the primitives to the compiler in the P6 era. Trouble is, the compiler had
bigger fish to fry.
From: Brett Davis on 20 Sep 2009 02:01

In article <JY6dnaLYurTZJDvXnZ2dnUVZ_tqdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
> Lets look at the impact of implementing software FP, augmented by the
> necessary HW support.
>
> 1. You give up 1/2 the registers. Typically FP instructions (implicitly)
> use a different register set. This increases the number of names
> available to a compiler, while reducing the orthogonality in the ISA.
> What would you rather have: 32 registers that can be used by both FP and
> integer ops, or 32 registers that can be used by FP and 32 registers
> that can be used by integer, with an additional cost to transfer between
> them?

For "best cost/performance" today I would put the FPU in the integer
registers.

The high end will have a wide vector processor that takes care of all the
heavy computing for integer and floating tasks. The sixteen integer
registers of AMD64 are plenty for a bunch of counters and pointers; the 32
registers of a RISC chip are silly/wasteful in this context.

The low end will benefit from having an FPU without the huge costs of
another register set and pipeline. (The design is twice as big with a
separate FPU.) 32 registers may be useful here, since they are shared with
FPU ops, except that C/C++ will almost never use 16...

On the really low end you will microcode the FPU ops and share the single
adder and multiplier. Actually you will likely share the multiplier
regardless; it is very expensive real estate.

As for the half-software FPU idea, I am not a fan of it, mostly because it
has a tiny niche between no FPU and the microcoded FPU. Not a big enough
market to pay for the hardware design, much less the compiler support.

If compared just to the huge costs of a separate FPU register set and
pipeline, yes, it would make sense for a low-end design. It might also
make sense as a retrofit to an existing low-end design. Though again, I
would redefine the FPU ops to work in the integer registers.

The instruction set will be different from the existing high-end design
with its separate FPU, but that is going to be true anyway with the half
FPU ops you invent.

You of course need 64-bit registers if you want to support double
precision floats. Most embedded tasks are more than happy with single
precision, so even a 32-bit core would benefit.

FYI: The cost to move between FPU and integer registers can be a dozen
cycles or more, lots more. (Think separate 20-cycle pipes that share data
through the cache...)

Brett
From: nmm1 on 20 Sep 2009 04:16
In article <4AB580FE.404(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>>> I believe you could indeed make a
>>> 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to
>>> dedicated hw, but you would have to make sure that _nobody_ except the
>>> compiler writers ever needed to be exposed to it.
>
> Trouble is, you need about 3x the instruction fetch/decode/scheduling
> bandwidth. Since that is comparable to the actual instruction execution
> in terms of power, depending on your machine, it is by no means a clear
> win.

Nobody claims that it is a clear win - certainly neither I nor Terje
would. My assertion is that it would be better overall, NOT solely for
performance reasons - but no more than that.

And you wouldn't need three times the instruction throughput, except for
highly tuned HPC and benchmarketing. Few 'floating-point' codes have more
than about 10% of their instructions actually executing floating-point
operations. Remember that load and store don't count, and I said that I
would also have a 'direct' comparison operation, too. When I last measured
this (decades ago), it would have needed very little more instruction
throughput, and RISC codes have more integer operations than the ones I
looked at.

> You would need to be working on a code that allowed nearly all of the FP
> "primitive operations" to be optimized away for it to be a win on scalar
> code.

Not so. That would be true for a very few codes, but others would gain
with little or no optimisation. For example, some codes spend half their
time switching between the pipelines (yes, really), and others are
dominated by calls to mathematical functions. By merging the pipelines,
the overheads for the latter could be reduced very considerably.

Now, working out the winners and losers, and by how much, would be part of
the research project that this proposal would involve. Nobody is saying
that it could be done by waving a magic wand.

> Anyway, this is nothing new. I investigated this with a mind to exposing
> the primitives to the compiler in the P6 era. Trouble is, the compiler
> had bigger fish to fry.

Yup. I never said that it was new - it predates my involvement in
computing, and the reason you give is the reason it has never been
restarted.

Regards,
Nick Maclaren.