From: nmm1 on 19 Nov 2009 04:58

In article <ggtgp-1B5ECC.00221217112009(a)netnews.asp.att.net>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>In article <4B00EB3A.3060700(a)patten-glew.net>,
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
>
>> I must admit that I am puzzled by using loop-iteration SpMT if you can
>> do the bypassing between the clusters.  I guess that you are using that
>> "batching" to hopefully reduce inter-cluster bypassing.  But then I am
>> not a big fan of inner loop SpMT.  Loops are easy, and are already done
>> pretty well.
>
>"Loops are easy." ;) Pray tell where these plentiful easy-to-speed-up
>areas are in CPU design. ;) Run straight into a brick wall you have. ;)

What I call vectorisable codes.  A solved problem since the 1970s.
Unfortunately, outside HPC, they are rare - and, even in HPC, they
don't dominate.

Regards,
Nick Maclaren.
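For readers outside HPC, a minimal C illustration of what "vectorisable"
means here (the function names are illustrative only, not from the
thread): each iteration is independent of the others, so a vectorising
compiler or vector unit can process many elements per instruction.

    /* Trivially vectorisable: no loop-carried dependences, unit stride.
       Vectorising compilers have handled this pattern since the 1970s. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Not vectorisable as written: each iteration reads the previous
       iteration's result. */
    void prefix_sum(int n, float *x)
    {
        for (int i = 1; i < n; i++)
            x[i] += x[i - 1];
    }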
From: Mayan Moudgill on 19 Nov 2009 06:34

Niels Jørgen Kruse wrote:
> Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>> The physical layout matters a lot, and hence has its own terminology.
>>
>> For example, most datapaths are bit interleaved - the actual wires might
>> look like .....
>> Bit interleaving makes bypassing of same bit to same bit easier.
>>
>> Unfortunately, bit interleaving introduces an O(N^2) factor to the area.
>> Doesn't matter for small degree of superscalarness, but matters as you
>> get bigger, as you become wire limited.
>> .....
>>
>> For a while I was puzzled by IBM's Power family, e.g. Power 4, which
>> seemed to be on the low side.  Until I was told that there were extra
>> wires not related to superscalarness, for DMA, etc.,
>
> Die photos of PPC970, POWER4 and POWER5 clearly have 2 identical integer
> units replicated, so I doubt they are interleaved.

I'll preface this response by saying that I'm not at all sure that a
good full-custom design couldn't work around some of the problems, and
that I am not practiced in this area.

Using bit interleaving for bypassing works something like this for two
ALUs, each with 2 inputs and 1 output:

        +----------
        v             XXXX | YYYY
   ---> MX3==============> YYYY
        ^             XXXX | YYYY
   <=====+=======================
        v             XXXX | YYYY ||
   ---> MX3 --> XXXX | YYYY       ||
        ^             XXXX | YYYY ||
   <-----+------------ YYYY       ||
        v             XXXX | YYYY ||
   ---> MX3 =============> YYYY   ||
        ^             XXXX | YYYY ||
         +=======================
        v             XXXX | YYYY
   ---> MX3 --> XXXX | YYYY
        ^             XXXX | YYYY
        +------------

Here the X's and Y's are the logic for the two ALUs, the --- are the
wires for the X ALU inputs and output, and the === are the wires for
the Y ALU inputs and output.  Control wires are not shown.  This cell
will be (kind of) replicated 32 times for a 32-bit ALU.  Nothing is to
scale; in particular, in real life the muxes might stretch all the way
across the height of the cell, and the wires would be much closer
together (or even on top of each other).

At the left we have the path to the register file.  The register file
and the result values feed four 3-1 bypass muxes; two select the X
inputs and two select the Y inputs.  The X/Y inputs are then sent to
the appropriate ALU cell.

[The reason why the cells can't be replicated exactly is the
propagate/generate trees for adds and the compare result trees; hence
the empty space between the cells and muxes.]

Now, consider where this is useful:
- X must be tall, so that there is space for the wires to go over it
- X must be skinny, so that the additional wire delay does not impact
  cycle time
- X must not use many layers of metal, so that it is possible to route
  Y's wires over it
- Y must not have too large a cycle time, so that the additional wire
  delay (of routing over X) does not kill performance
- X and Y must have roughly the same height.

This pretty much restricts the technique to simple ALUs; add/subtract
is probably as complex an operation as you can sustain.  There is very
little chance that X can contain shifters or multipliers - that would
violate the skinniness and layers-of-metal restrictions.

Y can, in general, be a little more complex than X.  But, again, if Y
is too much more complex than X, then Y's implementation may have to be
contorted to fit the height of X.  Also, Y cannot use a full cycle [it
still has the additional routing delay over X], so it really can't be a
very complex unit.

Bit lane interleaving seems, to me, to be a somewhat overly restrictive
and impractical optimization technique.  It will let you be blindingly
fast on a small subset of the total op space.

I do not know anything about the exact thinking that went into the
POWER architects' choices for integer unit layout.  However, given that
I probably got my attitudes from them, they may have shared my
opinions :)
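To make the bypass logic itself concrete (separately from the physical
layout question), here is a behavioural sketch in C of one interleaved
bit slice with the four 3-1 muxes Mayan describes: each ALU input picks
between the register file and either ALU's previous result.  This is a
hypothetical model for illustration, not taken from any real design;
all names and the choice of X as an adder and Y as a subtracter are
assumptions.

    #include <stdint.h>

    /* Source selected by each 3-1 bypass mux. */
    enum bypass_src { FROM_REGFILE, FROM_X_RESULT, FROM_Y_RESULT };

    /* One interleaved bit slice: two ALUs (X and Y), each with two
       inputs fed by a 3-1 bypass mux. */
    struct bit_slice {
        uint32_t x_result, y_result;   /* previous-cycle results */
    };

    static uint32_t mux3(enum bypass_src sel, uint32_t regfile_val,
                         const struct bit_slice *s)
    {
        switch (sel) {
        case FROM_X_RESULT: return s->x_result;  /* same-bit X bypass */
        case FROM_Y_RESULT: return s->y_result;  /* same-bit Y bypass */
        default:            return regfile_val;  /* no bypass needed  */
        }
    }

    /* One cycle: all four muxes select, then both ALUs fire. */
    static void cycle(struct bit_slice *s,
                      const enum bypass_src sel[4], const uint32_t rf[4])
    {
        uint32_t xa = mux3(sel[0], rf[0], s), xb = mux3(sel[1], rf[1], s);
        uint32_t ya = mux3(sel[2], rf[2], s), yb = mux3(sel[3], rf[3], s);
        s->x_result = xa + xb;   /* X: simple adder (assumed) */
        s->y_result = ya - yb;   /* Y: slightly more complex unit (assumed) */
    }

The point of the interleaved layout is that the FROM_X_RESULT and
FROM_Y_RESULT paths are same-bit-to-same-bit, so the physical bypass
wires stay short.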
From: Stephen Fuld on 19 Nov 2009 14:35

nmm1(a)cam.ac.uk wrote:
> In article <he1hiv$jmf$1(a)news.eternal-september.org>,
> Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>> So the structure is a two dimensional table, one dimension indexed by n,
>> and the other implementation dependent, but treated as an LRU?  What
>> does each entry in the table contain?  It must have the hash, and
>> presumably a few (4?) bit field to indicate prediction direction and
>> confidence.  Anything else?
>
> No, and it needn't even be that.  The 'n' need not be stored in the
> structure.  The hash is looked up and the confidence value (which
> includes the direction) copied to the separate register file that
> is indexed by n (and, again, contains just a confidence value).

OK, then a one dimensional table of m entries, each containing a hash
value and a confidence value.  m is architecture independent; larger
values would keep more "history".  And a separate structure, n deep,
containing the confidence values.

snip

> What I am describing is an approach that might lead to
> better prediction for O-O codes.  But the details would be modified
> according to whatever techniques are already known to be effective
> for branch prediction, and as a result of simulation.
>
> My key idea is to get away from using just the instruction address
> and thread index for branch prediction, and to use the address of
> the object being worked on (or whatever the compiler regards as
> appropriate).  Really no more than that.  But doing even that needs
> some handshaking between the compiler and the hardware.

I understand the goal.  While I am still not sure of the details of
your proposal, let me make some comments that, I think, are applicable.

I have a limited understanding of hardware design, so someone else,
please correct me.  Depending upon the number of registers used in the
hash, the instructions could get expensive.  For example, if you want
to hash five registers, I think you would need at least three cycles
(at two register read ports per instruction cycle) just to read the
values from the registers.  Then you have an "associative memory"
lookup of the m values in the LRU structure.  With sufficient hardware,
this could occur in parallel, but for reasonable sized m this is a lot
of hardware.  So it might add more cycles to the instruction execution.

I believe the LRU updating of the structure, which is expensive, could
be done after the instruction completes, thus not impacting instruction
execution time.  It would limit the rate at which such instructions
could be issued, but that may not be a problem.

But if the instructions take, say, 3 cycles, is the improvement in
branch prediction worth the cost?

On the other hand, what about a simplified version - something like the
following.  The branch predictor would use a few predefined bits of a
predefined register, appended to the branch address, for indexing the
predictor table.  Since branch instructions often don't use the second
register read port, it might not extend the prediction time at all.
The compiler would know which register was used, of course, and would
try to keep that register as the "object id", or object address, or
some unique value.  If the value was not wanted for the prediction, the
compiler could add an instruction to zero it out prior to the branch.
This extra instruction might be almost free, as it has no dependencies
and could be easily scheduled.

This might get some of the advantages of your proposal, but with a lot
less hardware and complexity.
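To illustrate that simplified version, here is a minimal sketch in C.
The table size, the number of object bits, and the use of 2-bit
saturating counters are my assumptions for illustration; nothing here
comes from the thread beyond "a few bits of a designated register,
appended to the branch address".

    #include <stdint.h>

    #define PRED_BITS 12                 /* 4096-entry table (assumed) */
    #define PRED_SIZE (1u << PRED_BITS)
    #define OBJ_BITS  4                  /* "a few bits" of the register */

    /* Classic 2-bit saturating counters; 0-1 = not taken, 2-3 = taken. */
    static uint8_t pred_table[PRED_SIZE];

    /* Index = low bits of the branch address with a few bits of the
       designated "object" register appended. */
    static inline uint32_t pred_index(uint64_t branch_pc, uint64_t obj_reg)
    {
        uint32_t obj = (uint32_t)obj_reg & ((1u << OBJ_BITS) - 1);
        return (((uint32_t)branch_pc << OBJ_BITS) | obj) & (PRED_SIZE - 1);
    }

    static inline int predict_taken(uint64_t pc, uint64_t obj_reg)
    {
        return pred_table[pred_index(pc, obj_reg)] >= 2;
    }

    static inline void train(uint64_t pc, uint64_t obj_reg, int taken)
    {
        uint8_t *c = &pred_table[pred_index(pc, obj_reg)];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }

The effect is that the same branch instruction, operating on objects
whose designated-register bits differ, trains separate counters - which
is the whole point of the proposal.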
> Oh, the other main idea is 'advisory instructions' only, so that
> a CPU could ignore them without behavioural changes.  But that isn't
> a new idea.

Agreed.

--
 - Stephen Fuld
(e-mail address disguised to prevent spam)
From: nmm1 on 19 Nov 2009 15:00

In article <he46lk$8ph$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>
>I have a limited understanding of hardware design, so someone else,
>please correct me.  Depending upon the number of registers used in the
>hash, the instructions could get expensive.  For example, if you want to
>hash five registers, I think you would need at least three cycles (at
>two register read ports per instruction cycle) just to read the values
>from the registers. ...

I have none :-)  If I were running that project, I would first run
simulations to discover if it had enough potential to be worth doing
some serious analysis.  And one of the aspects of that would be just
how simple a hash gave most of the benefit.  It might well be just the
bottom 8 bits of a single register!

My take on it comes from looking at the code of O-O method functions,
and noting that a lot of the conditional branches are controlled by a
single factor.  That isn't always explicit in the data, but that's no
obstacle.

Regards,
Nick Maclaren.
From: Terje Mathisen on 20 Nov 2009 03:32
nmm1(a)cam.ac.uk wrote:
> SET_PRED n <registers>
>     Sets predictor n to an undefined hash of the register values.
> USE_PRED n <confidence>
>     Uses that predictor, with confidence between (say) -7 and 7;
>     negative values imply the confidence of it not being taken.
> PUSH_PREDS <predictors>
> POP_PREDS <predictors>
>     Used at function calls and elsewhere to indicate the change in use.
> Simply clearing a predictor is always a permitted action, and most
> hardware would keep only a very small stack of them.
>
> And, yes, my approach is specifically aimed at O-O languages, where
> the same code is used for different objects, and where the path is
> dependent more on the object than the instruction.  Would it be a

Isn't it easier to "simply" agree with the compiler writers that the
self pointer will always be loaded into (say) ECX/RCX, and then let the
content of that register, along with the EIP/RIP code address, control
the BTB lookups?

If you're already going to have a multi-level branch predictor, with a
top-level chooser to determine which one to use, then you could have
this object-type predictor be one of the lower-level ones.

This at least is both forward and backward compatible with existing
code, and does not require any instruction set changes.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
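A rough sketch in C of how Terje's suggestion might slot into a
tournament-style predictor.  The hash, table sizes, and chooser policy
are my assumptions for illustration; the only thing taken from the
post is the idea of indexing one component by RIP combined with the
register the calling convention reserves for the self pointer.

    #include <stdint.h>

    #define TBL_BITS 12
    #define TBL_SIZE (1u << TBL_BITS)

    static uint8_t pc_pred[TBL_SIZE];   /* conventional PC-indexed counters */
    static uint8_t obj_pred[TBL_SIZE];  /* PC-plus-self-pointer counters    */
    static uint8_t chooser[TBL_SIZE];   /* per-PC: which component to trust */

    /* Object-aware index: mix RIP with the self-pointer register (RCX in
       this hypothetical convention).  Low pointer bits are dropped on the
       assumption that objects are at least 16-byte aligned. */
    static inline uint32_t obj_index(uint64_t rip, uint64_t rcx)
    {
        return (uint32_t)(rip ^ (rcx >> 4)) & (TBL_SIZE - 1);
    }

    static inline int predict(uint64_t rip, uint64_t rcx)
    {
        uint32_t pi = (uint32_t)rip & (TBL_SIZE - 1);
        int pc_taken  = pc_pred[pi] >= 2;
        int obj_taken = obj_pred[obj_index(rip, rcx)] >= 2;
        return chooser[pi] >= 2 ? obj_taken : pc_taken;
    }

Training would bump whichever component predicted correctly, as in a
conventional tournament predictor.  Note that, as Terje says, nothing
here needs new instructions - only the calling-convention agreement
about where the self pointer lives.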