From: Grumble on 12 Feb 2006 12:26

Niels Jørgen Kruse wrote:

> Brian Hurt <bhurt(a)AUTO> wrote:
>
>> On the x86, you get the advantage of 8 new registers in going to
>> 64-bit. This generally increases the speed of most programs by more
>> than enough to overcome the decreased cache hit ratios 64 bits
>> induces, which means that 64-bit code is generally 10-15% faster than
>> the 32 bit code on the same hardware.
>
> There is a difference between AMD and Intel CPUs here. On Intel, the 8
> registers subtract from the pool of rename registers, on AMD it is use'm
> or lose'm.

Could you elaborate?

From: Andy Glew on 12 Feb 2006 16:07

nospam(a)ab-katrinedal.dk (Niels Jørgen Kruse) writes:

> Brian Hurt <bhurt(a)AUTO> wrote:
>
> > On the x86, you get the advantage of 8 new registers in going to
> > 64-bit. This generally increases the speed of most programs by more
> > than enough to overcome the decreased cache hit ratios 64 bits
> > induces, which means that 64-bit code is generally 10-15% faster than
> > the 32 bit code on the same hardware.
>
> There is a difference between AMD and Intel CPUs here. On Intel, the 8
> registers subtract from the pool of rename registers, on AMD it is use'm
> or lose'm.

It is more generic than just AMD vs. Intel.

The Intel P6 family (including the Pentium M family) has a separate
ROB and RRF. Architectural registers live in the RRF; if you add
architectural registers, the RRF just grows bigger. The ROB holds data,
and defines the size of the instruction window. As you report, AMD's K7
and K8 work the same way.

The Intel Pentium 4 family (Wmt, Nwd, Psc) has a unified PRF: there is
a single pool of physical registers used to hold both architectural and
rename registers. If there is a ROB, it is dataless, just used for
bookkeeping. There is no need to copy data values from ROB to RRF.

With a unified PRF machine, if you increase the number of
architectural registers you decrease the instruction window size.
Worse, you remove rename registers even if those architectural
registers are not in use.

Many people naively say "increasing registers improves performance".
The tradeoff is not so simple when it reduces speculation.

---

Note: many people confuse this issue with the separate, albeit
related, issue of where in the pipeline you read the PRF (or ROB+RRF).
The P6 family reads the ROB+RRF before placing values into the
reservation stations, RS, and relies on capturing data values while an
operation is pending in the RS. The Wmt family reads the PRF after
dispatching from the scheduler.

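The difference between the two organizations described in this post can be sketched as a toy model. This is only an illustration under stated assumptions: the function name `rename_registers`, the organization labels, and the 128-entry sizes are hypothetical, not parameters of any real chip.

```python
def rename_registers(organization, arch_regs, rob_entries=0, prf_entries=0):
    """Entries left over for holding speculative (renamed) results.

    organization: "rob+rrf" -> separate ROB and RRF (P6 / K7 / K8 style)
                  "unified" -> one shared PRF (Pentium 4 style)
    All sizes are hypothetical illustration values.
    """
    if organization == "rob+rrf":
        # Architectural state lives in its own RRF, so every ROB entry
        # can hold a speculative result no matter how large arch_regs is.
        return rob_entries
    if organization == "unified":
        # Architectural and rename registers share one pool; each
        # architectural register pins an entry whether it is in use or not.
        return prf_entries - arch_regs
    raise ValueError(f"unknown organization: {organization!r}")


# Hypothetical sizes: a 128-entry ROB versus a 128-entry unified PRF.
for arch in (8, 16):
    separate = rename_registers("rob+rrf", arch, rob_entries=128)
    unified = rename_registers("unified", arch, prf_entries=128)
    print(f"{arch:2d} architectural regs: rob+rrf={separate}, unified={unified}")
```

In this model, doubling the architectural registers leaves the ROB+RRF window untouched, while the unified pool loses one rename entry per added register.
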
From: Bernd Paysan on 12 Feb 2006 19:13

Andy Glew wrote:

> With a unified PRF machine, if you increase the number of
> architectural registers you decrease the instruction window size.
> Worse, you remove rename registers even if those architectural
> registers are not in use.

But the register window is huge compared to the number of architectural
registers.

> Many people naively say "increasing registers improves performance".
> The tradeoff is not so simple when it reduces speculation.

It also depends on how costly the additional loads and stores are (for the
reduced register version), or if they are basically "for free", since the
inherent ILP isn't that high, and the additional loads and stores just fill
up otherwise empty slots.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

From: Andy Glew on 14 Feb 2006 13:21

Bernd Paysan <bernd.paysan(a)gmx.de> writes:

> Andy Glew wrote:
> > With a unified PRF machine, if you increase the number of
> > architectural registers you decrease the instruction window size.
> > Worse, you remove rename registers even if those architectural
> > registers are not in use.
>
> But the register window is huge compared to the number of architectural
> registers.

I wish.

Consider a Willamette/Psc/K8 era machine.

With x86-64, roughly 64 architectural registers (exactly how many
depends).

Instruction window sizes (ROB sizes) of circa 128 on current machines,
on the high end.

64 architectural registers. 128 renamed registers.

=> the architectural registers would be roughly 1/3 of a 128+64 entry
PRF. Half of a 128 entry PRF. Half of a 256 entry PRF with threading.

(For simplicity, we will assume unified, not split into integer/FP.
If you split, pretty much the same thing works out.)

Some people have proposed adding more architectural registers. I ask
them: "are you happy reducing the register window by half"?

---

One of our big problems right now is that we have too many lregs - too
many architectural registers. Not only do they waste space in the
PRF, making it bigger and slower; they also waste space in the
renaming tables.

I have said, in this forum, for nigh on years now: eventually we must
go to multilevel register files. A small L1 register file that has
lots of ports. A large L2 register file that has fewer ports. Hell,
it could even be in main memory.

Same multilevel principle applies to renaming tables.

If you want to add more architectural registers, add them to the L2
PRF.

---

It's ironic: we might well have better performance - larger
instruction windows - if we had fewer architectural registers.

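The register-file fractions quoted in this post can be checked with a few lines of arithmetic. This is a minimal sketch: the helper `architectural_share` is a hypothetical name, and the "with threading" case assumes two threads that each carry their own 64 architectural registers, which the post does not spell out.

```python
from fractions import Fraction


def architectural_share(arch_regs, prf_entries, threads=1):
    """Fraction of a unified PRF pinned by architectural state."""
    return Fraction(arch_regs * threads, prf_entries)


ARCH_REGS = 64  # "roughly 64" architectural registers, per the post
WINDOW = 128    # "circa 128" instruction window entries, per the post

# PRF sized as window + architectural registers (128 + 64 entries).
print(architectural_share(ARCH_REGS, WINDOW + ARCH_REGS))

# PRF of 128 entries in total.
print(architectural_share(ARCH_REGS, 128))

# PRF of 256 entries, ASSUMING two threads with separate architectural state.
print(architectural_share(ARCH_REGS, 256, threads=2))
```
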
From: Anton Ertl on 14 Feb 2006 13:55

Andy Glew <first.last(a)employer.domain> writes:

>With x86-64 roughly 64 architectural registers

How do you compute that? I compute:

16 GPRs
16 xmm
 8 387/mmx
--
40

As for the K7/K8 having the rename registers in addition to the
architectural ones, I read somewhere that one variant added new
architectural FP registers without increasing the physical FP
registers, and only a later variant increased the number of physical
FP registers to restore the available rename registers to the original
number. I don't remember if the additional architectural registers
were the first 8 XMM registers (then the two variants would be the
Palomino and the K8), or if it was the second 8 XMM registers (then
the two variants would be the original K8 and some shrink).

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html