From: Stephen Sprunk on 10 Mar 2006 12:42

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2006Mar5.094233(a)mips.complang.tuwien.ac.at...
> "Stephen Sprunk" <stephen(a)sprunk.org> writes:
>>There is no guarantee that any given pointer returned by malloc() or any
>>syscalls is guaranteed to be in the 32-bit address space.
>
> Yes, one would have to extend a few system calls (those that allocate
> address space) with a 32-bit flag (or have 32-bit variants of these
> system calls); malloc() only gets its memory from system calls. As
> Greg Lindahl wrote, this has been done in many OSs (and the IA-32
> (compatibility mode) support in AMD64 long mode OSs requires much
> more effort).

If it were done in the syscalls, that would help, but I was assuming there were no API changes, and that any reduction from 64 to 32 bits would have to be handled by the compiler. In that case, every pointer returned from an external function would have to be checked and possibly remapped. Additionally, any API function that wanted/returned a 64-bit int would need to be adjusted, which costs a few cycles.

There might be some workloads that benefit from all this work, but it seems like a high price to pay to reduce the cache impact of 64-bit pointers.

> But my question was this: in which cases would the compiler have to
> use more instructions for ILP32 programs in 64-bit mode than for
> I32LP64 programs?

Other than what I mentioned previously, I can't think of any. There's no 32-bit instructions that were _removed_ in the AMD64 set, though there's a few that require a longer encoding (e.g. INC, DEC).

>>While the specific workload does cause variations, the cache effects of
>>64-bit pointers are usually more than offset by the performance gained by
>>having eight extra GPRs.
>
> In 64-bit mode, the compiler can use these GPRs (as well as the 8
> extra XMM registers).
Right; I was comparing 64-bit mode to 32-bit mode and trying to explain why it's often worth it to use full 64-bit mode instead of 32-bit mode even if you had no use for 64-bit pointers or ints.

Interestingly, the Linux kernel does something similar to what the OP asked. Due to sign extension, one can count on the kernel being located at -2GB to 0 in both 32-bit and 64-bit modes. The kernel does use 64-bit pointers when dealing with userland, but AFAIK only (negative) 32-bit pointers for its internals. Until the kernel exceeds 2GB, this makes supporting both modes transparently much easier. GCC even has a special compilation mode to do things this way (but it only works for kernel code).

>>Slower cache, but less need to use it.
>
> Sounds like a fallacy to me. The cache accesses that the additional
> registers avoid would all be cache hits. The cache misses that the
> bigger pointers cause are not reduced by having more registers.

Cache hits still cost a few cycles. If a program is constrained by having only six or seven registers available, it's going to have many loads from the stack; eliminating or reducing those loads improves performance. Register spill loads/stores also take up fetch/decode slots.

Beyond that, having more registers means there should be fewer false dependencies, and compilers are free to hoist loads further up in the block (beyond the CPU's OOO window), reducing the impact of both cache misses and hits.

Some of my terminology might be a little off, but this is what I've gleaned from lots of documentation and testing explanations.

S

--
Stephen Sprunk    "Stupid people surround themselves with smart
CCIE #3723         people. Smart people surround themselves with
K5SSS              smart people who disagree with them." --Aaron Sorkin
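
A minimal C sketch of the sign-extension point above: a 64-bit address can be stored in 32 bits and recovered by sign extension only if it lies in the lowest 2GB or in the highest 2GB (the "-2GB to 0" range) of the address space. The same test is roughly what a compiler would have to apply to every pointer coming back from an external function. The helper names are hypothetical, not kernel or GCC interfaces, and the GCC mode referred to is presumably `-mcmodel=kernel`, though the post does not name it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr equals the sign extension of its low 32 bits, i.e. the
   top 33 bits are all zero (lowest 2GB) or all one (highest 2GB). */
static bool fits_sext32(uint64_t addr)
{
    uint64_t top33 = addr >> 31;
    return top33 == 0 || top33 == 0x1FFFFFFFFull;
}

/* Keep only the low 32 bits; the caller must have checked the range. */
static uint32_t compress_addr(uint64_t addr)
{
    assert(fits_sext32(addr));
    return (uint32_t)addr;
}

/* Rebuild the 64-bit address by replicating bit 31 into bits 32..63. */
static uint64_t expand_addr(uint32_t low)
{
    return (low & 0x80000000u) ? (0xFFFFFFFF00000000ull | low)
                               : (uint64_t)low;
}
```

For example, `0xFFFFFFFF80001000` (just above -2GB) round-trips through `compress_addr` and `expand_addr`, while `0x0000000100000000` (4GB) is rejected by `fits_sext32` and would need remapping.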
From: Grumble on 12 Mar 2006 17:17

Stephen Sprunk wrote:
> There's no 32-bit instructions that were _removed_ in the AMD64 set,
> though there's a few that require a longer encoding (e.g. INC, DEC).

I don't understand this statement. What are x86 and/or AMD64 32-bit instructions?

Several IA-32 instructions are deprecated in AMD64.

The following instructions are invalid in 64-bit mode:

AAA - ASCII Adjust After Addition
AAD - ASCII Adjust Before Division
AAM - ASCII Adjust After Multiply
AAS - ASCII Adjust After Subtraction
BOUND - Check Array Bounds
CALL (far absolute) - Procedure Call Far
DAA - Decimal Adjust after Addition
DAS - Decimal Adjust after Subtraction
INTO - Interrupt to Overflow Vector
JMP (far absolute) - Jump Far
LDS - Load DS Segment Register
LES - Load ES Segment Register
POP DS - Pop Stack into DS Segment
POP ES - Pop Stack into ES Segment
POP SS - Pop Stack into SS Segment
POPA, POPAD - Pop All to GPR Words or Doublewords
PUSH CS - Push CS Segment Selector onto Stack
PUSH DS - Push DS Segment Selector onto Stack
PUSH ES - Push ES Segment Selector onto Stack
PUSH SS - Push SS Segment Selector onto Stack
PUSHA, PUSHAD - Push All to GPR Words or Doublewords

The following instructions are invalid in long mode:

SYSENTER - System Call (use SYSCALL instead)
SYSEXIT - System Exit (use SYSRET instead)

--
Regards, Grumble
From: Anton Ertl on 14 Mar 2006 12:14

"Stephen Sprunk" <stephen(a)sprunk.org> writes:
>"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
>news:2006Mar5.094233(a)mips.complang.tuwien.ac.at...
>> But my question was this: in which cases would the compiler have to
>> use more instructions for ILP32 programs in 64-bit mode than for
>> I32LP64 programs?
>
>Other than what I mentioned previously, I can't think of any. There's no
>32-bit instructions that were _removed_ in the AMD64 set, though there's a
>few that require a longer encoding (e.g. INC, DEC).

There are some (listed by somebody else), but that does not matter for my question. I was not asking about compatibility mode vs. 64-bit mode for ILP32 programs, but about ILP32 vs. I32LP64 programs in 64-bit mode.

>>>Slower cache, but less need to use it.
>>
>> Sounds like a fallacy to me. The cache accesses that the additional
>> registers avoid would all be cache hits. The cache misses that the
>> bigger pointers cause are not reduced by having more registers.
>
>Cache hits still cost a few cycles. If a program is constrained by having
>only six or seven registers available, it's going to have many loads from
>the stack; eliminating or reducing those loads improves performance.
>Register spill loads/stores also take up fetch/decode slots.

Sure, but my point was that someone might read into your statement that the number of cache misses may be reduced by having more registers, but that is not the case.

>Beyond that,
>having more registers means there should be fewer false dependencies, and
>compilers are free to hoist loads further up in the block (beyond the CPU's
>OOO window), reducing the impact of both cache misses and hits.
Compilers usually do code motion and instruction scheduling before register allocation, so having more registers only helps alleviate the negative effects of such code motion; of course, with few registers available, the compiler writer might disable or weaken such optimizations.

Concerning the OOO window, 8 additional registers won't help much given that the OOO window contains around 100 instructions in these CPUs. Moreover, unless the compiler does very aggressive scheduling (e.g., modulo scheduling), it will usually not move instructions beyond the OOO window (and it will probably not use such optimizations on architectures with only 16 registers).

Concerning cache misses, dealing with that by hoisting a load very far up is a waste of a register; better use a prefetch instruction at the place where you would put the load, and schedule the load for an L1 hit.

- anton
--
M. Anton Ertl                        Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at   Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
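
A rough C sketch of the prefetch suggestion, assuming GCC or Clang (`__builtin_prefetch` is their builtin; other compilers would need a different intrinsic). Instead of hoisting the load itself many iterations early and tying up a register for the whole time, the loop issues a prefetch for a later element and performs the real load only where the value is used. The function name and the prefetch distance of 16 are made up for illustration; the right distance depends on the hardware and would need measuring.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n). The indirect access makes the
   addresses hard for a hardware prefetcher to predict. */
static long sum_indirect(const long *data, const size_t *idx, size_t n)
{
    assert(n == 0 || (data != NULL && idx != NULL));

    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint only: ask for the line needed PREFETCH_DISTANCE
           iterations from now (0 = read, 3 = keep in all cache levels).
           No register holds the value in the meantime. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);

        /* The real load sits next to its use and, if the prefetch
           arrived in time, hits in L1. */
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch does not change the result, only (possibly) the timing, so the function returns the same sum with or without it.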