From: "Andy "Krazy" Glew" on 18 Jan 2010 02:08 I wrote th following for my wiki, http://semipublic.comp-arch.net/wiki/SYSENTER/SYSEXIT_vs._SYSCALL/SYSRET and thought thgat USEnet comp.arch might be interested: My bad. I defined the Intel P6's SYSENTER/SYSEXIT instructions. As I will explain below, they went a bit too far. AMD's more conventional SYSCALL/SYSRET instructions were more successful. No secrets here: the instructions have obviously been published, and the motivations have been documented in the patents. SYSENTER/SYSEXIT were motivated by the following: System calls are really just function calls. With security. They have to switch stacks, etc. Call instructions are really quite CISCy. They save the current instruction pointer, and then transfer. That's at least two operations. Many RISC instruction sets have given up CALL instructions that use a stack in favor of [[branch-and-link with call hint]] - an instruction that saves the current program counter in a register, and then branches. It has a hint to indicate that it is likely to return to after the call point. On the P6 microarchitecture, even branch-and-link really wanted to be two uops: # reg := current_IP # IP := target In my previous job at Gould I had maintained and optimized the low level code - the user level library stubs (which I had worked on inlining), and the code they transferred to in the kernel. The target for a system call is not specified by the user. Instead, it tends to be a single hardwired entry point. Sometimes the user specifies a vector number, i.e. a system call number, which might be used to vector code - but usually most of the different system calls share common code at the beginning, the system call prologue, so direct vectoring is not a win. Indeed, one often sees directly vectored system calls immediately save the vector number into a register, and then branch to a comon routine, and only much later vector diverge again. At Gould, at least, there was only a single library that was called by the user to transfer to the kernel. It was always mapped at the same address, whether it was shared or process private. In this situation I observed that it was redundant to have the system call instruction save the instruction pointer. The instruction pointer of the user code that was calling the system call was already saved on the user stack, and the system call stub libraries were always at the same address. Similarly, it was also known what address to return to. Therefore, I decided to create, not SYSCALL/SYSRET, but SYSENTER/SYSEXIT. I considered something almost if not exactly the same as AMD SYSCALL/SYSEXIT, but decided to be more aggressive, more RISCy, and define SYSENTER/SYSEXIT. SYSENTER just changed privilege levels, transferring to an address specified in a register. SYSRET did the reverse. Because x86 required the stack to be changed, SYSENTER/SYSEXIT had to define SS:ESP as well as CS:EIP. Hardwired values were used whenever possible. Observe: the original idea was to change just the program counter. Of course, in x86 this becomes CS:EIP, which is reasonable. But we start sliding down the slippery slope when we have to change SS:ESP as well. We aren't allowed to just block interrupts while kernel code loads a new SS:ESP. It turns out that there are consistency checks; e.g. NMI might panic if it occurred when there was a privileged CS:EIP and an unpriviliged SS:ESP. I have observed that this concern about interrupts that cannot be blocked is a key source of complexity in system architecture. 
Observe: the original idea was to change just the program counter. Of course, in x86 this becomes CS:EIP, which is reasonable. But we start sliding down the slippery slope when we have to change SS:ESP as well. We aren't allowed to just block interrupts while kernel code loads a new SS:ESP. It turns out that there are consistency checks; e.g. NMI might panic if it occurred when there was a privileged CS:EIP and an unprivileged SS:ESP.

I have observed that this concern about interrupts that cannot be blocked is a key source of complexity in system architecture. The RISC approach may be to assume that all interrupts, even NMIs, can be blocked, briefly, as the syscall code sets things up. But the advent of things like virtual machines, SMIs, etc., means that you can't make this assumption.

So, the original concept for SYSENTER was

    CS:EIP := some-register + some-hardwired-values

which you might have been able to do in a single uop on a P6 that had [[segment-register-renaming]], and which co-renamed CS:EIP together. But this became

    CS:EIP := some-register + some-hardwired-values
    SS:ESP := some-register + some-hardwired-values

Now, this is unlikely ever to be a single uop on a reasonable micro-dataflow OOO machine that is limited to writing only one destination at a time. But it is still reasonably fast.

But then we go down the slippery slope: segment-register renaming was die-dieted out of the original P6 (cut to save die area), and the above had to be expressed using the existing P6 segment microcode. P6 decided not to rename the privilege level, or the interrupt blocking flag, so pipeline draining was required. So, what I had hoped could be a single-uop, possibly single-cycle, SYSENTER instruction became first 2 uops, then ... many more. I think the fastest it could have been on P6 was 15 cycles.

Faced with these lossages, the extra overhead of SYSCALL is negligible:

    ecx    := EIP
    CS:EIP := some-register + some-hardwired-values
    SS:ESP := some-register + some-hardwired-values

In the best case, 3 single-cycle uops rather than 2. At one point in the design of P6, this would have restricted SYSCALL quite a bit, since the original plan was for P6 to have a 4-2-2 decoder template - a 3-uop SYSCALL could only have been decoded by decoder 0, whereas a 2-uop SYSENTER could have been decoded by any decoder. But when P6 adopted a 4-1-1 decoder template, this putative advantage for SYSENTER was lost. And when SYSENTER and SYSCALL both turned into microcoded monstrosities... While SYSENTER might have had a performance advantage over SYSCALL in some reasonably aggressive implementations, in the actual implementation the advantage was negligible. And SYSCALL, saving the user IP, was just plain more familiar.

Conclusion: there were reasons for SYSENTER, but it was probably a step too far.

--

I still think that both Intel and AMD missed a big opportunity: to make system calls truly as fast as function calls. Chicken and egg. Nobody wants to make the investment in hardware without a proven software benefit, but existing software is optimized to avoid expensive system call privilege level changes.

--

See [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21086.pdf AMD SYSCALL/SYSRET definition]
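For comparison, the shape the surviving AMD design takes from user space: on x86-64, SYSCALL saves the return RIP in RCX and RFLAGS in R11, which is why both registers must appear in the clobber list of a hand-rolled system call. A minimal Linux sketch (GCC/Clang inline asm):

    #include <sys/syscall.h>   /* SYS_getpid */

    static long raw_getpid(void)
    {
        long ret;
        /* The SYSCALL instruction itself writes the return RIP into RCX
           and RFLAGS into R11 - the "ecx := EIP" uop above, grown to 64 bits. */
        __asm__ volatile ("syscall"
                          : "=a" (ret)               /* rax: return value   */
                          : "a" ((long)SYS_getpid)   /* rax: syscall number */
                          : "rcx", "r11", "memory");
        return ret;
    }

Note that the user-side return address never needs to be spelled out: SYSCALL saved it, and SYSRET consumes it - the more familiar model that won.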
From: Anton Ertl on 18 Jan 2010 04:59

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>I still think that both Intel and AMD missed a big opportunity, to make
>system calls truly as fast as function calls. Chicken and egg.
>Nobody wants to make the investment in hardware without a proven
>software benefit, but existing software is optimized to avoid expensive
>system call privilege level changes.

But given that system calls have to do much more sanity checking on their arguments, and there is the common prelude that you mentioned (what is it for?), I don't see system calls ever becoming as fast as function calls, even with fast system call and system return instructions.

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
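The kind of checking Anton is pointing at, sketched with Linux-flavored names (fd_lookup() and do_write() are hypothetical stand-ins; access_ok() and the error constants follow Linux kernel conventions):

    /* A syscall must validate everything it receives from user space
       before acting on it - a plain function call gets to skip all of this. */
    long sys_write_sketch(int fd, const void *ubuf, size_t count)
    {
        struct file *f = fd_lookup(fd);     /* is the descriptor valid?      */
        if (!f)
            return -EBADF;
        if (!access_ok(ubuf, count))        /* does the buffer lie entirely
                                               within user address space?    */
            return -EFAULT;
        return do_write(f, ubuf, count);    /* only now touch the data       */
    }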
From: nmm1 on 18 Jan 2010 05:17

In article <2010Jan18.105904(a)mips.complang.tuwien.ac.at>,
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>>I still think that both Intel and AMD missed a big opportunity, to make
>>system calls truly as fast as function calls. Chicken and egg.
>>Nobody wants to make the investment in hardware without a proven
>>software benefit, but existing software is optimized to avoid expensive
>>system call privilege level changes.
>
>But given that system calls have to do much more sanity checking on
>their arguments, and there is the common prelude that you mentioned
>(what is it for?), I don't see system calls ever becoming as fast as
>function calls, even with fast system call and system return
>instructions.

It's been done, and the gains can be fairly high - unfortunately, more in maintainability than performance, so benchmarketing classifies such changes as undesirable :-(

The key is to have a clean system design, so the amount of sanity checking and the size of a standard prelude are minimal. For example, a high proportion of system calls in many applications can be very simple, 'unprivileged' ones like reading the clock or debugger hooks. There is no reason that the former shouldn't be as fast as a function call! Whether you can get there starting from here (i.e. the x86) is another matter ....

Nobody cares much about the cost of the very heavyweight ones, because any application that uses them much is broken by design.

Regards,
Nick Maclaren.
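Nick's clock example is, as it happens, where mainstream practice ended up: on Linux the vDSO services clock_gettime() entirely in user space, reading a kernel-maintained timebase page, so the common case really is just a function call. For illustration:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;
        /* On Linux/x86 this normally never enters the kernel:
           the vDSO fast path handles it in user mode. */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
        return 0;
    }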
From: Terje Mathisen <terje.mathisen at tmsw.no> on 18 Jan 2010 07:13

Anton Ertl wrote:
> "Andy \"Krazy\" Glew"<ag-news(a)patten-glew.net> writes:
>> I still think that both Intel and AMD missed a big opportunity, to make
>> system calls truly as fast as function calls. Chicken and egg.
>> Nobody wants to make the investment in hardware without a proven
>> software benefit, but existing software is optimized to avoid expensive
>> system call privilege level changes.
>
> But given that system calls have to do much more sanity checking on
> their arguments, and there is the common prelude that you mentioned
> (what is it for?), I don't see system calls ever becoming as fast as
> function calls, even with fast system call and system return
> instructions.

_Some_ system calls don't need that checking code!

I.e. using a very fast syscall(), you can return an OS timestamp within a few nanoseconds, totally obviating the need for application code to develop its own timers based on RDTSC (single-core/single-cpu systems only), ACPI timers, or whatever else is available.

Even if this is only possible for system calls that deliver a very simple result, and where the checking code is negligible, this is still an important subset.

The best solution today is to take away all attempts at security and move all those calls into a user-level library, right?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
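The RDTSC fallback Terje mentions, for reference - cheap, but (on hardware of that era) only trustworthy on a single core with a constant-rate TSC:

    #include <stdint.h>

    /* Read the CPU's timestamp counter. RDTSC returns the
       64-bit counter in EDX:EAX. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

Converting ticks to wall-clock time then requires calibrating against a real clock - exactly the per-application effort a fast timestamp syscall would remove.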
From: mac on 18 Jan 2010 07:50
> I have observed that this concern about interrupts that cannot be
> blocked is a key source of complexity in system architecture. The RISC
> approach may be to assume that all interrupts, even NMIs, can be
> blocked, briefly, as the syscall code sets things up. But the advent of
> things like virtual machines, SMIs, etc., means that you can't make this
> assumption.

Didn't Alpha PALcode have something like this? A special execution environment, no interrupts, privileged register access? I don't know much about it, but it looked like a clever hook for CISC operations.