From: MitchAlsup on 12 Jun 2010 19:28

On Jun 12, 2:33 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> If you have kernel access so can hook INT3, what's wrong with
>
> FOR b FROM lowest byte of instruction TO highest byte DO
>     *b = INT3 (single byte trap instruction)
>
> FOR b FROM highest byte of instruction TO lowest byte of instruction DO
>     *b = appropriate byte of new instruction

Consider the case where an interested CPU has already fetched the first byte (or first several bytes) of said instruction, and one of these fetched bytes happens to be a major opcode byte, but the rest of the instruction fetch gets delayed by this or that. There is no architectural specification that requires the fetch process to back up when an instruction cache line is stolen.

Now your function comes in and writes INT 3 over the rest of the instruction, snatching the cache line from the interested CPU.

Finally the delayed CPU finishes fetching and now executes the instruction with all of its minor opcodes, mod/rm, sib, and constants containing INT 3 byte patterns.

Nothing good will come of this.

You have to prevent the "interested" CPU from fetching the first byte of the instruction before smearing INT 3's over the opcode space. The only chance you have of making this work is to align this particular instruction on a cache line boundary... which one cannot do for a random instruction.

Mitch

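The race Mitch describes can be modelled on a plain byte buffer. The following is only a toy sketch, not real cross-modifying code: `patch_with_int3_smear` and `torn_fetch` are hypothetical names, the two `mov` encodings are picked purely for illustration, and the "delayed CPU" is simulated by gluing an earlier-latched prefix onto a later snapshot of memory.

```python
INT3 = 0xCC  # single-byte x86 breakpoint opcode


def patch_with_int3_smear(code, start, new_bytes):
    """Two-pass patch from the quoted pseudocode.

    Yields a snapshot of the instruction bytes after every single store,
    so a caller can inspect each intermediate state.
    """
    n = len(new_bytes)
    for i in range(n):               # pass 1: lowest byte to highest byte
        code[start + i] = INT3
        yield bytes(code[start:start + n])
    for i in reversed(range(n)):     # pass 2: highest byte to lowest byte
        code[start + i] = new_bytes[i]
        yield bytes(code[start:start + n])


def torn_fetch(latched_prefix, snapshot):
    """Model a fetcher that latched a prefix earlier and reads the rest now."""
    return latched_prefix + snapshot[len(latched_prefix):]


old = bytes([0xB8, 0x01, 0x00, 0x00, 0x00])  # mov eax, 1
new = bytes([0xBB, 0x02, 0x00, 0x00, 0x00])  # mov ebx, 2
code = bytearray(old)

snapshots = list(patch_with_int3_smear(code, 0, new))

# The remote CPU latched the old opcode byte, then stalled until pass 1 was done.
seen = torn_fetch(old[:1], snapshots[len(new) - 1])
print(seen.hex())  # b8cccccccc
```

In this model the delayed fetcher ends up with `mov eax, 0xCCCCCCCC`: neither the old nor the new instruction, and no trap is taken, which is the failure being pointed out.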
From: Andy 'Krazy' Glew on 12 Jun 2010 20:17

On 6/12/2010 4:28 PM, MitchAlsup wrote:
> On Jun 12, 2:33 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> If you have kernel access so can hook INT3, what's wrong with
>>
>> FOR b FROM lowest byte of instruction TO highest byte DO
>>     *b = INT3 (single byte trap instruction)
>>
>> FOR b FROM highest byte of instruction TO lowest byte of instruction DO
>>     *b = appropriate byte of new instruction
>
> Consider the case where an interested CPU has already fetched the
> first byte (or first several bytes) of said instruction and one of
> these fetched bytes happens to be a major opcode byte, but the rest of
> the instruction fetch gets delayed by this or that. There is no
> architectural specification that requires the fetch process to back up
> when an instruction cache line is stolen.
>
> Now your function comes in and writes INT 3 over the rest of the
> instruction, snatching the cache line from the interested CPU.
>
> Finally the delayed CPU finishes fetching and now executes the
> instruction with all of its minor opcodes, mod/rm, sib, and constants
> containing INT 3 byte patterns.
>
> Nothing good will come of this.
>
> You have to prevent the "interested" CPU from fetching the first byte
> of the instruction before smearing INT 3's over the opcode space. The
> only chance you have of making this work is to align this particular
> instruction on a cache line boundary... which one cannot do for a
> random instruction.

Fair enough. There are no atomicity properties defined for instruction fetch. Perhaps there should be.

Lacking this, the only safe way is to do a shootdown: stop all CPUs, do the write of the instruction bytes, perform a serializing instruction on each CPU to flush all instruction caches and prefetch (that is architecturally defined), and then restart.

Thanks, Mitch. I had gone over this with the Pin people, but had forgotten.

Now, there are de-facto instruction fetch atomicity properties. But nothing official. (E.g. on Intel (since P6), and I believe AMD, any ifetch entirely within an aligned 16 bytes is atomic. And Intel (since P6) will clear when the first byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).) So I believe that the algorithm I describe will work on Intel since P6, for WB memory. I think that it also will work for UC memory. (Glew's rule: any architectural property should work when caches are disabled.)

But there is nothing official.

---

By the way, this is an example of where making self-modifying code be recognized immediately, part of the memory ordering model, simplifies things.

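The "entirely within an aligned 16 bytes" condition is easy to check for a given instruction address and length. A minimal sketch follows; `within_aligned_block` is a hypothetical helper, and the 16-byte default simply mirrors the de-facto property mentioned in the post, which (as the post stresses) is not an architectural guarantee.

```python
def within_aligned_block(addr, length, block=16):
    """Return True if bytes [addr, addr + length) lie in one aligned block.

    `block` is the alignment window in bytes (16 here, per the de-facto
    ifetch property described in the post; not an official guarantee).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    first_block = addr // block
    last_block = (addr + length - 1) // block
    return first_block == last_block


print(within_aligned_block(0x1000, 5))  # True: bytes 0x1000..0x1004
print(within_aligned_block(0x100E, 5))  # False: bytes 0x100E..0x1012 straddle 0x1010
```

A patcher relying on the de-facto property would presumably have to fall back to the stop-all-CPUs approach whenever this check fails for the instruction being rewritten.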
From: Terje Mathisen on 13 Jun 2010 07:31

Andy 'Krazy' Glew wrote:
> Now, there are de-facto instruction fetch atomicity properties. But
> nothing official. (E.g. on Intel (since P6), and I believe AMD, any
> ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
> P6) will clear when the first byte is written; i.e. Intel recognizes SMC

Intel have done so since the 486 I believe, definitely since the Pentium!

From the 8088 and up to the 386 you could measure the size of the instruction prefetch buffer by first doing a very long-running instruction that did not touch memory (e.g. DIV), then a REP STOSB which would overwrite the immediately following instructions with NOP bytes.

Those bytes would all start out as single-byte INC reg opcodes, so the number of them that got executed was a pretty good indication of the size of the prefetch buffer.

(I believe this all ran within a CLI/STI block, i.e. with interrupts disabled...)

> immediately (as, I think, does AMD).) So I believe that the algorithm I
> describe will work on Intel since P6, for WB memory. I think that it also
> will work for UC memory. (Glew's rule: any architectural property should
> work when caches are disabled.)

That's a _very_ good rule, since the opposite could make debugging the cpu _far_ worse.

>
> But there is nothing official.

:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: nmm1 on 13 Jun 2010 08:23

In article <55see7-4re.ln1(a)ntp.tmsw.no>, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Andy 'Krazy' Glew wrote:
>> Now, there are de-facto instruction fetch atomicity properties. But
>> nothing official. ...
>
>> (Glew's rule: any architectural property should
>> work when caches are disabled.)
>
>That's a _very_ good rule, since the opposite could make debugging the
>cpu _far_ worse.

All rules have exceptions, but they should be protected by barriers proportional to the loss of sanity involved in breaking them. That one should definitely be safe outside a maximum security enclosure (e.g. features provided for use only in machine-check handlers may need to break it, but those should NOT be used outside such code).

>> But there is nothing official.
>:-)

Personally, I think the author of any normal code (including kernel) who relies on instruction fetch atomicity needs reeducation. Yes, I have done that, but it was a long time ago, the constraints were those of the 1970s, and I wouldn't do it again!

Regards,
Nick Maclaren.

From: MitchAlsup on 13 Jun 2010 09:53

On Jun 12, 7:17 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> Now, there are de-facto instruction fetch atomicity properties. But nothing official. (E.g. on Intel (since P6), and I
> believe AMD, any ifetch entirely within an aligned 16 bytes is atomic. And Intel (since P6) will clear when the first
> byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).)
<snip>
>
> But there is nothing official.

And there IS the problem.

AMD Athlons and Opterons obey the self-modifying code checks wrt instruction fetch, but do not on third-party modifications of code, simply because there is no spec as to what is required (and the addresses to be checked are quite distant from the addresses that need checking). {This was circa '07 and may have been changed.}

When we investigated this issue (circa '05) there was no code sequence that would successfully do this that would work on both Intel and AMD machines. {IBM Java JIT compiler triggered the investigation.} So, the compiler had to do a CPUID early and use this result to pick a code sequence later, as needed.

>(Glew's rule: any architectural property should work when caches are disabled.)

This rule will prevent bundling more than one instruction into a single atomic sequence, should you ever want such a longer atomic operation with multiple independent memory references. Thus something like ASF can only work when the instructions and data are both cacheable and both caches are turned on. {Note: I am not disagreeing with the general principle involved, but there are times.....}

Mitch
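The "CPUID early, pick a code sequence later" arrangement can be sketched as a simple lookup. This is an illustration under assumptions, not what the JIT in question did: the vendor strings are the ones CPUID leaf 0 reports, but the strategy names, the mapping, and `pick_patch_strategy` are hypothetical, and the vendor string is passed in as a plain argument rather than read from the hardware.

```python
# Vendor strings as reported by CPUID leaf 0.
# The strategy names are hypothetical labels, not real APIs.
STRATEGY_BY_VENDOR = {
    # Assumption taken from the thread: the INT3 smear is believed (unofficially)
    # to work on Intel since P6.
    "GenuineIntel": "int3_smear",
}

# Assumption: anything else, including "AuthenticAMD", uses the conservative
# stop-all-CPUs shootdown described earlier in the thread.
FALLBACK_STRATEGY = "stop_all_cpus"


def pick_patch_strategy(vendor):
    """Map a CPUID vendor string to the name of a code-patching strategy."""
    return STRATEGY_BY_VENDOR.get(vendor, FALLBACK_STRATEGY)


# Decided once at startup, consulted whenever code must be patched.
print(pick_patch_strategy("GenuineIntel"))   # int3_smear
print(pick_patch_strategy("AuthenticAMD"))   # stop_all_cpus
```

Defaulting unknown vendors to the shootdown keeps the unofficial behaviour opt-in per vendor, which seems the safer direction given that nothing here is architecturally specified.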