From: Betov on 10 Sep 2005 04:11

"sevagK" <kahlinor(a)yahoo.com> wrote in news:1126299080.429453.180180@g49g2000cwa.googlegroups.com:

> On another note, congratulations on finally implementing the jump table
> optimizations after Randall showed that RosAsm lacked this ability and
> suffered because of it (even if you didn't do a complete job of it, we
> have come to expect half-finished features in RosAsm).
> You can thank Randy for influencing yet another innovation for RosAsm,
> which may soon be able to match features with the other assemblers.
> By the way, no one is going to buy your nonsense that this was "planned
> since day one."

You should stop smoking the curtains.

Betov.

< http://rosasm.org >
From: randyhyde@earthlink.net on 11 Sep 2005 10:12

Betov wrote:
> 4) Yes, of course, in the "Multi-Pass-Emulation", all of
> the Tables (and the Code Section, and all) are entirely
> compacted, before scanning the Map Table again, in the
> Main Loop, which is as trivial as:
>
> JmpsOptimize:
>     call ScanShortenJmpsTable
>
>     While D$NumberOfLongToShort <> 0
>         call CompactLabelListAndCodeRef
>         call CompactCodeList
>         call CompactShortenJmpsTable
>         call ScanShortenJmpsTable
>     End_While
> ret

Just for your information, this *is* a multi-pass algorithm. Your argument that you're doing a single pass is incorrect. Indeed, if I interpret the above correctly, this is a typical O(n**2) algorithm with the limitation (and not a bad one, I might point out) that once a branch is shortened, it is never again considered (that is, you won't lengthen it once it has been shortened). Fast, but it doesn't guarantee optimal code (and no *tractable* algorithm will guarantee optimal code, so again, this limitation isn't that bad).

Cheers,
Randy Hyde
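[Editor's note] The loop quoted above can be sketched in Python to make the multi-pass behavior concrete. This is a hypothetical sketch, not RosAsm's actual code: start every conditional jump in its 6-byte form, then repeatedly scan and shorten any jump whose displacement now fits in a signed byte, until a full scan shortens nothing.

```python
# Hypothetical sketch of the "scan, compact, rescan" fixpoint loop:
# multi-pass, worst-case O(n^2), and a shortened branch is never
# lengthened again -- exactly the limitation Randy points out.

LONG, SHORT = 6, 2  # Jcc rel32 = 6 bytes, Jcc rel8 = 2 bytes

def shorten_fixpoint(targets):
    """targets[i] is the index of the instruction that jump i branches to.
    Returns the final size of each jump and the number of scan passes."""
    size = [LONG] * len(targets)
    passes = 0
    changed = True
    while changed:                      # the While loop in the quoted code
        changed = False
        passes += 1
        # byte offset of each instruction under the current size choices
        offs = [0]
        for s in size:
            offs.append(offs[-1] + s)
        for i, tgt in enumerate(targets):
            if size[i] == LONG:
                # rel8 is measured from the end of the 2-byte encoding
                disp = offs[tgt] - (offs[i] + SHORT)
                if -128 <= disp <= 127:
                    size[i] = SHORT     # shortened; never reconsidered
                    changed = True
    return size, passes
```

For two adjacent jumps that target each other, both shorten on the first pass and the second pass confirms the fixpoint.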
From: randyhyde@earthlink.net on 11 Sep 2005 14:10

Elsewhere in this thread, Rene has mentioned that he uses a "start long, optimize short" algorithm for branch displacement optimization. This scheme has the advantage that if you ever stop prematurely during the optimization process, the program will still run correctly. However, this algorithm often produces results that are not as optimal as they could be. Consider the following (HLA) code:

program t;
#include ("stdlib.hhf")
static
    lblCnt :uns32;
begin t;

    B0:  jmp L0;    B1:  jmp L1;    B2:  jmp L2;    B3:  jmp L3;
    B4:  jmp L4;    B5:  jmp L5;    B6:  jmp L6;    B7:  jmp L7;
    B8:  jmp L8;    B9:  jmp L9;    B10: jmp L10;   B11: jmp L11;
    B12: jmp L12;   B13: jmp L13;   B14: jmp L14;   B15: jmp L15;
    B16: jmp L16;   B17: jmp L17;   B18: jmp L18;   B19: jmp L19;
    B20: jmp L20;   B21: jmp L21;   B22: jmp L22;   B23: jmp L23;
    B24: jmp L24;   B25: jmp L25;

    // Magic statement
    jmp L26;

    L0:  jmp B0;    L1:  jmp B1;    L2:  jmp B2;    L3:  jmp B3;
    L4:  jmp B4;    L5:  jmp B5;    L6:  jmp B6;    L7:  jmp B7;
    L8:  jmp B8;    L9:  jmp B9;    L10: jmp B10;   L11: jmp B11;
    L12: jmp B12;   L13: jmp B13;   L14: jmp B14;   L15: jmp B15;
    L16: jmp B16;   L17: jmp B17;   L18: jmp B18;   L19: jmp B19;
    L20: jmp B20;   L21: jmp B21;   L22: jmp B22;   L23: jmp B23;
    L24: jmp B24;   L25: jmp B25;

L26:
end t;

If all of these statements are encoded as short jumps (even several more, in fact), then all the branches are within range. However, if they are all encoded as far branches, then no single displacement will be within the +/- 128 byte range allowed by a single-byte displacement. The problem with starting long and optimizing short is that an assembler that scans through these instructions encoded long will never see the opportunity to optimize these jumps -- they are all out of range as far as the assembler is concerned. However, an assembler that "starts short, extends long (as necessary)" will produce the shortest form of this instruction code.
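[Editor's note] The two strategies Randy contrasts can be demonstrated with a small Python sketch on a symmetric version of the pattern above. This is illustrative code, not taken from any assembler: n forward jumps each target one of n matching backward jumps, so an all-short layout is valid, but under an all-long layout no single displacement fits in a byte. "Start long" therefore never sees a chance to shorten, while "start short" never needs to lengthen.

```python
# Hypothetical sketch contrasting "start long, optimize short" with
# "start short, extend long" on a symmetric jump pattern.

SHORT, LONG = 2, 6

def layout(sizes):
    # byte offsets of every instruction under the current size choices
    offs = [0]
    for s in sizes:
        offs.append(offs[-1] + s)
    return offs

def fits_short(i, tgt, offs):
    disp = offs[tgt] - (offs[i] + SHORT)   # rel8, from end of short form
    return -128 <= disp <= 127

def start_long(targets):
    size = [LONG] * len(targets)
    changed = True
    while changed:                          # shorten anything that fits
        changed = False
        offs = layout(size)
        for i, t in enumerate(targets):
            if size[i] == LONG and fits_short(i, t, offs):
                size[i] = SHORT
                changed = True
    return size

def start_short(targets):
    size = [SHORT] * len(targets)
    changed = True
    while changed:                          # lengthen anything that doesn't fit
        changed = False
        offs = layout(size)
        for i, t in enumerate(targets):
            if size[i] == SHORT and not fits_short(i, t, offs):
                size[i] = LONG
                changed = True
    return size

# 26 forward jumps into 26 matching backward jumps
n = 26
targets = [n + i for i in range(n)] + list(range(n))
```

Running both on `targets` shows start_short keeping all 52 jumps at 2 bytes while start_long leaves all 52 at 6 bytes, exactly the behavior described above.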
This particular code is "magic" insofar as the optimal and unoptimal versions of the code can be selected by changing just one statement (the "magic" statement). Make this jump a short jump (or replace it with two NOPs, which take the same amount of space as a short jump) and all the instructions get encoded in the short form by a typical "start long, optimize short" assembler. Put in a long jump, and the assembler will probably leave them all long (which is *not* optimized, obviously).

RosAsm uses a "start long, optimize short" branch displacement optimizer, but interestingly enough it won't even produce an optimal version of this program if you substitute a short branch for the jump. Problem #1 is that RosAsm only optimizes conditional jumps; it does not also optimize absolute jumps. But even if you change all these absolute jumps to conditional jumps, RosAsm's arithmetic seems to be off, and it will not produce optimal code until you shrink the code by a small amount. For example, the RosAsm program:

Main:
    B0: jc L0>>;  B1: jc L1>>;  B2: jc L2>>;  B3: jc L3>>;  B4: jc L4>>;
    B5: jc L5>>;  B6: jc L6>>;  B7: jc L7>>;  B8: jc L8>>;  B9: jc L9>>;
    C0: jc M0>>;  C1: jc M1>>;  C2: jc M2>>;  C3: jc M3>>;  C4: jc M4>>;
    C5: jc M5>>;  C6: jc M6>>;  C7: jc M7>>;  C8: jc M8>>;  C9: jc M9>>;
    D0: jc N0>>;
    ;jc N6>
    L0: jc B0<<;  L1: jc B1<<;  L2: jc B2<<;  L3: jc B3<<;  L4: jc B4<<;
    L5: jc B5<<;  L6: jc B6<<;  L7: jc B7<<;  L8: jc B8<<;  L9: jc B9<<;
    M0: jc C0<<;  M1: jc C1<<;  M2: jc C2<<;  M3: jc C3<<;  M4: jc C4<<;
    M5: jc C5<<;  M6: jc C6<<;  M7: jc C7<<;  M8: jc C8<<;  M9: jc C9<<;
    N0: jc D0<<;
N6: push 0
    call 'Kernel32.ExitProcess'

produces the following object code (disassembled by dumpbin):

00403000: 0F 82 7B 00 00 00  jb 00403081
00403006: 0F 82 7B 00 00 00  jb 00403087
0040300C: 0F 82 7B 00 00 00  jb 0040308D
00403012: 0F 82 7B 00 00 00  jb 00403093
00403018: 0F 82 7B 00 00 00  jb 00403099
0040301E: 0F 82 7B 00 00 00  jb 0040309F
00403024: 0F 82 7B 00 00 00  jb 004030A5
0040302A: 0F 82 7B 00 00 00  jb 004030AB
00403030: 0F 82 7B 00 00 00  jb 004030B1
00403036: 0F 82 7B 00 00 00  jb 004030B7
0040303C: 0F 82 7B 00 00 00  jb 004030BD
00403042: 0F 82 7B 00 00 00  jb 004030C3
00403048: 0F 82 7B 00 00 00  jb 004030C9
0040304E: 0F 82 7B 00 00 00  jb 004030CF
00403054: 0F 82 7B 00 00 00  jb 004030D5
0040305A: 0F 82 7B 00 00 00  jb 004030DB
00403060: 0F 82 7B 00 00 00  jb 004030E1
00403066: 0F 82 7B 00 00 00  jb 004030E7
0040306C: 0F 82 7B 00 00 00  jb 004030ED
00403072: 0F 82 7B 00 00 00  jb 004030F3
00403078: 0F 82 7B 00 00 00  jb 004030F9
0040307E: 0F 82 7C FF FF FF  jb 00403000
00403084: 0F 82 7C FF FF FF  jb 00403006
0040308A: 0F 82 7C FF FF FF  jb 0040300C
00403090: 0F 82 7C FF FF FF  jb 00403012
00403096: 0F 82 7C FF FF FF  jb 00403018
0040309C: 0F 82 7C FF FF FF  jb 0040301E
004030A2: 0F 82 7C FF FF FF  jb 00403024
004030A8: 0F 82 7C FF FF FF  jb 0040302A
004030AE: 0F 82 7C FF FF FF  jb 00403030
004030B4: 0F 82 7C FF FF FF  jb 00403036
004030BA: 0F 82 7C FF FF FF  jb 0040303C
004030C0: 0F 82 7C FF FF FF  jb 00403042
004030C6: 0F 82 7C FF FF FF  jb 00403048
004030CC: 0F 82 7C FF FF FF  jb 0040304E
004030D2: 0F 82 7C FF FF FF  jb 00403054
004030D8: 0F 82 7C FF FF FF  jb 0040305A
004030DE: 0F 82 7C FF FF FF  jb 00403060
004030E4: 0F 82 7C FF FF FF  jb 00403066
004030EA: 0F 82 7C FF FF FF  jb 0040306C
004030F0: 0F 82 7C FF FF FF  jb 00403072
004030F6: 0F 82 7C FF FF FF  jb 00403078
004030FC: 6A 00              push 0
004030FE: FF 15 30 10 40 00  call dword ptr ds:[00401030h]

As you can see, the branch displacements would all be well within the +/- 128 byte range under a short encoding, yet RosAsm *still* uses long displacements ($7b forward and -$84 backward).
However, if I delete two JC instructions (at labels D0 and N0), then I get the following:

00403000: 72 26  jb 00403028
00403002: 72 26  jb 0040302A
00403004: 72 26  jb 0040302C
00403006: 72 26  jb 0040302E
00403008: 72 26  jb 00403030
0040300A: 72 26  jb 00403032
0040300C: 72 26  jb 00403034
0040300E: 72 26  jb 00403036
00403010: 72 26  jb 00403038
00403012: 72 26  jb 0040303A
00403014: 72 26  jb 0040303C
00403016: 72 26  jb 0040303E
00403018: 72 26  jb 00403040
0040301A: 72 26  jb 00403042
0040301C: 72 26  jb 00403044
0040301E: 72 26  jb 00403046
00403020: 72 26  jb 00403048
00403022: 72 26  jb 0040304A
00403024: 72 26  jb 0040304C
00403026: 72 26  jb 0040304E
00403028: 72 D6  jb 00403000
0040302A: 72 D6  jb 00403002
0040302C: 72 D6  jb 00403004
0040302E: 72 D6  jb 00403006
00403030: 72 D6  jb 00403008
00403032: 72 D6  jb 0040300A
00403034: 72 D6  jb 0040300C
00403036: 72 D6  jb 0040300E
00403038: 72 D6  jb 00403010
0040303A: 72 D6  jb 00403012
0040303C: 72 D6  jb 00403014
0040303E: 72 D6  jb 00403016
00403040: 72 D6  jb 00403018
00403042: 72 D6  jb 0040301A
00403044: 72 D6  jb 0040301C
00403046: 72 D6  jb 0040301E
00403048: 72 D6  jb 00403020
0040304A: 72 D6  jb 00403022
0040304C: 72 D6  jb 00403024
0040304E: 72 D6  jb 00403026
00403050: 6A 00              push 0
00403052: FF 15 30 10 40 00  call dword ptr ds:[00401030h]

And now the assembler optimizes the branches down to two bytes each (from six). Also note that the forward displacement drops from $7b to $26 (and likewise for the backward jumps). The difference between $7b and $26 is considerable: we should be able to stick a *ton* of additional instructions in here without forcing these jumps to go from two bytes to six. Clearly, RosAsm is not doing a very good job of optimization here. I'm picking on RosAsm, but the truth is that this problem exists for *any* assembler that uses a "start large, optimize small" heuristic. Even MASM is not free of this problem (though it does a better job of optimization than RosAsm does).
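[Editor's note] The displacement bytes in the two dumps can be verified by hand; an x86 Jcc displacement is measured relative to the address of the *next* instruction, i.e. the end of the jump's own encoding. A quick Python check:

```python
# Verifying the displacement bytes shown in the disassemblies above.

def rel(src, size, dst):
    """Displacement of a jump of `size` bytes at `src` targeting `dst`."""
    return dst - (src + size)

# long form: 6-byte jb at 00403000 targeting 00403081 -> disp $7B
print(hex(rel(0x403000, 6, 0x403081)))          # 0x7b

# short form: 2-byte jb at 00403000 targeting 00403028 -> disp $26
print(hex(rel(0x403000, 2, 0x403028)))          # 0x26

# backward short jump: jb at 00403028 targeting 00403000 -> -$2A,
# encoded as the byte $D6
print(hex(rel(0x403028, 2, 0x403000) & 0xFF))   # 0xd6
```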
One *big* advantage of "start large, optimize small" is that you can stop the optimization process at any time and *still* have a working program. Because branch displacement optimization is inherently an NP-Complete problem, there exist certain cases where the number of iterations needed to optimize the code would be far too great. An assembler that starts large and optimizes small can keep a loop counter and stop the optimization process when the count exceeds some limit. This bounds the number of iterations while still providing a reasonable level of optimization (keep in mind that the majority of the optimizations occur on the *first* pass of this algorithm). In theory, a "start small, work larger" algorithm could be forced to run (almost) forever optimizing some large programs. In practice, few optimizers implement a truly NP-Complete algorithm, and as such they complete in polynomial time (meaning that even if they work slowly, they are tractable solutions). They don't guarantee optimal results, but they still tend to produce better results than "start large, optimize small" algorithms, and they usually run much faster because most of the branches in a typical program *are* short.

Back to RosAsm's algorithm, I see a few problems:

1) RosAsm only optimizes conditional jumps. It does not optimize JMP instructions or any other variable-length instructions whose size can change based on various factors.

2) There is a bug in RosAsm's algorithm -- it doesn't always optimize even when it has the chance (e.g., the example given earlier).

3) RosAsm only applies the branch displacement optimization to conditional jumps whose target label is a local symbol. If you use a regular target symbol, RosAsm always encodes the instruction with a large displacement. This makes no sense at all: if you can optimize one label type, why not both? Just a bad design decision on Rene's part.
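[Editor's note] The bounded-iteration idea described above can be sketched in a few lines of Python. This is hypothetical code, not taken from any real assembler: a "start large, optimize small" assembler caps its optimization passes and still emits a correct (merely suboptimal) program, because every branch left long is still a valid encoding.

```python
# Sketch of a pass-capped "start large, optimize small" optimizer.

LONG, SHORT = 6, 2
MAX_PASSES = 8      # arbitrary cap; most shortening happens on pass 1

def optimize_capped(targets, max_passes=MAX_PASSES):
    """targets[i] is the index of the instruction that jump i branches to."""
    size = [LONG] * len(targets)
    for _ in range(max_passes):
        offs = [0]                       # recompute layout for this pass
        for s in size:
            offs.append(offs[-1] + s)
        changed = False
        for i, t in enumerate(targets):
            if size[i] == LONG:
                disp = offs[t] - (offs[i] + SHORT)
                if -128 <= disp <= 127:
                    size[i] = SHORT
                    changed = True
        if not changed:                  # early fixpoint: safe to stop
            break
    return size                          # any leftover LONGs are still correct
```

With the cap set to zero, the function simply emits everything long, which is exactly the "stop at any time and still have a working program" property.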
Cheers, Randy Hyde
From: anonymous on 11 Sep 2005 17:35

Randy wrote:
> Elsewhere in this thread, Rene has mentioned that he uses a "start
> long, optimize short" algorithm for branch displacement optimization.
> This scheme has the advantage that if you ever stop prematurely during
> the optimization process, the program will still run correctly.
> However, this algorithm often produces results that are not as optimal
> as they could be.
<snipped>
> Back to RosAsm's algorithm, I see a few problems:
>
> 1) RosAsm only optimizes conditional jumps. It does not optimize JMP
> instructions or any other variable-length instructions whose size can
> change based on various factors.
>
> 2) There is a bug in RosAsm's algorithm -- it doesn't always optimize
> even when it has the chance (e.g., the example given earlier).
>
> 3) RosAsm only applies the branch displacement optimization to
> conditional jumps whose target label is a local symbol. If you use a
> regular target symbol, RosAsm always encodes the instruction with a
> large displacement. This makes no sense at all: if you can optimize
> one label type, why not both? Just a bad design decision on Rene's
> part.

Who was this written for/to?

greetz
Jakob Wrigley
From: Paul Dunn on 12 Sep 2005 12:11
Betov wrote:
[Some stuff that's not really relevant]

Betov, did you not read my last question? I'm interested in your definition of "emulation", so that I understand what you mean by it. I don't think it's the same definition as the one I use. Can you explain?