From: Betov on 10 Sep 2005 04:11

"sevagK" <kahlinor(a)yahoo.com> wrote in news:1126299080.429453.180180@g49g2000cwa.googlegroups.com:

> On another note, congratulations on finally implementing the jump table
> optimizations after Randall showed that RosAsm lacked this ability and
> suffered because of it (even if you didn't do a complete job of it, we
> have come to expect half-finished features in RosAsm).
> You can thank Randy for influencing yet another innovation for RosAsm,
> which may soon be able to match features with the other assemblers.
> By the way, no one is going to buy your nonsense that this was "planned
> since day one."

You should stop smoking the curtains.

Betov.

< http://rosasm.org >
From: randyhyde@earthlink.net on 11 Sep 2005 10:12

Betov wrote:
> 4) Yes, of course, in the "Multi-Pass-Emulation", all of
> the Tables (and the Code Section, and all) are entirely
> compacted, before scanning the Map Table again, in the
> Main Loop, which is as trivial as:
>
> JmpsOptimize:
>     call ScanShortenJmpsTable
>
>     While D$NumberOfLongToShort <> 0
>         call CompactLabelListAndCodeRef
>         call CompactCodeList
>         call CompactShortenJmpsTable
>         call ScanShortenJmpsTable
>     End_While
> ret

Just for your information, this *is* a multi-pass algorithm. Your argument that you're doing a single pass is incorrect. Indeed, if I interpret the above correctly, this is a typical O(n**2) algorithm with the limitation (and not a bad one, I might point out) that once a branch is shortened, it is never again considered (that is, you won't lengthen it once it has been shortened). Fast, but it doesn't guarantee optimal code (and no *tractable* algorithm will guarantee optimal code, so again, this limitation isn't that bad).

Cheers,
Randy Hyde
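[Editor's note] The loop quoted above can be sketched in Python to make the multi-pass behavior concrete. This is a hypothetical sketch, not RosAsm's actual code: start every conditional jump in its 6-byte form, then repeatedly scan and shorten any jump whose displacement now fits in a signed byte, until a full scan shortens nothing.

```python
# Hypothetical sketch of the "scan, compact, rescan" fixpoint loop:
# multi-pass, worst-case O(n^2), and a shortened branch is never
# lengthened again -- exactly the limitation Randy points out.

LONG, SHORT = 6, 2  # Jcc rel32 = 6 bytes, Jcc rel8 = 2 bytes

def shorten_fixpoint(targets):
    """targets[i] is the index of the instruction that jump i branches to.
    Returns the final size of each jump and the number of scan passes."""
    size = [LONG] * len(targets)
    passes = 0
    changed = True
    while changed:                      # the While loop in the quoted code
        changed = False
        passes += 1
        # byte offset of each instruction under the current size choices
        offs = [0]
        for s in size:
            offs.append(offs[-1] + s)
        for i, tgt in enumerate(targets):
            if size[i] == LONG:
                # rel8 is measured from the end of the 2-byte encoding
                disp = offs[tgt] - (offs[i] + SHORT)
                if -128 <= disp <= 127:
                    size[i] = SHORT     # shortened; never reconsidered
                    changed = True
    return size, passes
```

For two adjacent jumps that target each other, both shorten on the first pass and the second pass confirms the fixpoint.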
From: randyhyde@earthlink.net on 11 Sep 2005 14:10

Elsewhere in this thread, Rene has mentioned that he uses a "start long, optimize short" algorithm for branch displacement optimization. This scheme has the advantage that if you ever stop prematurely during the optimization process, the program will still run correctly. However, this algorithm often produces results that are not as optimal as they could be. Consider the following (HLA) code:

program t;
#include ("stdlib.hhf")
static
    lblCnt :uns32;
begin t;

    B0:  jmp L0;    B1:  jmp L1;    B2:  jmp L2;    B3:  jmp L3;
    B4:  jmp L4;    B5:  jmp L5;    B6:  jmp L6;    B7:  jmp L7;
    B8:  jmp L8;    B9:  jmp L9;    B10: jmp L10;   B11: jmp L11;
    B12: jmp L12;   B13: jmp L13;   B14: jmp L14;   B15: jmp L15;
    B16: jmp L16;   B17: jmp L17;   B18: jmp L18;   B19: jmp L19;
    B20: jmp L20;   B21: jmp L21;   B22: jmp L22;   B23: jmp L23;
    B24: jmp L24;   B25: jmp L25;

    // Magic statement
    jmp L26;

    L0:  jmp B0;    L1:  jmp B1;    L2:  jmp B2;    L3:  jmp B3;
    L4:  jmp B4;    L5:  jmp B5;    L6:  jmp B6;    L7:  jmp B7;
    L8:  jmp B8;    L9:  jmp B9;    L10: jmp B10;   L11: jmp B11;
    L12: jmp B12;   L13: jmp B13;   L14: jmp B14;   L15: jmp B15;
    L16: jmp B16;   L17: jmp B17;   L18: jmp B18;   L19: jmp B19;
    L20: jmp B20;   L21: jmp B21;   L22: jmp B22;   L23: jmp B23;
    L24: jmp B24;   L25: jmp B25;

L26:
end t;

If all of these statements are encoded as short jumps (even several more, in fact), then all the branches are within range. However, if they are all encoded as far branches, then no single displacement will be within the +/- 128 byte range allowed by a single-byte displacement. The problem with starting long and optimizing short is that an assembler that scans through these instructions encoded long will never see the opportunity to optimize these jumps -- they are all out of range as far as the assembler is concerned. However, an assembler that "starts short, extends long (as necessary)" will produce the shortest form of this instruction code.
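[Editor's note] The two strategies Randy contrasts can be demonstrated with a small Python sketch on a symmetric version of the pattern above. This is illustrative code, not taken from any assembler: n forward jumps each target one of n matching backward jumps, so an all-short layout is valid, but under an all-long layout no single displacement fits in a byte. "Start long" therefore never sees a chance to shorten, while "start short" never needs to lengthen.

```python
# Hypothetical sketch contrasting "start long, optimize short" with
# "start short, extend long" on a symmetric jump pattern.

SHORT, LONG = 2, 6

def layout(sizes):
    # byte offsets of every instruction under the current size choices
    offs = [0]
    for s in sizes:
        offs.append(offs[-1] + s)
    return offs

def fits_short(i, tgt, offs):
    disp = offs[tgt] - (offs[i] + SHORT)   # rel8, from end of short form
    return -128 <= disp <= 127

def start_long(targets):
    size = [LONG] * len(targets)
    changed = True
    while changed:                          # shorten anything that fits
        changed = False
        offs = layout(size)
        for i, t in enumerate(targets):
            if size[i] == LONG and fits_short(i, t, offs):
                size[i] = SHORT
                changed = True
    return size

def start_short(targets):
    size = [SHORT] * len(targets)
    changed = True
    while changed:                          # lengthen anything that doesn't fit
        changed = False
        offs = layout(size)
        for i, t in enumerate(targets):
            if size[i] == SHORT and not fits_short(i, t, offs):
                size[i] = LONG
                changed = True
    return size

# 26 forward jumps into 26 matching backward jumps
n = 26
targets = [n + i for i in range(n)] + list(range(n))
```

Running both on `targets` shows start_short keeping all 52 jumps at 2 bytes while start_long leaves all 52 at 6 bytes, exactly the behavior described above.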
This particular code is "magic" insofar as the optimal and unoptimal versions of the code can be selected by changing just one statement (the "magic" statement). Make this jump a short jump (or replace it with two NOPs, which take the same amount of space as a short jump) and all the instructions get encoded in the short form by a typical "start long, optimize short" assembler. Put in a long jump, and the assembler will probably leave them all long (which is *not* optimized, obviously).

RosAsm uses a "start long, optimize short" branch displacement optimizer, but interestingly enough it won't even produce an optimal version of this program if you substitute a short branch for the jump. Problem #1 is that RosAsm only optimizes conditional jumps; it does not also optimize absolute jumps. But even if you change all these absolute jumps to conditional jumps, RosAsm's arithmetic seems to be off, and it will not produce optimal code until you shrink the code by a small amount. For example, the RosAsm program:

Main:
    B0: jc L0>>;  B1: jc L1>>;  B2: jc L2>>;  B3: jc L3>>;  B4: jc L4>>;
    B5: jc L5>>;  B6: jc L6>>;  B7: jc L7>>;  B8: jc L8>>;  B9: jc L9>>;
    C0: jc M0>>;  C1: jc M1>>;  C2: jc M2>>;  C3: jc M3>>;  C4: jc M4>>;
    C5: jc M5>>;  C6: jc M6>>;  C7: jc M7>>;  C8: jc M8>>;  C9: jc M9>>;
    D0: jc N0>>;
    ;jc N6>
    L0: jc B0<<;  L1: jc B1<<;  L2: jc B2<<;  L3: jc B3<<;  L4: jc B4<<;
    L5: jc B5<<;  L6: jc B6<<;  L7: jc B7<<;  L8: jc B8<<;  L9: jc B9<<;
    M0: jc C0<<;  M1: jc C1<<;  M2: jc C2<<;  M3: jc C3<<;  M4: jc C4<<;
    M5: jc C5<<;  M6: jc C6<<;  M7: jc C7<<;  M8: jc C8<<;  M9: jc C9<<;
    N0: jc D0<<;
N6: push 0
    call 'Kernel32.ExitProcess'

produces the following object code (disassembled by dumpbin):

00403000: 0F 82 7B 00 00 00  jb 00403081
00403006: 0F 82 7B 00 00 00  jb 00403087
0040300C: 0F 82 7B 00 00 00  jb 0040308D
00403012: 0F 82 7B 00 00 00  jb 00403093
00403018: 0F 82 7B 00 00 00  jb 00403099
0040301E: 0F 82 7B 00 00 00  jb 0040309F
00403024: 0F 82 7B 00 00 00  jb 004030A5
0040302A: 0F 82 7B 00 00 00  jb 004030AB
00403030: 0F 82 7B 00 00 00  jb 004030B1
00403036: 0F 82 7B 00 00 00  jb 004030B7
0040303C: 0F 82 7B 00 00 00  jb 004030BD
00403042: 0F 82 7B 00 00 00  jb 004030C3
00403048: 0F 82 7B 00 00 00  jb 004030C9
0040304E: 0F 82 7B 00 00 00  jb 004030CF
00403054: 0F 82 7B 00 00 00  jb 004030D5
0040305A: 0F 82 7B 00 00 00  jb 004030DB
00403060: 0F 82 7B 00 00 00  jb 004030E1
00403066: 0F 82 7B 00 00 00  jb 004030E7
0040306C: 0F 82 7B 00 00 00  jb 004030ED
00403072: 0F 82 7B 00 00 00  jb 004030F3
00403078: 0F 82 7B 00 00 00  jb 004030F9
0040307E: 0F 82 7C FF FF FF  jb 00403000
00403084: 0F 82 7C FF FF FF  jb 00403006
0040308A: 0F 82 7C FF FF FF  jb 0040300C
00403090: 0F 82 7C FF FF FF  jb 00403012
00403096: 0F 82 7C FF FF FF  jb 00403018
0040309C: 0F 82 7C FF FF FF  jb 0040301E
004030A2: 0F 82 7C FF FF FF  jb 00403024
004030A8: 0F 82 7C FF FF FF  jb 0040302A
004030AE: 0F 82 7C FF FF FF  jb 00403030
004030B4: 0F 82 7C FF FF FF  jb 00403036
004030BA: 0F 82 7C FF FF FF  jb 0040303C
004030C0: 0F 82 7C FF FF FF  jb 00403042
004030C6: 0F 82 7C FF FF FF  jb 00403048
004030CC: 0F 82 7C FF FF FF  jb 0040304E
004030D2: 0F 82 7C FF FF FF  jb 00403054
004030D8: 0F 82 7C FF FF FF  jb 0040305A
004030DE: 0F 82 7C FF FF FF  jb 00403060
004030E4: 0F 82 7C FF FF FF  jb 00403066
004030EA: 0F 82 7C FF FF FF  jb 0040306C
004030F0: 0F 82 7C FF FF FF  jb 00403072
004030F6: 0F 82 7C FF FF FF  jb 00403078
004030FC: 6A 00              push 0
004030FE: FF 15 30 10 40 00  call dword ptr ds:[00401030h]

As you can see, the branch displacements would all be well within the +/- 128 byte range under a short encoding, yet RosAsm *still* uses long displacements ($7b forward and -$84 backward).
However, if I delete two JC instructions (at labels D0 and N0), then I get the following:

00403000: 72 26  jb 00403028
00403002: 72 26  jb 0040302A
00403004: 72 26  jb 0040302C
00403006: 72 26  jb 0040302E
00403008: 72 26  jb 00403030
0040300A: 72 26  jb 00403032
0040300C: 72 26  jb 00403034
0040300E: 72 26  jb 00403036
00403010: 72 26  jb 00403038
00403012: 72 26  jb 0040303A
00403014: 72 26  jb 0040303C
00403016: 72 26  jb 0040303E
00403018: 72 26  jb 00403040
0040301A: 72 26  jb 00403042
0040301C: 72 26  jb 00403044
0040301E: 72 26  jb 00403046
00403020: 72 26  jb 00403048
00403022: 72 26  jb 0040304A
00403024: 72 26  jb 0040304C
00403026: 72 26  jb 0040304E
00403028: 72 D6  jb 00403000
0040302A: 72 D6  jb 00403002
0040302C: 72 D6  jb 00403004
0040302E: 72 D6  jb 00403006
00403030: 72 D6  jb 00403008
00403032: 72 D6  jb 0040300A
00403034: 72 D6  jb 0040300C
00403036: 72 D6  jb 0040300E
00403038: 72 D6  jb 00403010
0040303A: 72 D6  jb 00403012
0040303C: 72 D6  jb 00403014
0040303E: 72 D6  jb 00403016
00403040: 72 D6  jb 00403018
00403042: 72 D6  jb 0040301A
00403044: 72 D6  jb 0040301C
00403046: 72 D6  jb 0040301E
00403048: 72 D6  jb 00403020
0040304A: 72 D6  jb 00403022
0040304C: 72 D6  jb 00403024
0040304E: 72 D6  jb 00403026
00403050: 6A 00              push 0
00403052: FF 15 30 10 40 00  call dword ptr ds:[00401030h]

And now the assembler optimizes the branches down to two bytes each (from six). Also note that the forward displacement drops from $7b to $26 (and likewise for the backward jumps). The difference between $7b and $26 is considerable: we should be able to stick a *ton* of additional instructions in here without forcing these jumps to go from two bytes to six. Clearly, RosAsm is not doing a very good job of optimization here. I'm picking on RosAsm, but the truth is that this problem exists for *any* assembler that uses a "start large, optimize small" heuristic. Even MASM is not free of this problem (though it does a better job of optimization than RosAsm does).
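[Editor's note] The displacement bytes in the two dumps can be verified by hand; an x86 Jcc displacement is measured relative to the address of the *next* instruction, i.e. the end of the jump's own encoding. A quick Python check:

```python
# Verifying the displacement bytes shown in the disassemblies above.

def rel(src, size, dst):
    """Displacement of a jump of `size` bytes at `src` targeting `dst`."""
    return dst - (src + size)

# long form: 6-byte jb at 00403000 targeting 00403081 -> disp $7B
print(hex(rel(0x403000, 6, 0x403081)))          # 0x7b

# short form: 2-byte jb at 00403000 targeting 00403028 -> disp $26
print(hex(rel(0x403000, 2, 0x403028)))          # 0x26

# backward short jump: jb at 00403028 targeting 00403000 -> -$2A,
# encoded as the byte $D6
print(hex(rel(0x403028, 2, 0x403000) & 0xFF))   # 0xd6
```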
One *big* advantage of "start large, optimize small" is that you can stop the optimization process at any time and *still* have a working program. Because branch displacement optimization is inherently an NP-Complete problem, there exist certain cases where the number of iterations needed to optimize the code would be far too great. An assembler that starts large and optimizes small can keep a loop counter and stop the optimization process when the count exceeds some limit. This bounds the number of iterations while still providing a reasonable level of optimization (keep in mind that the majority of the optimizations occur on the *first* pass of this algorithm). In theory, a "start small, work larger" algorithm could be forced to run (almost) forever optimizing some large programs. In practice, few optimizers implement a truly NP-Complete algorithm, and as such they complete in polynomial time (meaning that even if they work slowly, they are tractable solutions). They don't guarantee optimal results, but they still tend to produce better results than "start large, optimize small" algorithms, and they usually run much faster because most of the branches in a typical program *are* short.

Back to RosAsm's algorithm, I see a few problems:

1) RosAsm only optimizes conditional jumps. It does not optimize JMP instructions or any other variable-length instructions whose size can change based on various factors.

2) There is a bug in RosAsm's algorithm -- it doesn't always optimize even when it has the chance (e.g., the example given earlier).

3) RosAsm only applies the branch displacement optimization to conditional jumps whose target label is a local symbol. If you use a regular target symbol, RosAsm always encodes the instruction with a large displacement. This makes no sense at all: if you can optimize one label type, why not both? Just a bad design decision on Rene's part.
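[Editor's note] The bounded-iteration idea described above can be sketched in a few lines of Python. This is hypothetical code, not taken from any real assembler: a "start large, optimize small" assembler caps its optimization passes and still emits a correct (merely suboptimal) program, because every branch left long is still a valid encoding.

```python
# Sketch of a pass-capped "start large, optimize small" optimizer.

LONG, SHORT = 6, 2
MAX_PASSES = 8      # arbitrary cap; most shortening happens on pass 1

def optimize_capped(targets, max_passes=MAX_PASSES):
    """targets[i] is the index of the instruction that jump i branches to."""
    size = [LONG] * len(targets)
    for _ in range(max_passes):
        offs = [0]                       # recompute layout for this pass
        for s in size:
            offs.append(offs[-1] + s)
        changed = False
        for i, t in enumerate(targets):
            if size[i] == LONG:
                disp = offs[t] - (offs[i] + SHORT)
                if -128 <= disp <= 127:
                    size[i] = SHORT
                    changed = True
        if not changed:                  # early fixpoint: safe to stop
            break
    return size                          # any leftover LONGs are still correct
```

With the cap set to zero, the function simply emits everything long, which is exactly the "stop at any time and still have a working program" property.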
Cheers, Randy Hyde
From: anonymous on 11 Sep 2005 17:35

Randy wrote:
> Elsewhere in this thread, Rene has mentioned that he uses a "start
> long, optimize short" algorithm for branch displacement optimization.
> This scheme has the advantage that if you ever stop prematurely during
> the optimization process, the program will still run correctly.
> However, this algorithm often produces results that are not as optimal
> as they could be.
<snipped>
> Back to RosAsm's algorithm, I see a few problems:
>
> 1) RosAsm only optimizes conditional jumps. It does not optimize JMP
> instructions or any other variable-length instructions whose size can
> change based on various factors.
>
> 2) There is a bug in RosAsm's algorithm -- it doesn't always optimize
> even when it has the chance (e.g., the example given earlier).
>
> 3) RosAsm only applies the branch displacement optimization to
> conditional jumps whose target label is a local symbol. If you use a
> regular target symbol, RosAsm always encodes the instruction with a
> large displacement. This makes no sense at all: if you can optimize
> one label type, why not both? Just a bad design decision on Rene's
> part.

Who was this written for/to?

greetz
Jakob Wrigley
From: Paul Dunn on 12 Sep 2005 12:11
Betov wrote:
[Some stuff that's not really relevant]

Betov, did you not read my last question? I'm interested in your definition of "emulation", so that I understand what you mean by it. I don't think it's the same definition as the one I use. Can you explain?