status update 1 (Re: assembler speed...) [ASM]

Prev: x86 instruction set usage-difference between windows 95 and windows xp ?
Next: peter-bochs-debugger is a GUI debugger for bochs

From: BGB / cr88192 on 28 Mar 2010 01:53

"cr88192" <cr88192(a)hotmail.com> wrote in message
news:hofus4$a0r$1(a)news.albasani.net...
> well, this was a recent argument on comp.compilers, but I figured it may
> make some sense in a "freer" context.
>

well, a status update:
1.94 MB/s is the speed which can be gained with "normal" operation (textual
interface, preprocessor, jump optimization, ...);
5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
forces single-pass assembly.

10MB/s (analogue) can be gained by using a direct binary interface (newly
added).
in the case of this mode, most of the profile time goes into a few predicate
functions, and also the function for emitting opcode bytes. somehow, I don't
think it is likely to be getting that much faster.

stated another way: 643073 opcodes/second, or about 1.56us/op.
calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
2.31 GHz).

basically, I have a personal optimization heuristic:
when the top item reported by the profiler is the entry point to a switch
statement, it is not likely that all that many more optimizations are gained
(the so-called "switch limit"). a variant of this has happened in this case.

in the binary mode, the test fragment is pre-parsed into an array of
struct-pointers, and these structs are used to drive the assembler internals
(with pre-resolved opcode numbers, ...).
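
As an illustration of the general idea only (not the actual code from this assembler), a pre-parsed fragment driving an emitter could look roughly like the sketch below. It is written in Python for brevity. The `Op` record, the `OP_*` opcode numbers and the `emit`/`assemble` helpers are all hypothetical names, and the fragment encoded is the five-instruction sequence quoted later in the thread.

```python
from dataclasses import dataclass

# Hypothetical pre-resolved opcode numbers (not taken from any real assembler).
OP_XOR_RR, OP_INC_R, OP_OR_RI8, OP_INT = range(4)

EAX, ECX, EDX, EBX = range(4)  # x86 register numbers


@dataclass(frozen=True)
class Op:
    opnum: int   # already-resolved opcode number, so no mnemonic lookup at emit time
    a: int = 0   # first operand: register number or immediate
    b: int = 0   # second operand: register number or immediate


def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def emit(out: bytearray, op: Op) -> None:
    """Append the 32-bit-mode encoding of one op to out."""
    if op.opnum == OP_XOR_RR:      # 33 /r     xor r32, r/m32
        out.extend((0x33, modrm(3, op.a, op.b)))
    elif op.opnum == OP_INC_R:     # 40+rd     inc r32
        out.append(0x40 + op.a)
    elif op.opnum == OP_OR_RI8:    # 83 /1 ib  or r/m32, imm8
        out.extend((0x83, modrm(3, 1, op.a), op.b & 0xFF))
    elif op.opnum == OP_INT:       # CD ib     int imm8
        out.extend((0xCD, op.a & 0xFF))
    else:
        raise ValueError(f"unknown opcode number {op.opnum}")


def assemble(fragment) -> bytes:
    """Drive the emitter from a pre-parsed list of Op records."""
    out = bytearray()
    for op in fragment:
        emit(out, op)
    return bytes(out)


fragment = [
    Op(OP_XOR_RR, EAX, EAX),   # xor eax, eax
    Op(OP_INC_R, EAX),         # inc eax
    Op(OP_XOR_RR, EBX, EBX),   # xor ebx, ebx
    Op(OP_OR_RI8, EBX, 0x2A),  # or ebx, byte +0x2a
    Op(OP_INT, 0x80),          # int 0x80
]
code = assemble(fragment)
print(code.hex())  # 33c04033db83cb2acd80
```

The point of the design is that parsing and mnemonic lookup happen once, outside the timed loop, so re-assembly only pays for operand checks and byte emission.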

the fragment has 462 ops and manages to be re-assembled 41758 times before
the timer expires (timer expire is 30s, so 1391 re-assembles/second).
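
The per-op figures above can be re-derived from these counts. A small check, assuming the 30 s timer is exact and the CPU runs at a constant 2.31 GHz:

```python
ops_per_fragment = 462      # ops in the test fragment
reassemblies = 41758        # completed passes before the timer expired
seconds = 30.0              # timer length
cpu_hz = 2.31e9             # stated CPU clock

ops_per_sec = ops_per_fragment * reassemblies / seconds
us_per_op = 1e6 / ops_per_sec
cycles_per_op = cpu_hz / ops_per_sec

print(f"{ops_per_sec:.0f} ops/s")
print(f"{us_per_op:.3f} us/op")
print(f"{cycles_per_op:.0f} cycles/op")
```

The unrounded result is closer to 3592 cycles per op. The 3604 figure appears to come from multiplying the rounded 1.56 us by the clock rate, so the two agree to within rounding.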

to get any faster would likely involve sidestepping the assembler as well
(such as using a big switch and emitting bytes), but this is not something I
am going to test (would make about as much sense as benchmarking it against
memcpy or similar, since yes, memcpy is faster, but no, it is not an
assembler...).

so, at the moment, this means an approx 5x speed difference between the
fastest and the slowest modes.

I am not really sure if this is all that drastic of a difference...

or such...

From: Rod Pemberton on 28 Mar 2010 03:16

"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
news:homqsi$s25$1(a)news.albasani.net...
> [...]
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>

A few years ago, I posted the link below for large single file programs
(talking to you...). I'm not sure if you ever looked at their file sizes, but
the largest two were gcc as a single file and an ogg encoder as a single
file, at 3.2MB and 1.7MB respectively. Those are probably the largest
single file C programs you'll see. It's possible, even likely, some
multi-file project, say the Linux kernel etc., is larger. But, 10MB/s
should still be very good for most uses. But, there's no reason to stop
there, if you've got the time!

http://people.csail.mit.edu/smcc/projects/single-file-programs/

> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).

BTW, what brand of cpu, and what number of cores are being used?

> to get any faster would likely involve sidestepping the assembler as well
> (such as using a big switch and emitting bytes), but this is not something I
> am going to test (would make about as much sense as benchmarking it against
> memcpy or similar, since yes, memcpy is faster, but no, it is not an
> assembler...).

OpenWatcom is (or was) one of the fastest C compilers I've used. It skipped
emitting assembly. Given the speed, I'm sure they did much more than
that... It might provide a reference point for a speed comparison. I
haven't used more recent versions (I'm using v1.3). So, I'm assuming the
speed is still there.

Rod Pemberton

From: Robbert Haarman on 28 Mar 2010 03:41

On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
>
> "cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hofus4$a0r$1(a)news.albasani.net...
>
> well, a status update:
> 1.94 MB/s is the speed which can be gained with "normal" operation (textual
> interface, preprocessor, jump optimization, ...);
> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
> forces single-pass assembly.
>
>
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>
> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).

To provide another data point:

First, some data from /proc/cpuinfo:

model name : AMD Athlon(tm) Dual Core Processor 5050e
cpu MHz : 2600.000
cache size : 512 KB
bogomips : 5210.11

I did a quick test using the Alchemist code generation library. The
instruction sequence I generated is:

00000000 33C0 xor eax,eax
00000002 40 inc eax
00000003 33DB xor ebx,ebx
00000005 83CB2A or ebx,byte +0x2a
00000008 CD80 int 0x80

for a total of 10 bytes. Doing this 100000000 (a hundred million) times
takes about 4.7 seconds.

Using the same metrics that you provided, that is:

About 200 MB/s
About 100 million opcodes generated per second
About 24 CPU clock cycles per opcode generated
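
For anyone wanting to reproduce the arithmetic behind these three figures, here is a short sketch. It assumes 1 MB = 10^6 bytes, counts each of the five instructions as one opcode, and takes the 2600 MHz from /proc/cpuinfo as the actual clock. The `throughput` function is just a hypothetical helper for this calculation and not part of Alchemist.

```python
def throughput(bytes_per_iter, ops_per_iter, iters, seconds, cpu_hz):
    """Return (MB/s, opcodes/s, cycles/opcode) for a timed generation loop."""
    mb_per_sec = bytes_per_iter * iters / seconds / 1e6
    ops_per_sec = ops_per_iter * iters / seconds
    cycles_per_op = cpu_hz / ops_per_sec
    return mb_per_sec, ops_per_sec, cycles_per_op


mb_s, ops_s, cyc = throughput(
    bytes_per_iter=10,      # size of the generated sequence
    ops_per_iter=5,         # xor, inc, xor, or, int
    iters=100_000_000,
    seconds=4.7,
    cpu_hz=2.6e9,
)
print(f"{mb_s:.0f} MB/s, {ops_s / 1e6:.0f} million ops/s, {cyc:.1f} cycles/op")
```

The unrounded values come out slightly above 200 MB/s and 100 million opcodes per second, consistent with the "about" in the figures above.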

Cheers,

Bob

From: Rod Pemberton on 28 Mar 2010 04:22

"Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
news:20100328074138.GA3467(a)yoda.inglorion.net...
>
> First, some data from /proc/cpuinfo:
>
> model name : AMD Athlon(tm) Dual Core Processor 5050e
> cpu MHz : 2600.000
> cache size : 512 KB
> bogomips : 5210.11
>

Unrelated FYI, your BogoMips should be twice that for that cpu. I suspect
you listed it for _one_ core, as /proc/cpuinfo does. Look in
/var/log/messages to see if your total is twice. It should say both cores
are activated and list the total. I'm really not sure what anyone could use
BogoMips for...

Rod Pemberton

From: Branimir Maksimovic on 28 Mar 2010 04:58

On Sun, 28 Mar 2010 04:22:48 -0400
"Rod Pemberton" <do_not_have(a)havenone.cmm> wrote:

> "Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
> news:20100328074138.GA3467(a)yoda.inglorion.net...
> >
> > First, some data from /proc/cpuinfo:
> >
> > model name : AMD Athlon(tm) Dual Core Processor 5050e
> > cpu MHz : 2600.000
> > cache size : 512 KB
> > bogomips : 5210.11
> >
>
> Unrelated FYI, your BogoMips should be twice that for that cpu. I
> suspect you listed it for _one_ core, as /proc/cpuinfo does. Look in
> /var/log/messages to see if your total is twice. It should say both
> cores are activated and list the total. I'm really not sure what
> anyone could use BogoMips for...
>

Well, actually Linux shows bogomips depending on
BIOS figures, not real figures. For example,
if you set a 400 MHz FSB and multiplier 8,
it will not show 3.2 GHz but 3.6 if your maximum
multiplier is 9.
For the same reason, if you set 400 MHz with auto multiplier
and SpeedStep enabled, it will show 2 GHz when the multiplier
is 6 and 3 GHz when the multiplier is 9,
but the actual clock is 2.4 GHz / 3.6 GHz, not 2 GHz / 3 GHz
as shown.

Greets!

--
http://maxa.homedns.org/

Sometimes online sometimes not
