From: BGB / cr88192 on 28 Mar 2010 01:53

"cr88192" <cr88192(a)hotmail.com> wrote in message news:hofus4$a0r$1(a)news.albasani.net...
> well, this was a recent argument on comp.compilers, but I figured it may
> make some sense in a "freer" context.
>

well, a status update:
1.94 MB/s is the speed which can be gained with "normal" operation (textual interface, preprocessor, jump optimization, ...);
5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and forces single-pass assembly.

10MB/s (analogue) can be gained by using a direct binary interface (newly added).
in the case of this mode, most of the profile time goes into a few predicate functions, and also the function for emitting opcode bytes. somehow, I don't think it is likely to be getting that much faster.

stated another way: 643073 opcodes/second, or about 1.56us/op.
calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU = 2.31 GHz).

basically, I have a personal optimization heuristic: when the top item reported by the profiler is the entry point to a switch statement, it is not likely that all that many more optimizations can be gained (the so-called "switch limit"). a variant of this has happened in this case.

in the binary mode, the test fragment is pre-parsed into an array of struct-pointers, and these structs are used to drive the assembler internals (with pre-resolved opcode numbers, ...).

the fragment has 462 ops and manages to be re-assembled 41758 times before the timer expires (the timer expires after 30s, so 1391 re-assembles/second).

to get any faster would likely involve sidestepping the assembler as well (such as using a big switch and emitting bytes), but this is not something I am going to test (would make about as much sense as benchmarking it against memcpy or similar, since yes, memcpy is faster, but no, it is not an assembler...).

so, at the moment, this means an approx 5x speed difference between the fastest and the slowest modes.
I am not really sure if this is all that drastic of a difference... or such...
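
The figures in the post above can be cross-checked from the raw counts it gives (462 ops per fragment, 41758 passes, a 30 s timer, a 2.31 GHz CPU). What follows is only a minimal sketch in Python; the function and variable names are made up for this illustration and are not part of the assembler being discussed.

```python
def throughput(ops_per_pass, passes, seconds, cpu_hz):
    """Derive per-opcode metrics from a timed re-assembly loop."""
    total_ops = ops_per_pass * passes
    ops_per_sec = total_ops / seconds
    us_per_op = 1e6 / ops_per_sec          # microseconds spent per opcode
    cycles_per_op = cpu_hz / ops_per_sec   # CPU clock cycles per opcode
    return ops_per_sec, us_per_op, cycles_per_op


# Numbers quoted in the post: 462 ops, 41758 passes in 30 s, 2.31 GHz CPU.
ops_per_sec, us_per_op, cycles_per_op = throughput(462, 41758, 30.0, 2.31e9)
print(f"{ops_per_sec:.0f} ops/s, {us_per_op:.2f} us/op, {cycles_per_op:.0f} cycles/op")
```

Computed without intermediate rounding this gives roughly 3592 cycles per opcode; the 3604 in the post is what results from multiplying the rounded 1.56 us by 2.31 GHz. The order of magnitude is the same either way.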
From: Rod Pemberton on 28 Mar 2010 03:16

"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message news:homqsi$s25$1(a)news.albasani.net...
> [...]
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>

A few years ago, I posted the link below for large single file programs (talking to you...). I'm not sure if you ever looked at their file sizes, but the largest two were gcc as a single file and an ogg encoder as a single file, at 3.2MB and 1.7MB respectively. Those are probably the largest single file C programs you'll see. It's possible, even likely, some multi-file project, say the Linux kernel etc., is larger. But, 10MB/s should still be very good for most uses. But, there's no reason to stop there, if you've got the time!

http://people.csail.mit.edu/smcc/projects/single-file-programs/

> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).

BTW, what brand of cpu, and what number of cores are being used?

> to get any faster would likely involve sidestepping the assembler as well
> (such as using a big switch and emitting bytes), but this is not something I
> am going to test (would make about as much sense as benchmarking it against
> memcpy or similar, since yes, memcpy is faster, but no, it is not an
> assembler...).

OpenWatcom is (or was) one of the fastest C compilers I've used. It skipped emitting assembly. Given the speed, I'm sure they did much more than that... It might provide a reference point for a speed comparison. I haven't used more recent versions (I'm using v1.3). So, I'm assuming the speed is still there.

Rod Pemberton
From: Robbert Haarman on 28 Mar 2010 03:41

On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
>
> "cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hofus4$a0r$1(a)news.albasani.net...
>
> well, a status update:
> 1.94 MB/s is the speed which can be gained with "normal" operation (textual
> interface, preprocessor, jump optimization, ...);
> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
> forces single-pass assembly.
>
>
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>
> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).

To provide another data point:

First, some data from /proc/cpuinfo:

model name : AMD Athlon(tm) Dual Core Processor 5050e
cpu MHz    : 2600.000
cache size : 512 KB
bogomips   : 5210.11

I did a quick test using the Alchemist code generation library. The instruction sequence I generated is:

00000000  33C0      xor eax,eax
00000002  40        inc eax
00000003  33DB      xor ebx,ebx
00000005  83CB2A    or ebx,byte +0x2a
00000008  CD80      int 0x80

for a total of 10 bytes. Doing this 100000000 (a hundred million) times takes about 4.7 seconds.

Using the same metrics that you provided, that is:

About 200 MB/s
About 100 million opcodes generated per second
About 24 CPU clock cycles per opcode generated

Cheers,

Bob
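
For reference, the 10-byte sequence above is small enough to encode by hand. The sketch below is not Alchemist's API, which is not shown in the thread. It is a hypothetical Python illustration of emitting bytes directly from pre-parsed operation records, using the standard 32-bit x86 encodings. The op names, tuple layout and helper functions are all invented for the example. It ends by applying the same throughput arithmetic to the quoted timing (a hundred million runs in about 4.7 s on a 2.6 GHz CPU).

```python
# Register numbers as used in the x86 ModRM byte (32-bit registers).
EAX, EBX = 0, 3


def modrm_reg(reg, rm):
    """ModRM byte for register-direct addressing (mod = 0b11)."""
    return 0xC0 | (reg << 3) | rm


def emit(op, out):
    """Append the encoding of one pre-parsed op tuple to the bytearray `out`."""
    kind = op[0]
    if kind == "xor_rr":        # 33 /r : XOR r32, r/m32
        out.extend([0x33, modrm_reg(op[1], op[2])])
    elif kind == "inc_r":       # 40+rd : INC r32 (32-bit mode encoding)
        out.append(0x40 + op[1])
    elif kind == "or_r_imm8":   # 83 /1 ib : OR r/m32, sign-extended imm8
        out.extend([0x83, modrm_reg(1, op[1]), op[2] & 0xFF])
    elif kind == "int":         # CD ib : INT imm8
        out.extend([0xCD, op[1]])
    else:
        raise ValueError(f"unsupported op: {kind}")


# The five instructions from the listing, as hypothetical pre-parsed records.
PROGRAM = [
    ("xor_rr", EAX, EAX),
    ("inc_r", EAX),
    ("xor_rr", EBX, EBX),
    ("or_r_imm8", EBX, 0x2A),
    ("int", 0x80),
]


def assemble(program):
    out = bytearray()
    for op in program:
        emit(op, out)
    return bytes(out)


code = assemble(PROGRAM)
print(code.hex())

# Throughput implied by the quoted timing (values taken from the post).
runs, seconds, cpu_hz = 100_000_000, 4.7, 2.6e9
mb_per_sec = len(code) * runs / seconds / 1e6
ops_per_sec = len(PROGRAM) * runs / seconds
cycles_per_op = cpu_hz / ops_per_sec
print(f"{mb_per_sec:.0f} MB/s, {ops_per_sec / 1e6:.0f} M ops/s, {cycles_per_op:.1f} cycles/op")
```

The emitted bytes match the listing, and the derived figures (about 213 MB/s, about 106 million opcodes per second, about 24 cycles per opcode) agree with the rounded numbers in the post. A Python loop like this will of course run far slower than a native code generator; the sketch only illustrates the encoding and the arithmetic.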
From: Rod Pemberton on 28 Mar 2010 04:22

"Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message news:20100328074138.GA3467(a)yoda.inglorion.net...
>
> First, some data from /proc/cpuinfo:
>
> model name : AMD Athlon(tm) Dual Core Processor 5050e
> cpu MHz : 2600.000
> cache size : 512 KB
> bogomips : 5210.11
>

Unrelated FYI, your BogoMips should be twice that for that cpu. I suspect you listed it for _one_ core, as /proc/cpuinfo does. Look in /var/log/messages to see if your total is twice. It should say both cores are activated and list the total. I'm really not sure what anyone could use BogoMips for...

Rod Pemberton
From: Branimir Maksimovic on 28 Mar 2010 04:58
On Sun, 28 Mar 2010 04:22:48 -0400 "Rod Pemberton" <do_not_have(a)havenone.cmm> wrote:

> "Robbert Haarman" <comp.lang.misc(a)inglorion.net> wrote in message
> news:20100328074138.GA3467(a)yoda.inglorion.net...
> >
> > First, some data from /proc/cpuinfo:
> >
> > model name : AMD Athlon(tm) Dual Core Processor 5050e
> > cpu MHz : 2600.000
> > cache size : 512 KB
> > bogomips : 5210.11
> >
>
> Unrelated FYI, your BogoMips should be twice that for that cpu. I
> suspect you listed it for _one_ core, as /proc/cpuinfo does. Look in
> /var/log/messages to see if your total is twice. It should say both
> cores are activated and list the total. I'm really not sure what
> anyone could use BogoMips for...
>

Well, actually Linux shows bogomips based on the BIOS figures, not the real figures. For example, if you set a 400 MHz FSB and a multiplier of 8, it will not show 3.2 GHz but 3.6 GHz if your max multiplier is 9. For the same reason, if you set a 400 MHz FSB with auto multiplier and SpeedStep enabled, it will show 2 GHz when the multiplier is 6 and 3 GHz when the multiplier is 9, but the actual clock is 2.4 GHz / 3.6 GHz, not 2 GHz / 3 GHz as shown.

Greets!

--
http://maxa.homedns.org/
Sometimes online sometimes not