From: James Harris on 30 May 2010 12:01

On 27 May, 19:11, Frank Kotler <fbkot...(a)myfairpoint.net> wrote:
....
> What does this do? As I read it, if you haven't got "PProPII" defined,
> it doesn't do squat. In fact, you don't appear to do "cpuid" at all, if
> this isn't defined. You probably do want "cpuid", if your machine
> supports it - using Tasm, ya never know. :)
>
> I sympathize with your problems. My attempts to time anything have been
> "inconclusive".

I'm not sure I really follow the ongoing discussion but on the above
point I also found very inconsistent results with timing anything.
They didn't get resolved until I ditched cpuid. It itself took 124
cycles on one machine and appeared to contribute to the
inconsistencies, if not cause them.

I've had far better results from using xchg reg, mem on Intel. (On AMD
rdtscp can be used instead of rdtsc.)

> If you get this thing working, time "das" for me! I've
> got some code... very similar to what you've got here... which "works",
> I guess, but gives some weird results. "das" is all over the place.
> "push eax"/"pop eax" executes faster if you do it twice than if you do
> it once. Things like that make me wonder if there's a bug in the code
> that I haven't found, or if my PIV is just weird.

Try xchg reg, mem. Here are my timings for das on a Pentium M. Because
das really only makes sense following a subtraction I timed

    sub al, 1
    das

and got the following results

    Reps  Cycles
    ----  ------
      1      1
      2      3
      3      3
      4      4
      5      5
      6      5
      7      6
      8      8
      9      9
     10      8

I'd say that works out about 1 cycle per sub-das pair. Timings with
das on its own show up to three can execute in one cycle on this CPU,
decreasing slightly as higher numbers are attempted. But such numbers
of consecutive das operations are meaningless anyway!

James
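The "about 1 cycle per sub-das pair" estimate can be cross-checked by fitting a straight line through the Reps/Cycles table. The following is only a sketch: the data is copied from the table above, while the function name `least_squares_slope` and the choice of an ordinary least-squares fit are assumptions made for illustration, not something James describes doing.

```python
# Reps/Cycles pairs copied from the table in the post above.
reps = list(range(1, 11))
cycles = [1, 3, 3, 4, 5, 5, 6, 8, 9, 8]


def least_squares_slope(xs, ys):
    """Return the slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Slope = extra cycles per additional sub/das pair.
slope = least_squares_slope(reps, cycles)
print(f"about {slope:.2f} cycles per sub/das pair")  # prints "about 0.82 ..."
```

The fitted slope comes out a little under one cycle per pair, which is consistent with the rough figure quoted in the post given how noisy the individual readings are.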
From: Rob on 30 May 2010 18:35

James Harris wrote:
> On 27 May, 19:11, Frank Kotler <fbkot...(a)myfairpoint.net> wrote:
>
> ....
>
> I'm not sure I really follow the ongoing discussion but on the above
> point I also found very inconsistent results with timing anything.
> They didn't get resolved until I ditched cpuid. It itself took 124
> cycles on one machine and appeared to contribute to the
> inconsistencies, if not cause them.
>
> I've had far better results from using xchg reg, mem on Intel. (On AMD
> rdtscp can be used instead of rdtsc.)
>
>> If you get this thing working, time "das" for me! I've
>> got some code... very similar to what you've got here... which "works",
>> I guess, but gives some weird results. "das" is all over the place.
>> "push eax"/"pop eax" executes faster if you do it twice than if you do
>> it once. Things like that make me wonder if there's a bug in the code
>> that I haven't found, or if my PIV is just weird.
>
> Try xchg reg, mem. Here are my timings for das on a Pentium M. Because
> das really only makes sense following a subtraction I timed
>
>     sub al, 1
>     das
>
> and got the following results
>
>     Reps  Cycles
>     ----  ------
>       1      1
>       2      3
>       3      3
>       4      4
>       5      5
>       6      5
>       7      6
>       8      8
>       9      9
>      10      8
>
> I'd say that works out about 1 cycle per sub-das pair. Timings with
> das on its own show up to three can execute in one cycle on this CPU,
> decreasing slightly as higher numbers are attempted. But such numbers
> of consecutive das operations are meaningless anyway!
>
> James

I've been playing with a timing framework off and on (sort of a port of
Yodel from asmcommunity.net) using the Linux timer syscall. It's still
got a bit of work left on it, but if anyone wants to take a look (and
comments/improvements/bug identifications are more than welcome) I have
posted it at http://70.24.1.178/Forums/ala/timer.tar.bz2
I included the executable, but if the directory structure is maintained
(although you might have to move the linux.inc to your local path for
FASM). Increasing the time delay for calculating the CPU could improve
the accuracy too though.

It seems though, that the P4 is kind of the odd machine out - the AMD
handles it much better and seems to have fairly consistent timings even
with varying input to das.

Here's the results from my PIV:

> ## Test parameters: 10000000 iterations.
>
> /] Running performance tests: Intel(R) Pentium(R) 4 CPU 2.40GHz Processor @ 2400 MHz.
> Reference Procedure timing took 0.050498500s = 12.119 cycles/iteration
> das test --> 0.371625844s = 89.190 cycles/iteration
> das2 test --> 0.444781053s = 106.747 cycles/iteration
> Ultrano Test --> 0.395520856s = 94.925 cycles/iteration

And I tried it on an AMD:

> /] Running performance tests: AMD Athlon(tm) XP 2100+ Processor @ 1737 MHz.
> Reference Procedure timing took 0.034755937s = 6.037 cycles/iteration
> das test --> 0.035735015s = 6.207 cycles/iteration
> das2 test --> 0.034670268s = 6.022 cycles/iteration
> Ultrano Test --> 0.121370968s = 21.082 cycles/iteration

Here's the code I timed:

> time1_name db "das test",0
> align 16
> time1:
>     mov eax,10
>     das
>     retn
>
> time2_name db "das2 test",0
> align 16
> time2:
>     mov eax,10 shl 8
>     das
>     retn
>
> time3_name db "Ultrano Test",0
> align 16
> time3:
>     mov ax,10 shl 8
>     aad
>     aam
>     mov al,8
>     sub al,3
>     xchg ah,al
>     aad
>     retn
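The cycles/iteration column in the output above follows from the elapsed time, the reported clock speed and the iteration count. Here is a minimal sketch of that arithmetic using the Pentium 4 figures. The helper name `cycles_per_iteration` is hypothetical, and subtracting the reference procedure to get a "net" figure is an assumption about how the numbers might be read, not something the framework is known to do.

```python
def cycles_per_iteration(seconds, mhz, iterations):
    """Convert an elapsed wall-clock time into clock cycles per iteration."""
    return seconds * mhz * 1_000_000 / iterations


ITERATIONS = 10_000_000  # "Test parameters" line in the posted output
P4_MHZ = 2400            # clock speed reported for the Pentium 4

# (label, seconds) pairs copied from the Pentium 4 output above.
p4 = [
    ("Reference", 0.050498500),
    ("das test", 0.371625844),
    ("das2 test", 0.444781053),
    ("Ultrano Test", 0.395520856),
]

reference = cycles_per_iteration(p4[0][1], P4_MHZ, ITERATIONS)
for label, seconds in p4:
    total = cycles_per_iteration(seconds, P4_MHZ, ITERATIONS)
    # Assumption: treat the reference procedure as pure call overhead.
    net = total - reference
    print(f"{label:13s} {total:8.3f} cycles total, {net:8.3f} net of reference")
```

The same helper applied to the Athlon XP line (0.035735015 s at 1737 MHz) gives roughly 6.2 cycles, matching the posted value, so the posted column appears to be the total rather than a net figure.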
From: Nathan on 31 May 2010 01:47

On May 30, 6:35 pm, Rob <junkma...(a)lavabit.com> wrote:
>
> I've been playing with a timing framework off and on (sort of a port of
> Yodel from asmcommunity.net) using the Linux timer syscall. It's still
> got a bit of work left on it, but if anyone wants to take a look (and
> comments/improvements/bug identifications are more than welcome) I have
> posted it at http://70.24.1.178/Forums/ala/timer.tar.bz2
> I included the executable, but if the directory structure is maintained
> (although you might have to move the linux.inc to your local path for
> FASM). Increasing the time delay for calculating the CPU could improve
> the accuracy too though.
>

Interesting code. Here are results from an Atom-powered netbook:

######## Yodel (sort of) Linux Port version 0.1, 2010/03/20
## Calculating clockspeed... (Your computer might temporarily appear
frozen as process priority is being boosted to level 99)
## Test parameters: 10000000 iterations.

/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz Processor @ 1596 MHz.
Reference Procedure timing took 0.088939013s = 14.194 cycles/iteration
das test --> 0.075400169s = 12.033 cycles/iteration
das2 test --> 0.076278460s = 12.174 cycles/iteration
Ultrano Test --> 0.207771496s = 33.160 cycles/iteration

Nathan.
From: Branimir Maksimovic on 31 May 2010 07:25

On Sun, 30 May 2010 05:16:03 -0400
Frank Kotler <fbkotler(a)myfairpoint.net> wrote:

> Okay... It's *your* code (with Nathan's changes)...
>
> ; fasm myprog.asm
> ;
> ; from Branimir Maksimovic
> ; bugfixes from Nathan Baker
> ; cruft from fbk :)
>
> format ELF executable
>
> segment writeable executable
>
> entry $
>
> ; five bytes here changes the timing
> ;mov ebx, xtbl
>
> ;nop
> ;nop
> ;nop
> ;nop
> ;nop
> ;nop ; six bytes changes it back
>
>     mov ecx,16
> l1:
>     push ecx
>
> ; serialize CPU and get start time
>     cpuid
>     rdtsc
>     push edx
>     push eax
>
> ; code to be timed
> ;--------------
> ;das
> ;push eax
> ;pop eax
> ;push eax
> ;pop eax
> ;--------------
>
> ; serialize cpu and get end time
>     cpuid
>     rdtsc
>
> ; calculate difference
>     pop ebx
>     sub eax, ebx
>     pop ecx
>     sub edx, ecx
>
> ; convert number to text
>     mov edi, ascbuf
>     call u64toha
>
> ; print it
>     mov ecx, ascbuf
>     mov edx, 17
>     mov ebx, 1
>     mov eax, 4
>     int 80h
>
> ; do more
>     pop ecx
>     loop l1
>
> exit:
>     mov eax, 1
>     mov ebx,0
>     int 80h
>
> xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
>         43h,44h,45h,46h
>
> ; I changed the name of this - 'd' implied "decimal"...
> u64toha:
>     add edi, 15
>     mov ebx,xtbl
>     mov cl, 16
>     std
> l2:
>     mov ch,al
>     and al,0xf
>     xlatb
>     stosb
>     mov al,ch
> ;   shrd edx,eax,4
>     shrd eax,edx,4
>     shr edx, 4
>     dec cl
>     jz e1
> ;   mov byte[edi], ','
> ;   inc edi
>     jmp l2
>
> e1:
>     cld
>     ret
>
> ascbuf db 17 dup (0xa)
> ;---------------------------
>
> My output from this is "21C" (with a bunch of zeros in front). With
> the "five byte padding" uncommented, it goes to "220". All we're
> "timing" is push edx/push eax/cpuid... is cpuid sensitive to
> alignment??? I would expect that if five bytes changes it, one byte
> would, too - but it doesn't (your mileage may vary)...

bmaxa(a)maxa:~/fasm/test$ fasm ttest.asm
flat assembler version 1.68 (16384 kilobytes memory)
2 passes, 236 bytes.
bmaxa(a)maxa:~/fasm/test$ ./ttest
000000000000017A
0000000000000183
0000000000000183
0000000000000183
000000000000017A
000000000000017A
000000000000017A
000000000000017A
000000000000017A
000000000000017A
000000000000017A
0000000000000183
000000000000017A
000000000000017A
0000000000000183
000000000000017A
bmaxa(a)maxa:~/fasm/test$ cat ttest.asm
; fasm myprog.asm
;
; from Branimir Maksimovic
; bugfixes from Nathan Baker
; cruft from fbk :)

format ELF executable

segment writeable executable

entry $

    mov ebx, xtbl

    nop
    nop
    nop
    nop
    nop
    nop

    mov ecx,16
l1:
    push ecx

    cpuid
    rdtsc
    push edx
    push eax

    das
    push eax
    pop eax
    push eax
    pop eax

    cpuid
    rdtsc

    pop ebx
    sub eax, ebx
    pop ecx
    sub edx, ecx

    mov edi, ascbuf
    call u64toha

    mov ecx, ascbuf
    mov edx, 17
    mov ebx, 1
    mov eax, 4
    int 80h

    pop ecx
    loop l1

exit:
    mov eax, 1
    mov ebx,0
    int 80h

xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
        43h,44h,45h,46h

u64toha:
    add edi, 15
    mov ebx,xtbl
    mov cl, 16
    std
l2:
    mov ch,al
    and al,0xf
    xlatb
    stosb
    mov al,ch
;   shrd edx,eax,4
    shrd eax,edx,4
    shr edx, 4
    dec cl
    jz e1
;   mov byte[edi], ','
;   inc edi
    jmp l2

e1:
    cld
    ret

ascbuf db 17 dup (0xa)
bmaxa(a)maxa:~/fasm/test$

> Best,
> Frank
>

Cheers!

--
http://maxa.homedns.org/

Sometimes online sometimes not

Everyone is "allowed" to be an idiot and in the dark, but only some
choose it,
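One detail of the listing is worth a closer look: the difference is computed with `sub eax, ebx` followed by `sub edx, ecx`, so the two 32-bit halves are subtracted independently and a borrow out of the low half is not carried into the high half. The sketch below models both variants in Python to show when that matters. The function names and the sample counter values are made up for illustration, and using `sbb` for the high half is the usual idiom rather than something proposed in the thread.

```python
MASK32 = 0xFFFFFFFF


def diff_sub_sub(start, end):
    """Model `sub eax, ebx` / `sub edx, ecx`: halves subtracted independently."""
    lo = ((end & MASK32) - (start & MASK32)) & MASK32
    hi = ((end >> 32) - (start >> 32)) & MASK32
    return (hi << 32) | lo


def diff_sub_sbb(start, end):
    """Model `sub eax, ebx` / `sbb edx, ecx`: the borrow reaches the high half."""
    lo_raw = (end & MASK32) - (start & MASK32)
    borrow = 1 if lo_raw < 0 else 0
    lo = lo_raw & MASK32
    hi = ((end >> 32) - (start >> 32) - borrow) & MASK32
    return (hi << 32) | lo


# Hypothetical readings where the low 32 bits wrap between the two rdtsc calls.
start = 0x00000001_FFFFFF00
end = 0x00000002_00000080

print(hex(diff_sub_sub(start, end)))  # 0x100000180, off by 2**32
print(hex(diff_sub_sbb(start, end)))  # 0x180, the true difference
```

For intervals of a few hundred cycles the low half wraps only rarely, so the posted numbers are unlikely to be affected, but an occasional reading with a 1 in the ninth hex digit from the right would point to this.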
From: Branimir Maksimovic on 31 May 2010 07:34

On Mon, 31 May 2010 13:25:29 +0200
Branimir Maksimovic <bmaxa(a)hotmail.com> wrote:

> On Sun, 30 May 2010 05:16:03 -0400
> Frank Kotler <fbkotler(a)myfairpoint.net> wrote:
>
> >
> > My output from this is "21C" (with a bunch of zeros in front). With
> > the "five byte padding" uncommented, it goes to "220". All we're
> > "timing" is push edx/push eax/cpuid... is cpuid sensitive to
> > alignment??? I would expect that if five bytes changes it, one byte
> > would, too - but it doesn't (your mileage may vary)...
>
> bmaxa(a)maxa:~/fasm/test$ fasm ttest.asm
> flat assembler version 1.68 (16384 kilobytes memory)
> 2 passes, 236 bytes.
> bmaxa(a)maxa:~/fasm/test$ ./ttest
> 000000000000017A
> 0000000000000183
> 0000000000000183
> 0000000000000183
> 000000000000017A
> 000000000000017A
> 000000000000017A
> 000000000000017A
> 000000000000017A
> 000000000000017A
> 000000000000017A
> 0000000000000183
> 000000000000017A
> 000000000000017A
> 0000000000000183
> 000000000000017A
> bmaxa(a)maxa:~/fasm/test$ cat ttest.asm
> ; fasm myprog.asm
> ;
> ; from Branimir Maksimovic
> ; bugfixes from Nathan Baker
> ; cruft from fbk :)
>
> format ELF executable
>
> segment writeable executable
>
> entry $
>
>     mov ebx, xtbl
>
>     nop
>     nop
>     nop
>     nop
>     nop
>     nop
>
>     mov ecx,16
> l1:
>     push ecx
>
>     cpuid
>     rdtsc
>     push edx
>     push eax
>
>     das
>     push eax
>     pop eax
>     push eax
>     pop eax
>
>     cpuid
>     rdtsc
>
>     pop ebx
>     sub eax, ebx
>     pop ecx
>     sub edx, ecx
>
>     mov edi, ascbuf
>     call u64toha
>
>     mov ecx, ascbuf
>     mov edx, 17
>     mov ebx, 1
>     mov eax, 4
>     int 80h
>
>     pop ecx
>     loop l1
>
> exit:
>     mov eax, 1
>     mov ebx,0
>     int 80h
>
> xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
>         43h,44h,45h,46h
>
> u64toha:
>     add edi, 15
>     mov ebx,xtbl
>     mov cl, 16
>     std
> l2:
>     mov ch,al
>     and al,0xf
>     xlatb
>     stosb
>     mov al,ch
> ;   shrd edx,eax,4
>     shrd eax,edx,4
>     shr edx, 4
>     dec cl
>     jz e1
> ;   mov byte[edi], ','
> ;   inc edi
>     jmp l2
>
> e1:
>     cld
>     ret
>
> ascbuf db 17 dup (0xa)
> bmaxa(a)maxa:~/fasm/test$
>
> > Best,
> > Frank
> >
>
> Cheers!
>

bmaxa(a)maxa:~/fasm/test$ fasm ttest.asm
flat assembler version 1.68 (16384 kilobytes memory)
2 passes, 230 bytes.
bmaxa(a)maxa:~/fasm/test$ ./ttest
000000000000019E
00000000000001A7
000000000000019E
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
000000000000019E
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
bmaxa(a)maxa:~/fasm/test$ cat ttest.asm
; fasm myprog.asm
;
; from Branimir Maksimovic
; bugfixes from Nathan Baker
; cruft from fbk :)

format ELF executable

segment writeable executable

entry $

    mov ebx, xtbl

;nop
;nop
;nop
;nop
;nop
;nop

    mov ecx,16
l1:
    push ecx

    cpuid
    rdtsc
    push edx
    push eax

    das
    push eax
    pop eax
    push eax
    pop eax

    cpuid
    rdtsc

    pop ebx
    sub eax, ebx
    pop ecx
    sub edx, ecx

    mov edi, ascbuf
    call u64toha

    mov ecx, ascbuf
    mov edx, 17
    mov ebx, 1
    mov eax, 4
    int 80h

    pop ecx
    loop l1

exit:
    mov eax, 1
    mov ebx,0
    int 80h

xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
        43h,44h,45h,46h

u64toha:
    add edi, 15
    mov ebx,xtbl
    mov cl, 16
    std
l2:
    mov ch,al
    and al,0xf
    xlatb
    stosb
    mov al,ch
;   shrd edx,eax,4
    shrd eax,edx,4
    shr edx, 4
    dec cl
    jz e1
;   mov byte[edi], ','
;   inc edi
    jmp l2

e1:
    cld
    ret

ascbuf db 17 dup (0xa)
bmaxa(a)maxa:~/fasm/test$

Greets!
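The two runs are easier to compare once the hex counts are converted to decimal. A small sketch follows, with the sixteen readings of each run copied from the output above and leading zeros dropped. The variable names are made up, and taking the minimum as the representative value is an assumption about how to summarise the runs.

```python
from collections import Counter

# Hex cycle counts copied from the two runs posted above.
with_nops = "17A 183 183 183 17A 17A 17A 17A 17A 17A 17A 183 17A 17A 183 17A"
without_nops = "19E 1A7 19E 1A7 1A7 1A7 1A7 1A7 1A7 1A7 19E 1A7 1A7 1A7 1A7 1A7"


def decode(run):
    """Turn a whitespace-separated string of hex counts into integers."""
    return [int(word, 16) for word in run.split()]


for label, run in (("six nops (236 bytes)", with_nops),
                   ("nops commented out (230 bytes)", without_nops)):
    counts = decode(run)
    tally = Counter(counts)
    print(f"{label}: min {min(counts)}, readings {dict(tally)}")
```

Each run alternates between two values 9 cycles apart (378/387 with the nops, 414/423 without), and the build without the nops is 36 cycles slower at its minimum. Those figures include the cost of the `cpuid` and `rdtsc` instructions themselves, not just the `das` and push/pop sequence.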