From: James Harris on
On 27 May, 19:11, Frank Kotler <fbkot...(a)myfairpoint.net> wrote:

....

> What does this do? As I read it, if you haven't got "PProPII" defined,
> it doesn't do squat. In fact, you don't appear to do "cpuid" at all, if
> this isn't defined. You probably do want "cpuid", if your machine
> supports it - using Tasm, ya never know. :)
>
> I sympathize with your problems. My attempts to time anything have been
> "inconclusive".

I'm not sure I really follow the ongoing discussion, but on the above
point I also got very inconsistent results when timing anything.
They didn't get resolved until I ditched cpuid. The instruction itself
took 124 cycles on one machine and appeared to contribute to the
inconsistencies, if not cause them.

I've had far better results from using xchg reg, mem on Intel. (On AMD
rdtscp can be used instead of rdtsc.)

> If you get this thing working, time "das" for me! I've
> got some code... very similar to what you've got here... which "works",
> I guess, but gives some weird results. "das" is all over the place.
> "push eax"/"pop eax" executes faster if you do it twice than if you do
> it once. Things like that make me wonder if there's a bug in the code
> that I haven't found, or if my PIV is just weird.

Try xchg reg, mem. Here are my timings for das on a Pentium M. Because
das really only makes sense following a subtraction I timed

sub al, 1
das

and got the following results

Reps  Cycles
----  ------
   1       1
   2       3
   3       3
   4       4
   5       5
   6       5
   7       6
   8       8
   9       9
  10       8

I'd say that works out to about 1 cycle per sub-das pair. Timings with
das on its own show that up to three can execute in one cycle on this
CPU, with the rate decreasing slightly as higher counts are attempted.
But such long runs of consecutive das operations are meaningless anyway!
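
As a rough cross-check of that estimate, a least-squares slope through the
Reps/Cycles table gives the marginal cost of each extra sub/das pair. This
is only a sketch in Python, not part of the original measurements; the
variable names are illustrative.

```python
# Reps -> cycles, copied from the table in this post.
reps = list(range(1, 11))
cycles = [1, 3, 3, 4, 5, 5, 6, 8, 9, 8]

n = len(reps)
mean_x = sum(reps) / n
mean_y = sum(cycles) / n

# Ordinary least-squares slope: cycles added per extra sub/das pair.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reps, cycles))
sxx = sum((x - mean_x) ** 2 for x in reps)
slope = sxy / sxx

print(f"{slope:.2f} cycles per sub/das pair")
```

The slope comes out a little under one cycle per pair, which is consistent
with the "about 1 cycle" reading of the table.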

James
From: Rob on

I've been playing with a timing framework off and on (sort of a port of
Yodel from asmcommunity.net) using the Linux timer syscall. It still
needs a bit of work, but if anyone wants to take a look (comments,
improvements and bug reports are more than welcome) I have
posted it at http://70.24.1.178/Forums/ala/timer.tar.bz2
I included the executable, but it should also build from source if the
directory structure is maintained (although you might have to move
linux.inc to your local include path for FASM). Increasing the time delay
used for calculating the CPU clock speed could improve the accuracy too,
though.

It seems though, that the P4 is kind of the odd machine out - the AMD
handles it much better and seems to have fairly consistent timings even
with varying input to das.


Here's the results from my PIV:

> ## Test parameters: 10000000 iterations.
>
> /] Running performance tests: Intel(R) Pentium(R) 4 CPU 2.40GHz Processor @ 2400 MHz.
> Reference Procedure timing took 0.050498500s = 12.119 cycles/iteration
> das test --> 0.371625844s = 89.190 cycles/iteration
> das2 test --> 0.444781053s = 106.747 cycles/iteration
> Ultrano Test --> 0.395520856s = 94.925 cycles/iteration

And I tried it on an AMD:

> /] Running performance tests: AMD Athlon(tm) XP 2100+ Processor @ 1737 MHz.
> Reference Procedure timing took 0.034755937s = 6.037 cycles/iteration
> das test --> 0.035735015s = 6.207 cycles/iteration
> das2 test --> 0.034670268s = 6.022 cycles/iteration
> Ultrano Test --> 0.121370968s = 21.082 cycles/iteration
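
The cycles/iteration figures in these listings follow from the elapsed
time, the measured clock speed and the iteration count. Here is a minimal
Python sketch of that arithmetic. It assumes the "Reference Procedure" is
an empty routine that captures the framework's call and loop overhead (the
post does not say so explicitly), and `cycles_per_iteration` is a
hypothetical helper name, not something from the framework.

```python
def cycles_per_iteration(seconds, mhz, iterations):
    """Convert elapsed wall time into clock cycles per loop iteration."""
    return seconds * mhz * 1e6 / iterations

ITERATIONS = 10_000_000

# (reference seconds, "das test" seconds, clock in MHz), from the listings.
runs = {
    "Pentium 4": (0.050498500, 0.371625844, 2400),
    "Athlon XP": (0.034755937, 0.035735015, 1737),
}

for cpu, (ref_s, das_s, mhz) in runs.items():
    ref = cycles_per_iteration(ref_s, mhz, ITERATIONS)
    das = cycles_per_iteration(das_s, mhz, ITERATIONS)
    # Net cost: subtract the (assumed) overhead measured by the reference.
    print(f"{cpu}: reference {ref:.2f}, das {das:.2f}, net {das - ref:.2f}")
```

Under that assumption the net cost of the das test is roughly 77 cycles
per iteration on the P4 but only a fraction of a cycle on the Athlon XP.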

Here's the code I timed:

> time1_name db "das test",0
> align 16
> time1:
> mov eax,10
> das
> retn
>
> time2_name db "das2 test",0
> align 16
> time2:
> mov eax,10 shl 8
> das
> retn
>
> time3_name db "Ultrano Test",0
> align 16
> time3:
> mov ax,10 shl 8
> aad
> aam
> mov al,8
> sub al,3
> xchg ah,al
> aad
> retn

From: Nathan on
On May 30, 6:35 pm, Rob <junkma...(a)lavabit.com> wrote:
>
> I've been playing with a timing framework off and on (sort of a port of
> Yodel from asmcommunity.net) using the Linux timer syscall.  It's still
> got a bit of work left on it, but if anyone wants to take a look (and
> comments/improvements/bug identifications are more than welcome) I have
> posted it at http://70.24.1.178/Forums/ala/timer.tar.bz2
> I included the executable, but if the directory structure is maintained
> (although you might have to move the linux.inc to your local path for
> FASM).  Increasing the time delay for calculating the CPU could improve
> the accuracy too though.
>

Interesting code. Here are results from an Atom-powered netbook:

######## Yodel (sort of) Linux Port version 0.1, 2010/03/20
## Calculating clockspeed...
(Your computer might temporarily appear frozen as process priority is
being boosted to level 99)

## Test parameters: 10000000 iterations.

/] Running performance tests: Intel(R) Atom(TM) CPU N270 @ 1.60GHz Processor @ 1596 MHz.
Reference Procedure timing took 0.088939013s = 14.194 cycles/iteration
das test --> 0.075400169s = 12.033 cycles/iteration
das2 test --> 0.076278460s = 12.174 cycles/iteration
Ultrano Test --> 0.207771496s = 33.160 cycles/iteration

Nathan.
From: Branimir Maksimovic on
On Sun, 30 May 2010 05:16:03 -0400
Frank Kotler <fbkotler(a)myfairpoint.net> wrote:

> Okay... It's *your* code (with Nathan's changes)...
>
> ; fasm myprog.asm
> ;
> ; from Branimir Maksimovic
> ; bugfixes from Nathan Baker
> ; cruft from fbk :)
>
> format ELF executable
>
> segment writeable executable
>
> entry $
>
> ; five bytes here changes the timing
> ;mov ebx, xtbl
>
> ;nop
> ;nop
> ;nop
> ;nop
> ;nop
> ;nop ; six bytes changes it back
>
> mov ecx,16
> l1:
> push ecx
>
> ; serialize CPU and get start time
> cpuid
> rdtsc
> push edx
> push eax
>
> ; code to be timed
> ;--------------
> ;das
> ;push eax
> ;pop eax
> ;push eax
> ;pop eax
> ;--------------
>
> ; serialize cpu and get end time
> cpuid
> rdtsc
>
> ; calculate difference
> pop ebx
> sub eax, ebx
> pop ecx
> sub edx, ecx
>
> ; convert number to text
> mov edi, ascbuf
> call u64toha
>
> ; print it
> mov ecx, ascbuf
> mov edx, 17
> mov ebx, 1
> mov eax, 4
> int 80h
>
> ; do more
> pop ecx
> loop l1
>
> exit:
> mov eax, 1
> mov ebx,0
> int 80h
>
> xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
> 43h,44h,45h,46h
>
> ; I changed the name of this - 'd' implied "decimal"...
> u64toha:
> add edi, 15
> mov ebx,xtbl
> mov cl, 16
> std
> l2:
> mov ch,al
> and al,0xf
> xlatb
> stosb
> mov al,ch
> ; shrd edx,eax,4
> shrd eax,edx,4
> shr edx, 4
> dec cl
> jz e1
> ; mov byte[edi], ','
> ; inc edi
> jmp l2
>
> e1:
> cld
> ret
>
> ascbuf db 17 dup (0xa)
> ;---------------------------
>
> My output from this is "21C" (with a bunch of zeros in front). With
> the "five byte padding" uncommented, it goes to "220". All we're
> "timing" is push edx/push eax/cpuid... is cpuid sensitive to
> alignment??? I would expect that if five bytes changes it, one byte
> would, too - but it doesn't (your mileage may vary)...

bmaxa(a)maxa:~/fasm/test$ fasm ttest.asm
flat assembler version 1.68 (16384 kilobytes memory)
2 passes, 236 bytes.
bmaxa(a)maxa:~/fasm/test$ ./ttest
000000000000017A
0000000000000183
0000000000000183
0000000000000183
000000000000017A
000000000000017A
000000000000017A
000000000000017A
000000000000017A
000000000000017A
000000000000017A
0000000000000183
000000000000017A
000000000000017A
0000000000000183
000000000000017A
bmaxa(a)maxa:~/fasm/test$ cat ttest.asm
; fasm myprog.asm
;
; from Branimir Maksimovic
; bugfixes from Nathan Baker
; cruft from fbk :)

format ELF executable

segment writeable executable

entry $

mov ebx, xtbl

nop
nop
nop
nop
nop
nop

mov ecx,16
l1:
push ecx

cpuid
rdtsc
push edx
push eax

das
push eax
pop eax
push eax
pop eax

cpuid
rdtsc

pop ebx
sub eax, ebx
pop ecx
sub edx, ecx

mov edi, ascbuf
call u64toha

mov ecx, ascbuf
mov edx, 17
mov ebx, 1
mov eax, 4
int 80h

pop ecx
loop l1

exit:
mov eax, 1
mov ebx,0
int 80h

xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
43h,44h,45h,46h

u64toha:
add edi, 15
mov ebx,xtbl
mov cl, 16
std
l2:
mov ch,al
and al,0xf
xlatb
stosb
mov al,ch
; shrd edx,eax,4
shrd eax,edx,4
shr edx, 4
dec cl
jz e1
; mov byte[edi], ','
; inc edi
jmp l2

e1:
cld
ret

ascbuf db 17 dup (0xa)
bmaxa(a)maxa:~/fasm/test$
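
One thing worth noting about the listing: the difference is computed with
`sub eax, ebx` followed by `sub edx, ecx`, so a borrow out of the low half
is never carried into the high half. That only matters when the low 32
bits of the TSC wrap between the two reads. The high dword of every value
printed here is zero, so these samples are not affected, but
`sbb edx, ecx` is the usual way to propagate the borrow (the intervening
`pop` does not change the flags). The Python sketch below only models the
arithmetic; the function names and the starting TSC value are made up for
illustration.

```python
MASK32 = 0xFFFFFFFF

def diff_sub_sub(start, end):
    """Model `sub eax, ebx` / `sub edx, ecx`: the high half ignores the borrow."""
    lo = ((end & MASK32) - (start & MASK32)) & MASK32
    hi = ((end >> 32) - (start >> 32)) & MASK32
    return (hi << 32) | lo

def diff_sub_sbb(start, end):
    """Model `sub eax, ebx` / `sbb edx, ecx`: the borrow propagates."""
    borrow = 1 if (end & MASK32) < (start & MASK32) else 0
    lo = ((end & MASK32) - (start & MASK32)) & MASK32
    hi = ((end >> 32) - (start >> 32) - borrow) & MASK32
    return (hi << 32) | lo

# Hypothetical TSC reading just below a 2**32 boundary, then 0x17A cycles later.
start = 0x00000001_FFFFFF00
end = start + 0x17A

print(f"{diff_sub_sub(start, end):016X}")  # high dword is off by one
print(f"{diff_sub_sbb(start, end):016X}")  # true difference
```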

> Best,
> Frank
>

Cheers!

--
http://maxa.homedns.org/

Sometimes online sometimes not

Everyone is "allowed" to be an idiot and dreadful, but only some choose it,


From: Branimir Maksimovic on
On Mon, 31 May 2010 13:25:29 +0200
Branimir Maksimovic <bmaxa(a)hotmail.com> wrote:

> On Sun, 30 May 2010 05:16:03 -0400
> Frank Kotler <fbkotler(a)myfairpoint.net> wrote:
>
> >
> > My output from this is "21C" (with a bunch of zeros in front). With
> > the "five byte padding" uncommented, it goes to "220". All we're
> > "timing" is push edx/push eax/cpuid... is cpuid sensitive to
> > alignment??? I would expect that if five bytes changes it, one byte
> > would, too - but it doesn't (your mileage may vary)...

bmaxa(a)maxa:~/fasm/test$ fasm ttest.asm
flat assembler version 1.68 (16384 kilobytes memory)
2 passes, 230 bytes.
bmaxa(a)maxa:~/fasm/test$ ./ttest
000000000000019E
00000000000001A7
000000000000019E
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
000000000000019E
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
00000000000001A7
bmaxa(a)maxa:~/fasm/test$ cat ttest.asm
; fasm myprog.asm
;
; from Branimir Maksimovic
; bugfixes from Nathan Baker
; cruft from fbk :)

format ELF executable

segment writeable executable

entry $

mov ebx, xtbl

;nop
;nop
;nop
;nop
;nop
;nop

mov ecx,16
l1:
push ecx

cpuid
rdtsc
push edx
push eax

das
push eax
pop eax
push eax
pop eax

cpuid
rdtsc

pop ebx
sub eax, ebx
pop ecx
sub edx, ecx

mov edi, ascbuf
call u64toha

mov ecx, ascbuf
mov edx, 17
mov ebx, 1
mov eax, 4
int 80h

pop ecx
loop l1

exit:
mov eax, 1
mov ebx,0
int 80h

xtbl db 30h,31h,32h,33h,34h,35h,36h,37h,38h,39h,41h,42h, \
43h,44h,45h,46h

u64toha:
add edi, 15
mov ebx,xtbl
mov cl, 16
std
l2:
mov ch,al
and al,0xf
xlatb
stosb
mov al,ch
; shrd edx,eax,4
shrd eax,edx,4
shr edx, 4
dec cl
jz e1
; mov byte[edi], ','
; inc edi
jmp l2

e1:
cld
ret

ascbuf db 17 dup (0xa)
bmaxa(a)maxa:~/fasm/test$
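
For easier comparison, the hex counts from the two ./ttest runs (236 bytes
with the six nops, 230 bytes with them commented out) can be converted to
decimal and tallied. This is just a Python sketch over the numbers printed
in the thread; `tally` is a made-up helper name.

```python
from collections import Counter

# Hex cycle counts printed by ./ttest in the two runs.
with_nops = ("17A 183 183 183 17A 17A 17A 17A "
             "17A 17A 17A 183 17A 17A 183 17A").split()
without_nops = ("19E 1A7 19E 1A7 1A7 1A7 1A7 1A7 "
                "1A7 1A7 19E 1A7 1A7 1A7 1A7 1A7").split()

def tally(samples):
    """Return {decimal cycle count: occurrences} for a list of hex strings."""
    return dict(Counter(int(s, 16) for s in samples))

for label, samples in (("with nops", with_nops), ("without nops", without_nops)):
    for cycles, hits in sorted(tally(samples).items()):
        print(f"{label}: {cycles} cycles x {hits}")
```

Each run flips between two values 9 cycles apart, and the padded build
reads 36 cycles lower than the unpadded one. These totals still include
the cost of the second cpuid, so they are not the cost of the timed
instructions alone.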

Greets!
