From: xiaosi on
I use __rdtsc() [1]. The results include the cost of rdtsc itself plus the call, ret and mov instructions (17 clocks on my AMD32 CPU when everything hits the cache).

[1]
#define WIN32_LEAN_AND_MEAN
#include <windows.h> // SecureZeroMemory
#include <stdio.h>
#include <intrin.h>  // __rdtsc, __stosb

// Baseline: a call that does a single volatile store.
__declspec(noinline) void __stdcall func0() {
    volatile int i = 0;
}

// Variant 1: SecureZeroMemory on an 88-byte buffer.
__declspec(noinline) void __stdcall func1() {
    long si[88/4];
    SecureZeroMemory(si, sizeof(si));
}

// Variant 2: __stosb on a buffer of the same size.
__declspec(noinline) void __stdcall func2() {
    long si[88/4];
    __stosb((unsigned char*)si, 0, sizeof(si));
}

#pragma optimize("gt", on)
#define icount 10
int __cdecl main() {
    int i;
    unsigned long long t, t0[icount], t1[icount], t2[icount];

    // Time each function icount times; every sample includes rdtsc/call overhead.
    for (i = 0; i < icount; i++) {
        t = __rdtsc();
        func0();
        t0[i] = __rdtsc() - t;
    }

    for (i = 0; i < icount; i++) {
        t = __rdtsc();
        func1();
        t1[i] = __rdtsc() - t;
    }

    for (i = 0; i < icount; i++) {
        t = __rdtsc();
        func2();
        t2[i] = __rdtsc() - t;
    }

    // One row per iteration, one column per function.
    printf("func0\tfunc1\tfunc2\n");
    for (i = 0; i < icount; i++) {
        printf("%I64u\t%I64u\t%I64u\n", t0[i], t1[i], t2[i]);
    }
    return 0;
}
#pragma optimize("", on)
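
The raw numbers printed by the program still contain the fixed cost of the measurement itself (the rdtsc, call, ret and mov overhead mentioned at the top). One hedged way to reduce them is to take the smallest sample of each series, on the assumption that the minimum is the run least disturbed by cache misses and interrupts, and then subtract the func0 baseline. The helper names min_sample and net_cycles are made up for this sketch and are not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest value in a series of cycle counts. */
static unsigned long long min_sample(const unsigned long long *samples, size_t n)
{
    unsigned long long best;
    size_t k;

    assert(samples != NULL && n > 0);
    best = samples[0];
    for (k = 1; k < n; k++) {
        if (samples[k] < best)
            best = samples[k];
    }
    return best;
}

/* Hypothetical helper: measured minimum minus the baseline minimum,
   clamped at zero so a noisy baseline cannot produce a wrapped value. */
static unsigned long long net_cycles(unsigned long long measured,
                                     unsigned long long baseline)
{
    return measured > baseline ? measured - baseline : 0;
}
```

With the arrays from the program, net_cycles(min_sample(t2, icount), min_sample(t0, icount)) would give a rough figure for the __stosb call with the measurement overhead removed.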

"Vincent Fatica" <vince(a)blackholespam.net> wrote:
> On Wed, 9 Sep 2009 17:07:18 +0800, "xiaosi" <xiaosi(a)cn99.com> wrote:
>
> |It's strange that a loop is used instead of __stosb on non-_M_AMD64 CPUs.
> |On my AMD32 cpu, __stosb (114 clocks) is faster than this loop (195 clocks).
>
> How do you time such things?
> --
> - Vince
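
For context on the quoted comparison: the loop in question is presumably the byte-at-a-time store through a volatile pointer that winnt.h uses for SecureZeroMemory on non-AMD64 targets. The following is a minimal portable sketch of that shape, not the header's actual text; zero_volatile is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop variant: every byte is written through
   a volatile pointer, so the compiler may not drop or merge the stores. */
static void *zero_volatile(void *ptr, size_t cnt)
{
    volatile unsigned char *p = (volatile unsigned char *)ptr;

    assert(ptr != NULL || cnt == 0);
    while (cnt--) {
        *p++ = 0;
    }
    return ptr;
}
```

The volatile qualifier is what makes the zeroing "secure" (it survives optimization), and it is also a plausible reason the loop is slower than a single __stosb.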

From: Tim Roberts on
Giovanni Dicanio <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
>
>But I wonder why they #define'd RtlZeroMemory as an alias to memset in
>WinNT.h ...

Because "memset" is a compiler intrinsic that can be inlined to a "rep
stosb".
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.

From: Vincent Fatica on
On Wed, 09 Sep 2009 20:07:57 -0700, Tim Roberts <timr(a)probo.com> wrote:

|Giovanni Dicanio <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
|>
|>But I wonder why they #define'd RtlZeroMemory as an alias to memset in
|>WinNT.h ...
|
|Because "memset" is a compiler intrinsic that can be inlined to a "rep
|stosb".

But it's not inlined to "rep stosb" (at least by VC9).
--
- Vince
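
Whether memset is expanded inline, and into what, is easiest to settle by looking at the generated code. A small test case along these lines can be compiled with cl /O2 /FAs (the /FAs switch writes an assembly listing interleaved with the source). The struct and function names are invented for the example, and what the listing shows will depend on the compiler version and options.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 88-byte block, the same size as the si[] buffers in the
   timing program. */
struct block88 {
    unsigned char bytes[88];
};

/* Constant-size memset: a candidate for inline expansion by the compiler.
   Not static, so the function shows up under its own name in the listing. */
void clear_block(struct block88 *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
}
```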

From: Bo Persson on
Vincent Fatica wrote:
> On Wed, 09 Sep 2009 20:07:57 -0700, Tim Roberts <timr(a)probo.com>
> wrote:
>
>> Giovanni Dicanio <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
>>>
>>> But I wonder why they #define'd RtlZeroMemory as an alias to
>>> memset in WinNT.h ...
>>
>> Because "memset" is a compiler intrinsic that can be inlined to a
>> "rep stosb".
>
> But it's not inlined to "rep stosb" (at least by VC9).

Because "rep stosb" was fast once-upon-a-time (early 1980's, or so),
but the weird design of current processors actually makes them run
faster if you spell it all out explicitly. A short sequence of simple
instructions might run faster than a single specialized instruction.
Honest!


Bo Persson


From: Tim Roberts on
"Bo Persson" <bop(a)gmb.dk> wrote:
>
>Vincent Fatica wrote:
>>
>> But it's not inlined to "rep stosb" (at least by VC9).
>
>Because "rep stosb" was fast once-upon-a-time (early 1980's, or so),
>but the weird design of current processors actually makes them run
>faster if you spell it all out explicitly. A short sequence of simple
>instructions might run faster than a single specialized instruction.
>Honest!

Not true. If you're doing fewer than 7 or 8 iterations, you're right.
Beyond that, "rep stosd" wins. It does one dword per cycle, and it's hard
to beat that without getting into the more obscure instruction sets.
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.
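
The two positions above can be combined in a fill routine that switches on size: plain stores for short fills, the string instruction for long ones. This is only a sketch under stated assumptions. The threshold of 8 dwords is lifted from the "7 or 8 iterations" remark and would need measuring on a given CPU, zero_dwords and SMALL_DWORDS are hypothetical names, and on compilers without the MSVC __stosd intrinsic the long path simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      /* __stosd */
#define HAVE_STOSD 1
#endif

/* Assumed crossover point; the real break-even depends on the CPU. */
#define SMALL_DWORDS 8

/* Hypothetical helper: zero 'count' dwords (unsigned int is assumed to be
   32 bits wide, as it is on the x86 targets discussed here). */
static void zero_dwords(unsigned int *dst, size_t count)
{
    assert(dst != NULL || count == 0);
    if (count < SMALL_DWORDS) {
        /* Short fill: simple stores, no string-instruction startup cost. */
        while (count--)
            *dst++ = 0;
    } else {
#ifdef HAVE_STOSD
        /* Long fill: the intrinsic is emitted as "rep stosd". */
        __stosd((unsigned long *)dst, 0, count);
#else
        /* Other compilers: leave the choice to the library memset. */
        memset(dst, 0, count * sizeof(*dst));
#endif
    }
}
```

For the 88-byte buffers in the timing program, zero_dwords(buf, 22) would take the long path.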