From: xiaosi on 9 Sep 2009 10:47

I use __rdtsc()[1], the results include the time of rdtsc itself and call ret mov (17 clocks on my AMD32 cpu when cache is all hit).

[1]
#define WIN32_LEAN_AND_MEAN
#include <windows.h> //SecureZeroMemory
#include <stdio.h>
#include <intrin.h>

__declspec(noinline) void __stdcall func0() {
    volatile int i = 0;
}

__declspec(noinline) void __stdcall func1() {
    long si[88/4];
    SecureZeroMemory(si, sizeof(si));
}

__declspec(noinline) void __stdcall func2() {
    long si[88/4];
    __stosb((unsigned char*)si, 0, sizeof(si));
}

#pragma optimize("gt", on)
#define icount 10
int __cdecl main() {
    int i;
    unsigned long long t, t0[icount], t1[icount], t2[icount];
    for (i = 0; i < icount; i++) {
        t = __rdtsc();
        func0();
        t0[i] = __rdtsc() - t;
    }
    for (i = 0; i < icount; i++) {
        t = __rdtsc();
        func1();
        t1[i] = __rdtsc() - t;
    }
    for (i = 0; i < icount; i++) {
        t = __rdtsc();
        func2();
        t2[i] = __rdtsc() - t;
    }
    printf("func0\tfunc1\tfunc2\n");
    for (i = 0; i < icount; i ++) {
        printf("%I64u\t%I64u\t%I64u\n", t0[i], t1[i], t2[i]);
    }
    return 0;
}
#pragma optimize("", on)

"Vincent Fatica" <vince(a)blackholespam.net> wrote:
> On Wed, 9 Sep 2009 17:07:18 +0800, "xiaosi" <xiaosi(a)cn99.com> wrote:
>
> |It's strange that why to use loop instead of __stosb on none _M_AMD64 cpu.
> |On my AMD32 cpu, __stosb (114 clocks) is faster than this loop (195 clocks).
>
> How do you time such things?
> --
> - Vince
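Since every sample in the listing above includes the fixed cost of the rdtsc pair and the call/ret, one way to read the output is to take the smallest sample per column and subtract the func0 column from the others. The following is only a sketch of that post-processing in portable C. The helper names min_sample and net_clocks are hypothetical and not part of the posted program, and taking the minimum is an assumption about which sample is least disturbed by noise.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest sample in the array. Assumption: the minimum is the run that
   was least disturbed by cache misses, interrupts, and similar noise.
   Requires n > 0. */
unsigned long long min_sample(const unsigned long long *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    unsigned long long best = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* Hypothetical helper: cost of the measured function with the fixed
   overhead (the rdtsc pair plus call/ret, as measured by an empty
   baseline function) removed. Clamps at zero so that noise cannot
   produce a wrapped-around unsigned result. */
unsigned long long net_clocks(const unsigned long long *func_samples,
                              const unsigned long long *baseline_samples,
                              size_t n)
{
    unsigned long long f = min_sample(func_samples, n);
    unsigned long long b = min_sample(baseline_samples, n);
    return f > b ? f - b : 0;
}
```

With the arrays from the posted main(), net_clocks(t1, t0, icount) would give an estimate for func1 alone and net_clocks(t2, t0, icount) for func2 alone.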
From: Tim Roberts on 9 Sep 2009 23:07

Giovanni Dicanio <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
>
>But I wonder why they #define'd RtlZeroMemory as an alias to memset in
>WinNT.h ...

Because "memset" is a compiler intrinsic that can be inlined to a "rep stosb".
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.
From: Vincent Fatica on 10 Sep 2009 00:39

On Wed, 09 Sep 2009 20:07:57 -0700, Tim Roberts <timr(a)probo.com> wrote:

|Giovanni Dicanio <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
|>
|>But I wonder why they #define'd RtlZeroMemory as an alias to memset in
|>WinNT.h ...
|
|Because "memset" is a compiler intrinsic that can be inlined to a "rep
|stosb".

But it's not inlined to "rep stosb" (at least by VC9).
--
- Vince
From: Bo Persson on 10 Sep 2009 16:22

Vincent Fatica wrote:
> On Wed, 09 Sep 2009 20:07:57 -0700, Tim Roberts <timr(a)probo.com>
> wrote:
>
>> Giovanni Dicanio <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
>>>
>>> But I wonder why they #define'd RtlZeroMemory as an alias to
>>> memset in WinNT.h ...
>>
>> Because "memset" is a compiler intrinsic that can be inlined to a
>> "rep stosb".
>
> But it's not inlined to "rep stosb" (at least by VC9).

Because "rep stosb" was fast once-upon-a-time (early 1980's, or so), but the weird design of current processors actually makes them run faster if you spell it all out explicitly. A short sequence of simple instructions might run faster than a single specialized instruction. Honest!

Bo Persson
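To make the two approaches being contrasted concrete, here is a rough sketch in portable C: one routine that spells the stores out, one that hands the whole range to a bulk fill, and a dispatcher that picks between them by size. All three function names are hypothetical. memset only stands in for a string instruction such as "rep stosd", since portable C cannot request one directly (on MSVC the __stosd intrinsic could be tried instead). The cutoff is a caller-supplied assumption that would have to be measured on the target CPU. An optimizing compiler may also turn the loop into a memset call on its own, so the generated code should be checked before drawing conclusions from timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Explicit stores: one simple store per dword, "spelled out". */
void zero_dwords_loop(uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = 0;
}

/* Bulk fill: memset is a stand-in for a single string instruction. */
void zero_dwords_bulk(uint32_t *dst, size_t count)
{
    memset(dst, 0, count * sizeof *dst);
}

/* Hypothetical dispatcher: explicit stores below the cutoff, bulk fill
   at or above it. The cutoff value is an assumption to be tuned by
   measurement, not a known constant. */
void zero_dwords(uint32_t *dst, size_t count, size_t cutoff)
{
    assert(dst != NULL || count == 0);
    if (count < cutoff)
        zero_dwords_loop(dst, count);
    else
        zero_dwords_bulk(dst, count);
}
```

For the 88-byte buffer used earlier in the thread, the call would be zero_dwords((uint32_t *)si, 22, cutoff), with cutoff chosen from timings on the machine in question.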
From: Tim Roberts on 11 Sep 2009 23:26

"Bo Persson" <bop(a)gmb.dk> wrote:
>
>Vincent Fatica wrote:
>>
>> But it's not inlined to "rep stosb" (at least by VC9).
>
>Because "rep stosb" was fast once-upon-a-time (early 1980's, or so),
>but the weird design of current processors actually makes them run
>faster if you spell it all out explicitly. A short sequence of simple
>instructions might run faster than a single specialized instruction.
>Honest!

Not true. If you're doing less than 7 or 8 iterations, you're right. Beyond that, "rep stosd" wins. It does one dword per cycle, and it's hard to beat that, without getting into the more obscure instruction sets.
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.