From: Kenneth 'Bessarion' Boyd on 8 Dec 2009 08:47

On Dec 7, 1:55 pm, Goran Pusic <gor...(a)cse-semaphore.com> wrote:
> On Dec 6, 1:07 am, Le Chaud Lapin <jaibudu...(a)gmail.com> wrote:
> > Certain x86 memory movement instructions are much faster than calls to
> > memcpy, which simply employs those same instructions internally along
> > with unnecessary overhead.

Back when I actually tested this (Win16, ~1996): it was the DLL-imported
memmove that acted like it was using those instructions (and even then the
DLL import overhead was barely measurable). The DLL-imported version of
memcpy caused a 1.5x slowdown relative to memmove. So I changed all of my
code targeting Windows to always use memmove rather than mess with
assembly programming. It also makes porting to *NIX much easier, not
having to think about assembly language.

> (I find it hard to believe that your concern is code size; it's about
> speed, right? If so...)
>
> Are you sure about that overhead? I just made the smallest possible
> memmove function I could think of (cld, init esi/edi/ecx, rep movsd).

That reads like memcpy to me: memmove has to be safe when the source and
destination memory blocks overlap, while memcpy doesn't. When I checked,
the speed difference between a hand-written assembly memcpy and a
hand-written assembly memmove on Intel wasn't measurable.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
From: Goran Pusic on 9 Dec 2009 05:56

On Dec 9, 1:41 am, Le Chaud Lapin <jaibudu...(a)gmail.com> wrote:
> Having written quite a bit of x86 assembly in Ye Olden Days, I find it
> hard to believe that the difference is "statistically irrelevant"
> between a movs and the 200+ instructions in the full version of memcpy,
> at least 95 of which get executed for a stock operator=. That's
> excluding stack manipulation and function calls.
> Did you try to code it faster?

It's been a couple of days since this started ;-).

It's clearly a question of data size. If it's big enough, memcpy versus
asm won't matter, because the time needed for rep movsd (which is what
the MS CRT memcpy uses) will swamp everything else.

But, now that you voiced your disbelief of my utterly opaque, yet highly
scientific test ;-), I thought: perhaps my struct size was too big
(~22K). I tried with a smaller one, ~12K. Nope, still the same. (Stock
PC, 32-bit code on 64-bit Windows.) ~8K, same. (Again, inline or not,
doesn't matter.)

And finally, I saw a strange thing as I approached 4K: suddenly, the
stock operator= (which __is__ memcpy) and a manual memcpy became faster
than my asm! That's something I didn't expect. Must be related to some
hardware effect memcpy knows about and I don't. (Either that, or there's
a flaw in my test.) But now that I have been through all the moves, I am
convinced: you should leave your optimization idea aside. It's
__false__. Try applying the first rule of optimization:

1. make the code faster by changing the design to eliminate hotspots
   (preceded by: find the hotspots)

Goran.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]