From: gast128 on 11 Mar 2010 11:42

Hello all,

this problem may be difficult to explain, and I need some assembly to show the difference. In a DLL we export some STL containers to minimize code bloat, like:

    template class __declspec(dllexport) std::vector<int>;
    typedef std::vector<int> int_vector;

In a simple test program I now see a huge difference in performance. The C++ function is as follows (it does the same as std::fill, but this is just an example):

    void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
    {
        for (size_t n = 0; n != nLoop; ++n)
        {
            const int_vector::iterator itEnd = pVector->end();

            for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
            {
                *it = nValue;
            }
        }
    }

In the assembly code, exception handling has somehow been put in, and this gets updated in the loop, which is a major performance issue (see '//! <- difference'):

    void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
    {
    00401D30 push 0FFFFFFFFh
    00401D32 push offset __ehhandler$?PrfMemoryIterator@@YAXPAV?$vector@HV?$allocator@H@std@@@std@@HI@Z (403718h)
    00401D37 mov eax,dword ptr fs:[00000000h]
    00401D3D push eax
    00401D3E mov dword ptr fs:[0],esp
    00401D45 sub esp,4Ch
    00401D48 mov eax,dword ptr [___security_cookie (406270h)]
    00401D4D xor eax,esp
    00401D4F push edi
    00401D50 mov edi,ecx

    <snip>

    for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
    00401D7D lea ecx,[esp+4]
    00401D81 push ecx
    00401D82 mov ecx,ebx
    00401D84 call dword ptr [__imp_std::vector<int,std::allocator<int> >::begin (404004h)]
    00401D8A mov eax,dword ptr [esp+4]
    00401D8E cmp eax,dword ptr [esp+8]
    00401D92 je PrfMemoryIterator+79h (401DA9h)
    {
    *it = nValue;
    00401D94 mov dword ptr [eax],esi
    00401D96 mov eax,dword ptr [esp+4]   //! <- difference
    00401D9A mov ecx,dword ptr [esp+8]   //! <- difference
    00401D9E add eax,4
    00401DA1 cmp eax,ecx
    00401DA3 mov dword ptr [esp+4],eax   //! <- difference
    00401DA7 jne PrfMemoryIterator+64h (401D94h)

However, if we do not export the STL containers, the generated code is different:

    void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
    {
    00401F60 sub esp,44h
    00401F63 mov eax,dword ptr [___security_cookie (406290h)]
    00401F68 xor eax,esp
    00401F6A push edi
    00401F6B mov edi,ecx

    <snip>

    for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
    00401F86 mov eax,dword ptr [ebx+4]
    00401F89 cmp eax,ecx
    00401F8B je PrfMemoryIterator+39h (401F99h)
    00401F8D lea ecx,[ecx]
    {
    *it = nValue;
    00401F90 mov dword ptr [eax],esi
    00401F92 add eax,4
    00401F95 cmp eax,ecx
    00401F97 jne PrfMemoryIterator+30h (401F90h)

I use Visual Studio 2003 here, but I noticed something similar with the _SECURE_SCL option in Visual Studio 2008, which also makes a difference from a performance perspective.

Can anyone help? It is probably somewhere in the exception handling corner, but why would it make a difference whether the classes are exported or not?

Thx in advance.
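Since the post says the loop does the same job as std::fill, a baseline that calls the algorithm directly may be handy when comparing generated code. This is only a minimal sketch: PrfMemoryFill is a hypothetical name that does not appear in the post, and whether it avoids the slowdown with an exported vector has not been checked here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<int> int_vector;

// Hypothetical counterpart of PrfMemoryIterator (name not from the post):
// performs the same nLoop passes over the vector, but lets std::fill
// do the element-wise assignment.
void PrfMemoryFill(int_vector* pVector, int nValue, std::size_t nLoop)
{
    assert(pVector != 0);

    for (std::size_t n = 0; n != nLoop; ++n)
    {
        std::fill(pVector->begin(), pVector->end(), nValue);
    }
}
```

Calling PrfMemoryFill(&v, 42, 2) on a vector should leave every element equal to 42, the same end state the hand-written loop produces.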
From: Alexander Grigoriev on 11 Mar 2010 22:50

Normally, the STL-generated code can get heavily optimized and inlined. But if you export the code, the no-inline functions will be used.
From: gast128 on 12 Mar 2010 03:18

On Mar 12, 4:50 am, "Alexander Grigoriev" <al...(a)earthlink.net> wrote:
> Normally, the STL-generated code can get heavily optimized and inlined. But
> if you export the code, the no-inline functions will be used.
>
> > 00401D92 je PrfMemoryIterator+79h (401DA9h)
> > {
> > *it = nValue;
> > 00401D94 mov dword ptr [eax],esi
> > 00401D96 mov eax,dword ptr [esp+4]   //! <- difference
> > 00401D9A mov ecx,dword ptr [esp+8]   //! <- difference
> > 00401D9E add eax,4
> > 00401DA1 cmp eax,ecx
> > 00401DA3 mov dword ptr [esp+4],eax   //! <- difference
> > 00401DA7 jne PrfMemoryIterator+64h (401D94h)

Yes, but an optimizer could conclude from the assembly code that it stores and loads the value of eax again and again in [esp + 4]. Even the ecx register gets reloaded all the time, without being changed in the loop. So my conclusion would be that it is somehow essential that this eax value gets written back to [esp + 4] in the loop; otherwise it may be a bug. I also do not use the volatile keyword, so the optimizer is free to use all its power.
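If the cost really comes from the imported, non-inlined begin()/end() and from iterator objects that end up living on the stack, one possible workaround is to fetch the range once per pass and walk it with raw pointers, so the inner loop touches only locals whose address is never taken. The sketch below is an assumption, not something proposed in the thread: PrfMemoryPointer is a hypothetical name, and it has not been measured against an exported std::vector<int>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<int> int_vector;

// Hypothetical variant of PrfMemoryIterator (name not from the thread).
// size() and operator[] are called once per outer pass; the inner loop
// uses plain int* locals only.
void PrfMemoryPointer(int_vector* pVector, int nValue, std::size_t nLoop)
{
    assert(pVector != 0);

    if (pVector->empty())
        return;  // &(*pVector)[0] must not be evaluated on an empty vector

    for (std::size_t n = 0; n != nLoop; ++n)
    {
        int* p = &(*pVector)[0];
        int* const pEnd = p + pVector->size();

        for (; p != pEnd; ++p)
        {
            *p = nValue;
        }
    }
}
```

The trade-off is that the code now relies on std::vector storing its elements contiguously, which holds for every conforming implementation but gives up the iterator abstraction (and any checked-iterator diagnostics such as those behind _SECURE_SCL).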
From: gast128 on 14 Mar 2010 19:48

I made two changes to the original code:
1) use a const_iterator as the end iterator
2) pulled the iterator declaration out of the for loop

And now the values of the iterator aren't reloaded again and again in the for loop. No idea why; could a compiler specialist help here?

    void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
    {
        PRF_FUNCTION();

        for (size_t n = 0; n != nLoop; ++n)
        {
            const int_vector::const_iterator itEnd = pVector->end();
            int_vector::iterator it;

            for (it = pVector->begin(); it != itEnd; ++it)
            {
                *it = nValue;
            }
        }
    }

I also saw another nice effect (which may or may not be related): 'Inconsistent inlining of C++ class template member functions across DLLs'
https://connect.microsoft.com/VisualStudio/feedback/details/511979/inconsistent-inlining-of-c-class-template-member-functions-across-dlls
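To compare variants like the ones in this thread, a small timing harness can help. PRF_FUNCTION() is not shown in the thread, so the sketch below leaves it out and uses a hypothetical helper, TimeFillMs, built on std::chrono instead. That requires C++11, so it would not compile with Visual Studio 2003 or 2008. The FillFn typedef is likewise an assumption made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

typedef std::vector<int> int_vector;
typedef void (*FillFn)(int_vector*, int, std::size_t);

// The revised loop from the post, without the PRF_FUNCTION() macro.
void PrfMemoryIterator(int_vector* pVector, int nValue, std::size_t nLoop)
{
    for (std::size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::const_iterator itEnd = pVector->end();
        int_vector::iterator it;

        for (it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}

// Hypothetical helper (not from the thread): runs fn once and returns
// the elapsed wall-clock time in milliseconds.
double TimeFillMs(FillFn fn, int_vector* pVector, int nValue, std::size_t nLoop)
{
    assert(fn != 0 && pVector != 0);

    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    fn(pVector, nValue, nLoop);
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;

    return elapsed.count();
}
```

Passing each variant through the same helper with the same vector size and loop count keeps the comparison fair. Single measurements are noisy, so repeating the run a few times and reading the results alongside the disassembly is advisable.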