From: Kerem Gümrükcü on
Hi Joseph,

>Inline assembly code? Scary...

I won't say scary... I'd rather say a "necessity" for some
operations that must be performed as fast as possible. You as
a professional and very experienced developer should know
what I am talking about: especially in the case of mathematical
calculations and algorithms, there is nothing that can beat
direct CPU operations with dedicated instructions and fast
register access. You can strip down your code and compact it
as much as possible, write your own prolog and epilog, and
create your own operative mechanism. But you are right, many
developers still "underrate" the power of assembly in high-level
languages, and most of them still think that assembly is
difficult and something like black magic. A friend of mine, a
professional .NET developer, still thinks that assembly is dead.
That's "scary" to me. I also do .NET development, and very
often, but there is nothing that can beat assembly and C/C++.
Operating directly on the hardware is the closest level to
speed and fills the gap when runtime libraries are too slow or
are missing features. On the other hand, if you are not used to
assembly you can slow down something that could be much faster
with the right high-level code and good compiler optimization.
This happens when someone without good knowledge of low-level
coding emits a lot of "intermediate" assembly instructions
which could, e.g., be done with one special instruction in a
single row and cycle, instead of throwing out a lot of moves,
pushes, jumps and arithmetic operations. I always highly
recommend that everybody learn at least the basics of assembly
language. Even if someone never uses assembly, it makes you
understand how function calls, memory operations and "native
debugging" work. A very good book on this was one I read years
ago by John Robbins, named "Debugging Windows". Highly
recommended! I don't know whether this book is still up to date
(new releases?), but even today it is a very good book for
learning how basic debugging works...


Regards

K.


--
-----------------------
Beste Grüsse / Best regards / Votre bien dévoué
Kerem Gümrükcü
Microsoft Live Space: http://kerem-g.spaces.live.com/
Latest Open-Source Projects: http://entwicklung.junetz.de
-----------------------
"This reply is provided as is, without warranty express or implied."


From: Kerem Gümrükcü on
This seems to be the latest release, .NET debugging,...nah, ILDASM
and go,...

http://www.amazon.com/Debugging-Microsoft-NET-2-0-Applications/dp/0735622027/ref=pd_bbs_sr_1/105-9103892-2545257?ie=UTF8&s=books&qid=1205088615&sr=1-1

Regards

K.

--
-----------------------
Beste Grüsse / Best regards / Votre bien dévoué
Kerem Gümrükcü
Microsoft Live Space: http://kerem-g.spaces.live.com/
Latest Open-Source Projects: http://entwicklung.junetz.de
-----------------------
"This reply is provided as is, without warranty express or implied."


From: Joseph M. Newcomer on
As a professional, one thing I know (and I spent years working on optimizing compilers,
and worked for a company that produced them) is that nearly all the time the compiler can
produce better code than a programmer writing assembly code.

What is scary is that you are creating code which is expensive to write, expensive to
debug, and expensive to maintain, without a quantitative justification for the performance
improvement.

I have a friend who didn't like the code he was getting, so he wrote a program that
transformed his computation into a collection of grossly-ugly goto-style code. This code
ran substantially faster, and it had the advantage that his algorithms were always written
in C, then compiled into C that the (rather weak) optimizing compiler he had available to
him would compile into more efficient code. Key here was that he never actually would
write code as ugly as his tool produced, but it didn't matter. He didn't write that code.
And his tool was generally useful for a variety of problems, not just one particular piece
of code (all involved repetitive array computations). His machine had no cache, so he
didn't worry about cache hits. He later extended it to do function inlining (his C
compiler had no __inline directive) and actually did get an order of magnitude performance
improvement.

Note that the only way to measure code performance is in the release version; no debug
code can ever be used as a benchmark or a criterion for determining computational cost.
Use #pragma to turn on every possible optimization in the code (note that some
optimizations are not "generally" safe, such as assuming pointers do not alias, but in the
context of, say, a mathematical computation on arrays it will buy a lot). When possible,
use inlines. Optimizations such as strength reduction, loop unrolling, alpha motion,
omega motion, and common subexpression elimination can often be done more efficiently by
writing in C/C++.

I have been using optimizing compilers since 1969, and the number of times I have found
myself able to beat the compiler is vanishingly small. To me, assembly code is easy to
read, but not cost-effective to write.

Example: we had to do an FFT of integer data. I converted the data from integer to double
in a copy of the array, passed it to the FFT subroutine, took the converted array,
converted it back to integers for plotting, and we could not detect the impact of this on
the overall performance; it appeared to operate in "real time". The major overheads of
the operation were the new/delete of the double array and the new/delete of the int array,
and in a real-time system they were still unnoticeable. The major computation was in the
FFT algorithm, a proprietary algorithm developed in MATLAB by a numerical methods expert;
MATLAB emitted C code to do the computation.

We carefully examined the compiler-generated code in the FFT subroutine. Two
assembly-code experts could not come up with anything significantly faster...we might have
managed to get 3%-5% out of it at best, not worth the effort.

Writing assembly code can, under extreme conditions, get you as much as a factor of 2
performance improvement. Changing your code to maximize L1 and L2 cache hits
can buy you a factor of 10 to 20, while remaining in C/C++. If you really care about
performance, data access organization is vastly more important than the cost of an
instruction in an inner loop. So if you are concentrating on instructions, you are
missing the high-payoff optimizations which are not code optimizations, but architecture
optimizations (you change your algorithm).

Note that if you are working on large data arrays, paging can become the dominant problem.
A page fault costs you about six orders of magnitude performance. All the assembly code
in the world will not "reimburse" you for a single page fault.

Some years ago, I wrote the world's fastest storage allocator, and to do this I did NOT
consider assembly code, but used a high-level language comparable to Ada. It had four
levels of abstraction between the user and the actual memory blocks. A good optimizing
compiler, with strong hints from __inline directives, can reduce three levels of
abstraction to half an instruction. The equivalent of malloc, had we actually had an
__inline capability (which was going to be in a future compiler release) would have been
__inline PVOID allocate(int n)
{
    if(n > limit)
        return general_allocator(n);      /* oversize request */
    else
    {
        int bucket = (n + quantum - 1) / quantum;
        PVOID result = head[bucket];
        if(result == NULL)
            return general_allocator(n);  /* free list empty */
        else
        {
            /* pop the free-list head: head[bucket] = result->next */
            head[bucket] = *(PVOID *)result;
            return result;
        }
    }
}

which in our compiler, had we done the inlining, would have generated the equivalent of
mov eax, head[7]
test eax, eax
jne $1
push 28
call general_allocator
jmp $2
$1: mov ebx, DWORD PTR[eax]
mov head[7], ebx
$2:

Note that limit was a compile-time constant and n almost always was a compile-time
constant. It would have taken 5 instructions to allocate storage in most cases, which
means it would take, in a modern machine, <10ns to do a storage allocation (single thread
assumption here) [it took us 5us, because on that machine, it was one instruction/us,
2000-3000 times slower than a modern machine]. We didn't need to write it in assembler to
get that performance. (As it turned out, because of parameter passing, it took us 4 extra
instructions to call, and because at that point the value n was no longer a CTC, it took 3
extra instructions to implement the if-test, so inlining would have bought nearly a
factor-of-2 performance increase in allocation with zero effort on our part. It was not a
high priority because the allocator accounted for < 1% of the total execution time in an
allocation-heavy application where we would allocate and free tens of thousands of
objects).

I've written hundreds of thousands of lines of assembly code in my career; possibly as
many as half a million. For cost-effectiveness, nothing beats a good optimizing compiler.
For performance, with very rare exceptions, nothing beats a good optimizing compiler.

If my goal was to write the fastest possible inner loop for a mathematical computation, I
would be spending my time worrying about cache hits first. Maximize cache hits. Hmmm.
Now that I stop to think about it, my CPUID Explorer was done because I had to optimize
some numeric code based on cache sizes, and needed to know the cache architecture...and
yes, you can get an order of magnitude performance improvement. Without writing a single
line of assembly code, I got better than a factor of 10 improvement.

Assembly code is a last resort. The last three times I used it, I used it because I
needed to execute very low-level code, such as CPUID and RDTSC, not supported in the C
language.

Think algorithms, not code. Think architecture, not instructions.
joe

On Sun, 9 Mar 2008 19:46:23 +0100, "Kerem Gümrükcü" <kareem114(a)hotmail.com> wrote:

>[...]
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Alexander Grigoriev on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:vcg8t313hoeo4q5e39sff6rimjkltgbfe8(a)4ax.com...
>
> Assembly code is a last resort. The last three times I used it, I used it
> because I
> needed to execute very low-level code, such as CPUID and RDTSC, not
> supported in the C
> language.
>

And now there are __rdtsc and __cpuid intrinsics.


From: Joseph M. Newcomer on
Which are important, because the x64 compilers do not support assembly code insertions.
joe

On Sun, 9 Mar 2008 21:38:52 -0700, "Alexander Grigoriev" <alegr(a)earthlink.net> wrote:

>[...]
>And now there are __rdtsc and __cpuid intrinsics.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm