From: randyhyde@earthlink.net on 16 Mar 2006 12:16

o//annabee wrote:
> On Thu, 16 Mar 2006 16:57:09 +0100, randyhyde(a)earthlink.net
> <randyhyde(a)earthlink.net> wrote:
>
> > I see you've got your head buried in the same whole in the sand that
> > Rene does. Ignoring reality just because you don't like it is a sign of
> > insanity, you know?
>
> Well. Dont know, but I think its spelled "Hole" anyway. "Whole" is meaning
> more like "Complete".

Believe me, proof-reading posts to this newsgroup would be a waste of
time. But if you want to play grammar police, you're in a very weak
position to do so.

> Which reminds me. Where can we download the 6 non-trival masterful
> applications you have written in assembly?

Grab the examples download from the HLA downloads page on Webster.

> I looked at at webster, but
> couldnt find anything but christian resources, which I though wore odd.

Anything but, eh? Learn the basics of a web browser, dude.

> > So why are you complaining if you actually believe this?
>
> I m not complaining. But I think when you have such wivid imagination, you
> should make this daydream more realistic. Unless of course this "breaking
> others code component" is important to you.

Yes. Making sure that HLA is the best possible assembly language, even if
it means breaking a *few* source files out there that use constructs that,
by experience, have been shown to be "less than the best", is important to
me. Unlike Rene, I'm not so egotistical as to believe that I got everything
right the first time around. And I'm certainly not so stubborn as to argue
that obvious design flaws (such as the inability to link static library
modules in RosAsm) shouldn't be corrected.

Of course, before you go off on how bad the HLA design might be, I might
also point out that although there *have* been some changes to the
language, most of them have been additions rather than deletions. Though
breaking user code is *definitely* a possibility with any change, in
reality this occurs very rarely. Indeed, the latest go-round that Sevag is
talking about is *not* in the language, but in the standard library; and
that's a different animal altogether.

Cheers,
Randy Hyde
From: sevagK on 16 Mar 2006 21:25

randyhyde(a)earthlink.net wrote:
> > This was tested on an AMD 64, 3700+ running win2000,
>
> And mine was tested on a PIV running XP.
>
> Randy Hyde

I have an AMD 64 3400+ and the two routines run with your routine just
*slightly* faster. After 100 iterations of each loop, Wannabee's version
falls behind by an average of about 400.

Of course, when dealing with billions of iterations, the score adds up,
but if you're pinching cycles, you probably won't be using macros!

-sevag.k
www.geocities.com/kahlinor
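A rough reading of that figure, and only a reading, since the post does not
say whether the 400 is a total or a per-pass number: if it is cycles
accumulated over the 100 passes, then 400 / 100 = about 4 cycles of
difference per iteration, which is in the range you would expect from the
extra test and jump that the macro adds to each pass.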
From: o//annabee on 17 Mar 2006 00:17

On Thu, 16 Mar 2006 17:57:44 +0100, randyhyde(a)earthlink.net
<randyhyde(a)earthlink.net> wrote:

> Well, I don't have your particular CPU, but when I run this code on a
> PIV, here are the results that I get:
>
>   My code   Your Code
>   8418      c76c
>   8600      c76c
>   8590      c7d0
>   8568      c9b4
>   8648      c970

I just ran the tests, with the identical code, on the AthlonXP as well.
The tests there show that you are indeed correct on this one. The timings
I get for the Athlon are:

  4f65
  7549

So this is true for an old AMD, and must have changed for the new one.
What's the ticks-per-second rate for the PIV?

> Your version seems to be about 50% slower than mine on a PIV. Again, I
> don't have access to your CPU, so I can't verify your numbers, but
> if you look at the actual code generated by RosAsm for the two
> routines:
>
> Well, the difference becomes pretty obvious. What you're trying to tell
> me is that a loop with 50% more instructions, that is,
> is actually *faster*? Hmmm... It sure seems like *my* measurements are
> a lot more intuitive. That is, the code with 50% more instructions
> (yours) runs 50% slower. That AMD CPU is quite amazing indeed, if
> this is really the case.

Yes. In this case you're right. On older CPUs like the PIV and the Athlon
your code is faster. But who can explain the timings on the 64-bit CPU?

> If you look at the two pieces of disassembled code, I think that this
> alone should scare people away from using macros if they want the
> fastest possible code. And, btw, I want to emphasize *macros*, not
> *RosAsm macros*. You get the same problem whether the macro was written
> for RosAsm, MASM, HLA, FASM, or whatever.

OK. Anyway, I am a bit surprised that an old Athlon could give better
clockings than a PIV?

> What I *have* claimed is that MASM's implementation of "if" statements
> is *better* than the macros that come with RosAsm. This is because MASM
> is a bit smarter about this stuff. You will also discover that HLA's
> "while" loop generates the "test for loop at the end" rather than the
> same code that RosAsm generates. Now perhaps that fails to be better
> code on your particular AMD CPU, I cannot verify that as I do not have
> access to that CPU. But an inspection of the code and measurements that
> I've made suggest that putting the branch at the bottom of the loop and
> removing an extra jump is *much* better coding indeed.

Yes. But on newer hardware it doesn't seem to make much difference.

> Yes, not to mention your failure to serialize before the second rdtsc
> in each example.

I tried this of course, but it does nothing for the timings either way,
so I felt it was unneeded to have them there.

> But that still doesn't explain the 50% difference that
> *I* see on a PIV. And the difference I see is right in line with the
> number of instructions. Imagine that.

Yes. You are correct. But who will explain why this larger code runs
faster on an AMD64?

> On *your* CPU, things like pairable instructions and branch prediction
> *could* be why the two loops execute in a similar amount of time. It's
> not like the PIV is a paragon of great microcoding. But it *really*
> smells like you've made an error somewhere. I'd suggest that you try
> putting several additional instructions in the loop and see what
> happens then. That would counter any bizarre instruction pairing
> phenomenon that is going on.

OK. Code posted below.

> > This was tested on an AMD 64, 3700+ running win2000,
>
> And mine was tested on a PIV running XP.
> > TestProc:
> >
> >     cpuid
> >     rdtsc | push eax
> >     mov ecx D$n
> >     xor eax eax
>
> ; You've just discovered the problem with
> ; relative local labels here. Do you see
> ; the problem in this code? This is
> ; *exactly* why I refused to put this
> ; lame form of local labels into HLA.
> ; Earlier assemblers I'd written
> ; had relative local labels and I
> ; saw this problem *far* too often.

Yes, but this means nothing here (the forward reference in "jecxz L0>"
binds to the first L0: below it, which is the loop top, so with ecx = 0
the "dec ecx" would wrap around and the loop would run for the full
32-bit range instead of being skipped). This code is not running in a
library and it is guaranteed to never execute with ecx = 0. Actually this
is a guarantee. It _will_ never execute in that way. But yes, I should
have seen it anyhow.

> >     jecxz L0>
> >     Align 16
> > L0:
> >
> >     add eax ecx
> >     dec ecx
> >     jnz L0<
> > L0:

> Another issue - Caching effects are not allowed for in this code. The
> way you executed it, by running and then stopping, guarantees that the
> code will *not* be in the cache when you run it.

Unless I misunderstand what you mean, this is what happens to code like
this in a real situation.

> What you should
> *really* do is run each code fragment in a loop a couple of times and
> then use the last measurement.

OK. Posted below.

> That way, everything is in cache and
> you'll get more realistic readings. Indeed, the reason your timings may
> be so close is because the memory subsystem on your PC is sub-par and
> what you're really measuring is the amount of time it takes to read
> data from main memory.

I use TwinMOS DDR 3200 dual 512 chips on an ABIT KN8 Ultra board with an
nVidia chipset. Surely not the fastest money can buy, but it is what I
could handle at the time.

> Cheers,
> Randy Hyde

Code for the timings with more instructions inside the loop. And again,
you are correct. When adding more instructions in the loop, your code
gains a foothold and wins. Amazing.

TestProc:

    cpuid
    rdtsc | push eax
    mov ecx 10000
    xor eax eax

    Align 16
    while ecx > 0
        add eax ecx
        add edx ebx
        add esi edi
        add edx ebx
        add esi edi
        add edx ebx
        add esi edi
        add edx ebx
        add esi edi
        dec ecx
    End_While

    rdtsc | pop ebx
    sub eax ebx
    int 3               ; AFEF

    cpuid
    rdtsc | push eax
    mov ecx D$n
    xor eax eax

    jecxz L0>
    Align 16
L0:
    add eax ecx
    add edx ebx
    add esi edi
    add edx ebx
    add esi edi
    add edx ebx
    add esi edi
    add edx ebx
    add esi edi
    dec ecx
    jnz L0<
L0:

    rdtsc | pop ebx
    sub eax ebx
    int 3               ; 9CD7
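Since this part of the thread turns on two concrete points, serializing
around rdtsc and timing out of a warm cache, here is a minimal sketch of
the measurement pattern Randy is describing, written in RosAsm-style
syntax to match the code above. It is an illustration only, not code from
either poster: the label TimingSketch and the data names D$Repeat and
D$LastTiming are invented for the example, D$n is the iteration count
already used above, and the repeat count of 4 is an arbitrary number of
warm-up passes.

; Sketch only (not from the thread): serialized RDTSC timing of a
; bottom-tested loop, repeated so the last pass runs from a warm cache.

TimingSketch:

    mov D$Repeat 4       ; a few warm-up passes, then one measured pass

L0:
    cpuid                ; serialize: earlier instructions have retired
    rdtsc | push eax     ; keep the low 32 bits of the start count

    mov ecx D$n
    xor eax eax

    jecxz L2>            ; skip the loop cleanly when ecx = 0
    Align 16
L1:
    add eax ecx          ; loop body under test
    dec ecx
    jnz L1<              ; single conditional branch at the bottom
L2:

    cpuid                ; serialize again before the second read
    rdtsc | pop ebx
    sub eax ebx          ; elapsed cycles (low 32 bits) for this pass
    mov D$LastTiming eax ; only the last, cache-warm pass is worth keeping

    dec D$Repeat
    jnz L0<

    ret

Whether the cpuid before the second rdtsc changes anything on a given CPU
is exactly what the two posters disagree about; the sketch only shows
where the serialization and the warm-up passes go.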
From: o//annabee on 17 Mar 2006 00:39

On Thu, 16 Mar 2006 18:15:56 +0100, Betov <betov(a)free.fr> wrote:

> o//annabee <fack(a)szmyggenpv.com> wrote in news:op.s6ik2cnqce7g4q(a)bonus:
>
> Courage: Kill him ! Kill him !

I think this one will not stop talking even in death.

> At the end, he will point
> you to the pathetic "HLA Advantures" game, that we suffer
> since, now,... how many _years_ exactly?...

That's not assembly. So it doesn't count.

> :]]]]]
>
> Betov.
>
> < http://rosasm.org >
From: o//annabee on 17 Mar 2006 01:59

On Fri, 17 Mar 2006 03:25:27 +0100, sevagK <kahlinor(a)yahoo.com> wrote:

> randyhyde(a)earthlink.net wrote:
> >
> > > This was tested on an AMD 64, 3700+ running win2000,
> >
> > And mine was tested on a PIV running XP.
> >
> > Randy Hyde
>
> I have an AMD 64 3400+ and the two routines run with your routine just
> *slightly* faster.

That sounds more and more incredible. In all tests here, on the new
machine, the RosAsm code ran faster for all iterations, and the same
difference applied for 10000 and for 2 iterations. For the Athlon, the
RosAsm version was nearly twice as slow.

> After 100 iterations of each loop, Wannabee's version falls behind by
> an average of about 400.

What does this mean? Where is the code you tested, and are you to be
understood that the code ran slower and slower?

> Of course, when dealing with billions of iterations, the score adds up,
> but if you're pinching cycles, you probably won't be using macros!

These are my timings on the AMD64 3700+ for 1_000_000_000 iterations:

  Randall: 774C_E5F7
  Mine:    774C_C9B3

> -sevag.k
> www.geocities.com/kahlinor
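Taking those two figures at face value, and assuming they are the low
32 bits of the cycle counter as in the earlier tests: 774C_E5F7 minus
774C_C9B3 is 1C44 hex, roughly 7,200 cycles, out of roughly 2.0 billion
cycles total for the billion iterations. That is a difference of about
0.0004 percent, so on that AMD64 the two loops are, for practical
purposes, tied.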