Prev: Calculating longword pointer, which method is faster ?
Next: SMT exploiting 21264-like clustering?
From: Seebs on 27 May 2010 16:04 On 2010-05-27, James Harris <james.harris.1(a)googlemail.com> wrote: > On 27 May, 17:06, "Skybuck Flying" <IntoTheFut...(a)hotmail.com> wrote: >> I think to get rid of a branch (branches slow down cpu's) and thereby speed >> up the code. If you need a branch for this, you've done it wrong. >> Would you rather write: >> // 1. >> Z := X shl Y; >> // or >> >> // 2. >> if Y < 32 then >> begin >> � � Z := X shl Y; >> end else >> begin >> � � Z := X; >> end; > If you are talking about x86 don't be afraid of branches. Instead, be > afraid of unpredictable branches. Further, from examples I've seen in > the past the processors can make a surprisingly good job of predicting > sequences we might consider random. For that matter, why are you writing something this low-level to begin with? And what on earth is it doing cross-posted to comp.lang.c, when it's obviously nothing to do with C? > It does help if you can present it stable sequences: longish runs of Y >< 32, longish runs of Y >= 32 in your example. Yes. It also helps if you think about the algorithm in advance, because it is patently obvious that shifting a word by more than the word size is meaningless, and means that the entire operation you're coming in from was nonsense. The problem with taking a long walk off a short pier isn't that walking is mistakenly producing undefined behavior. It's that the instructions are either wrong, or intentionally producing that behavior. -s -- Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
From: Seebs on 27 May 2010 16:02 On 2010-05-27, James Harris <james.harris.1(a)googlemail.com> wrote: > I'm not going to try and defend him but having seen his posts for some > time I don't think he's trolling. Could be. > His interests are or have been video > processing. He puts a lot of effort into getting the best x86 > instruction sequences from his Delphi compiler. The rest is mainly > communication style. Could be, but it's a communications style which seems rude to me, and the outrage at the idea that a processor might require you to give it only semantically valid instructions strikes me as a bad sign. -s -- Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
From: Skybuck Flying on 27 May 2010 22:59 "Seebs" <usenet-nospam(a)seebs.net> wrote in message news:slrnhvtkn9.i7r.usenet-nospam(a)guild.seebs.net... > On 2010-05-27, James Harris <james.harris.1(a)googlemail.com> wrote: >> On 27 May, 17:06, "Skybuck Flying" <IntoTheFut...(a)hotmail.com> wrote: >>> I think to get rid of a branch (branches slow down cpu's) and thereby >>> speed >>> up the code. > > If you need a branch for this, you've done it wrong. Perhaps... I was starting to wonder if there is a better way than using a lookup table. Today is a new today with a fresh look at things :) Now that the basic algorithm is done it's easier to see where problems might be. There are two or three different problems which are slightly different from each other. 1. The first problem is calculating a mask to represents all one's for the lowest bits up to bitcount. So if bitcount is 5, bits 0 to 4 most be set. Bit count can go up to 32 for a longword. Bit count could even go up to 64 for a int64. But this case can be ignored for now. So the limit of bit count 32 is good enough for now. Since shr 32 and shl 32 is not possible a solution is needed. A fast one too for that matter. Noone has yet provided one in there examples, perhaps because the examples where limited to 16 bit words. Anyway since there are only 5 bits available for shl and shr there is a logical problem for intel: "Do we support range 0 to 31 or do we support range 1 to 32 ?" Intel choose to support range 0 to 31. Logically/interestingly shr 0 and shl 0 do absolutely nothing ! Yet I have heard nobody complain about this apperently lack of functionality, which is just as useless as doing shr 32 and shl 32... The last could have even been more usefull but ok. Depends on the situation perhaps. The conclusion therefore can be/most be that shr 0 and shl 0 are usefull after all ! If "we" can wrap around the bitcount from 32 back to 0 than at least we will prevent "garbage". Now the question is, is it usefull to wrap around ? This will probably depend on the code and the situation. So let's examine it: The technique used so far to calculate a mask is to set the mask initially to all one's as follows: 1111111-32-one's-11111111 Depending on the bit count this masks needs to be shifted left. If bitcount is 1 the mask should have all one's except one zero as follows: 1111111-31 one's-11111110 The zero will represent the content bit later on. It's clear to me that with the current masking formula only 31 bits can be shifted out. Example: (1 shl 31)-1 = 0111 1111 1111 1111 1111 1111 1111 1111 (I just found a nice button on the windows calculator it's called Lsh (Left shift I guess) it can be used to do these calculations... nice) The problem is clear: the last bit is missing. I was thinking, maybe a "bitcount mod 32" might help so let's see what happens: 32 mod 32 would become zero. The formula will be: not ( -1 shl 0 ); -1 shl 0 = -1 -1 = all bits set. not (-1) = must therefore be logically all bits cleared. This is not what we want... we want all bits set for a bitcount of 32. Therefore there is clearly a problem. So we need a different solution. The algorithm exits on bitcount 0 so the mask of all one's is therefore useless. Perhaps the mask can be changed as follows: not ( -2 shl ? ); This would assume bitcount is always 1 or greater.... therefore we can subtract 1 from the bitcount. the formula would then be: not ( -2 shl (BitCount-1) ) Now let's see if this works for BitCount 1, 5 and 32 as verifications. Case 1: not ( -2 shl (1-1) ) = not ( - 2 shl 0 ) = not(-2) Logically all bits are set except the last one. (Assuming two complements computers, have a nice time figuring out how to support it in "portable C" ;) :)) Therefore it gets changed into: etc0000000000001 Which is indeed what we want. Now let's continue with the 32. Case 32: not( -2 shl (32-1) ) = not ( -2 shl 31 ) Logically all bits will be pushed out except the first one... So I would expect it to be all zero's... notting all zero's gives a mask of all one's. Which is what we want. For now the problem with "masking" seems to be solved. However the problem for shifting "content" still needs to be looked. I am not yet sure if it can be solved. Bye, Skybuck.
From: Skybuck Flying on 28 May 2010 12:01 > However the problem for shifting "content" still needs to be looked. I am > not yet sure if it can be solved. There is now one remaining problem with the algorithm: mSomething := Content shr (BitCount - Shift); The BitCount can again range from 0 to 32. The Shift can range from 0 to 31. Thus BitCount 32 - 0 is 32 So shr 32 is a problem. mSomething should become zero when shr 32 is done. Shr 0 will leave the content intact which would be wrong. Any solutions ? For now I only see a branch as a solution. Bye, Skybuck.
From: Skybuck Flying on 28 May 2010 12:02 "Skybuck Flying" <IntoTheFuture(a)hotmail.com> wrote in message news:82ee6$4bffe8c5$54190f09$1670(a)cache3.tilbu1.nb.home.nl... >> However the problem for shifting "content" still needs to be looked. I am >> not yet sure if it can be solved. > > There is now one remaining problem with the algorithm: > > mSomething := Content shr (BitCount - Shift); > > The BitCount can again range from 0 to 32. > > The Shift can range from 0 to 31. > > Thus BitCount 32 - 0 is 32 > > So shr 32 is a problem. > > mSomething should become zero when shr 32 is done. > > Shr 0 will leave the content intact which would be wrong. > > Any solutions ? > > For now I only see a branch as a solution. It's kinda tricky too... to try and get a branch working. Just checking if BitCount is 32 would not be enough.... Since BitShift might be 5 and then it needs to shift and so forth. So it actually needs to branches probably something like: if (mBitCount=32) and (mShift = 0) then begin mSomething := 0; end; Bye, Skybuck.
|
Next
|
Last
Pages: 1 2 Prev: Calculating longword pointer, which method is faster ? Next: SMT exploiting 21264-like clustering? |