From: Phil Carmody on 20 Aug 2006 13:49

"Terry Russell" <trochilus(a)RoEpMtOuVsEnet.com.au> writes:
> 256 byte lookup table
> fill algorithm...30-400 clocks per byte

Woh! you need a better fill algorithm.

Can I offer you a 16-byte lookup table, and some back-to-basics
shift-by-4 and add?

Or the old paired LFSR technique. I say 'old', but as far as I know
no-one's ever used it. When searching the literature for one possible
application, I was unable to find any reference to it amongst dozens
of techniques in hundreds of papers. Spin 2 in opposite directions,
et voila. x86 Parity flag useful.

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes
of /In God We Trust, Inc./.
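[Editor's sketch] Phil's "16-byte lookup table and shift-by-4" amounts to reversing each nibble through a 16-entry table and swapping the two halves. A minimal Delphi sketch of that idea (not from the post; the table and function names are mine):

  const
    RevNibble: array[0..15] of Byte =
      ($0, $8, $4, $C, $2, $A, $6, $E, $1, $9, $5, $D, $3, $B, $7, $F);

  { Hedged sketch: reverse a byte by reversing each nibble and swapping halves. }
  function ReverseByte(B: Byte): Byte;
  begin
    Result := (RevNibble[B and $0F] shl 4) or RevNibble[B shr 4];
  end;

For example, ReverseByte(109) gives 182, matching the 109 <-> 182 pair Skybuck uses further down the thread.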
From: Terry Russell on 20 Aug 2006 14:48

"Phil Carmody" <thefatphil_demunged(a)yahoo.co.uk> wrote in message
news:87sljri9gz.fsf(a)nonospaz.fatphil.org...
> "Terry Russell" <trochilus(a)RoEpMtOuVsEnet.com.au> writes:
>> 256 byte lookup table
>> fill algorithm...30-400 clocks per byte
>
> Woh! you need a better fill algorithm.
>
> Can I offer you a 16-byte lookup table, and some back-to-basics
> shift-by-4 and add?
>
> Or the old paired LFSR technique. I say 'old', but as far as
> I know no-one's ever used it. When searching the literature for
> one possible application, I was unable to find any reference to
> it amongst dozens of techniques in hundreds of papers. Spin
> 2 in opposite directions, et voila. x86 Parity flag useful.

The fill was a guess at the range of algorithms proffered, probably
more like 50 without trying too hard. I was mostly noting that it is
comparable to the time to load a precalculated table from disk, and
that a lookup becomes independent of the fill algorithm very quickly.

Of course you could speed up the filler, but if it is used seldom it
isn't worth it, and if it is used often it isn't worth it. 128 KB of
common byte operations precalculated in the exe, getting towards one
cent of RAM and 0.01 cent of disk space, is not a real budget breaker.

Byte/bit optimisation of microcent-per-day resources is fine, but the
greater efficiency gain may be in allowing the kilobuck-a-week user to
make full use of their time, multi-kilobuck resources, and
kilobuck-a-quarter support costs.

Take window shifting and bitmaps, huge chunks of data: do you write a
lot of slow tiling code for generic systems, or specify (demand) at
least a $50 later-model card with gigapixel bitmap support and have
the hardware do your 20k x 20k bitmap transform?
From: Crazy on 21 Aug 2006 05:12

OK... so the reason this is even needed is because of the Motorola
68000 series CPUs. I found this to be required when working with older
versions of the Novell NetWare APIs.

Intel = LSB
Motorola = MSB

In 99% of cases, the BIT order of a BYTE is BIG ENDIAN.

Read THESE:
http://www.cs.umass.edu/~verts/cs32/endian.html
or
http://www.webopedia.com/TERM/b/big_endian.html
or
http://en.wikipedia.org/wiki/Endianness

Now I haven't tested this, but this should work, I think:

Function ReturnBE(value: word): word;
var
  StoreBE: word;
begin
  StoreBE := Lo(value);
  StoreBE := StoreBE shl 8;
  StoreBE := StoreBE + Hi(value);
  ReturnBE := StoreBE;
end;

So (I think) this would look like, in binary, with value = 10241:

--> value = 10241                    ; in bin = 00101000:00000001
--> StoreBE := Lo(value)             ; result in bin = 00000000:00000001
--> StoreBE := StoreBE shl 8         ; result in bin = 00000001:00000000
--> StoreBE := StoreBE + Hi(value)   ; result in bin = 00000001:00101000

Now if you get into 32-bit numbers this won't work, because you have:

------ LSB ------:-------MSB ------
---LSB--:---MSB--:---LSB--:---MSB--
00000000:00000000:00000000:00000000

Now if you get into 64-bit numbers this won't work, because you have:

----------------LSB----------------:----------------MSB ---------------
------ LSB ------:-------MSB ------:------ LSB ------:-------MSB ------
---LSB--:---MSB--:---LSB--:---MSB--:---LSB--:---MSB--:---LSB--:---MSB--
00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000

At least >> I THINK << but I could damn well be wrong on the 32- and
64-bit part (a sketch of the 32-bit case follows after this post). I am
not sure if the processor sees them as pure bits or as groups of bits
(bytes) that make up a pure 32-bit or 64-bit number.

Man am I sleepy.

- Crazy

Skybuck wrote:
> Woops, sorry, I did not know comp.lang.asm.x86 is a moderated
> newsgroup. As I don't like moderated newsgroups I post it again and
> exclude comp.lang.asm.x86:
>
> Reversing bit order in Delphi?
>
> Hello,
>
> I think Intel uses most significant bit first?
>
> However there are some computers which use least significant bit first.
>
> So what would be the fastest way in pure Delphi code to reverse the
> bits?
>
> For example:
>
> [intel system, most significant bit first]
>
> Bit order:
>
> 7, 6, 5, 4, 3, 2, 1, 0
>
> 128, 64, 32, 16, 8, 4, 2, 1
>   0,  1,  1,  0, 1, 1, 0, 1 = 109
>
> Reversing the bits would give:
> 1, 0, 1, 1, 0, 1, 1, 0 = 182
>
> [ibm system or something else, least significant bit first]
>
> Bit order:
>
> 0, 1, 2, 3, 4, 5, 6, 7
>
> 1, 2, 4, 8, 16, 32, 64, 128
> 0, 1, 1, 0,  1,  1,  0,   1 = 182
>
> Reversing the bits would give:
> 1, 0, 1, 1, 0, 1, 1, 0 = 109
>
> Reversing the bits in Delphi code might be too slow; if that's the
> case, is there a special (Intel/AMD) assembler instruction which
> might do the trick?
>
> Note:
>
> This message/thread is cross posted to:
>
> alt.comp.lang.borland-delphi, alt.lang.asm
>
> Bye,
> Skybuck.
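[Editor's sketch] A hedged Delphi sketch (not from any post; the name SwapBytes32 is mine) of how the same byte swap extends to a 32-bit value, which the Lo/Hi trick above cannot handle directly:

  { Hedged sketch: swap all four bytes of a 32-bit value (byte order, not bit order). }
  function SwapBytes32(Value: LongWord): LongWord;
  begin
    Result := ((Value and $000000FF) shl 24) or
              ((Value and $0000FF00) shl 8)  or
              ((Value and $00FF0000) shr 8)  or
              ((Value shr 24) and $000000FF);
  end;

For instance, SwapBytes32($00012801) returns $01280100.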
From: [Jongware] on 21 Aug 2006 05:39

"Skybuck" <skybuck2000(a)hotmail.com> wrote in message
news:1156016326.599168.43620(a)75g2000cwc.googlegroups.com...
>> xlat.
>> At the cost of a 256-byte table, it is easiest and fastest to just
>> look the byte up. It even used to be a well-known trick (you can
>> mirror bitmap images this way).
>> As a side note I might add that though 'xlat' *is* a special
>> instruction for table lookup, it's usually faster (!) to use the
>> regular memory read functions. (That's from hear-say!)
>>
>> [Jongware]
>
> Euhm can you provide some assembler code?
>
> Then maybe I can enter you into the competition! ;)

This is a winner every time.

.mirrortable db 0, 80h, 40h, 0c0h, 20h, 0a0h, 60h, 0e0h, 10h, 90h
  ... (etc. -- rest of look-up table omitted)

-- and wherever you need the mirrored byte of AL:

  xor ebx, ebx
  mov bl, al
  mov al, ds:[mirrortable+ebx]

-- or whatever register you need the data in. You don't have to clear
the offset register ebx every time, but, since it can be a source of
weird bugs, I've included it for clarity.

Requirement: the table itself (256 bytes).
Cost: as near to zero as you can imagine.

I used this particular trick in my ZX Spectrum era, with its
monochrome screen. Drawing any image mirrored took only 1 CPU clock
tick per byte more than drawing the original.

The Intel xlat command does the same in a single instruction, but I
understand it's being 'phased out' and translated internally to the
sequence I describe above (and at the cost of a higher clock count).

[Jongware]
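[Editor's sketch] Since the rest of the 256-entry table is omitted above, here is a hedged Delphi sketch (not from the post; names are mine) that builds an equivalent mirror table at startup instead of typing it out. Its first entries (0, $80, $40, $C0, $20, ...) match the hand-written ones above.

  var
    MirrorTable: array[0..255] of Byte;

  { Hedged sketch: fill the mirror table bit by bit at program start. }
  procedure InitMirrorTable;
  var
    I, Bit: Integer;
    B: Byte;
  begin
    for I := 0 to 255 do
    begin
      B := 0;
      for Bit := 0 to 7 do
        if (I and (1 shl Bit)) <> 0 then
          B := B or (1 shl (7 - Bit));
      MirrorTable[I] := B;
    end;
  end;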
From: Skybuck on 22 Aug 2006 07:05
[Jongware] schreef:
> "Skybuck" <skybuck2000(a)hotmail.com> wrote in message
> news:1156016326.599168.43620(a)75g2000cwc.googlegroups.com...
> >> xlat.
> >> At the cost of a 256-byte table, it is easiest and fastest to just
> >> look the byte up. It even used to be a well-known trick (you can
> >> mirror bitmap images this way).
> >> As a side note I might add that though 'xlat' *is* a special
> >> instruction for table lookup, it's usually faster (!) to use the
> >> regular memory read functions. (That's from hear-say!)
> >>
> >> [Jongware]
> >
> > Euhm can you provide some assembler code?
> >
> > Then maybe I can enter you into the competition! ;)
>
> This is a winner every time.

Are you sure? I need proof :)

> .mirrortable db 0, 80h, 40h, 0c0h, 20h, 0a0h, 60h, 0e0h, 10h, 90h
>   ... (etc. -- rest of look-up table omitted)
>
> -- and wherever you need the mirrored byte of AL:
>   xor ebx, ebx
>   mov bl, al
>   mov al, ds:[mirrortable+ebx]

Delphi only uses 2 instructions for SwapBits := LookUp[Byte];

It generates something like:

  movzx ax, byte
  movzx ax, [table+ax]

or something like that. That's two instructions versus your three
instructions. So I doubt your three instructions would be faster,
unless they can be paired? But maybe the other two are paired as
well... I don't know... probably not, since they rely on each other ;)

> The Intel xlat command does the same in a single instruction, but I
> understand it's being 'phased out' and translated internally to the
> sequence I describe above (and at the cost of a higher clock count).

xlat is slower than the two Delphi-generated instructions, see the
test program ;)

Of course more testing should be done. For example, in the future I
might test "swapping" a whole array of bytes... to see if the
performance chart remains the same ;) (a sketch of such an array test
follows this post).

For now I have utter urgent matters to attend to, so I am not going to
test your method for now... maybe in the future though... just for
curiosity's sake ;)

I expect your method to be 1% slower than the leading method :)
However, maybe your method will be even slower because you are using
EBX, which needs to be pushed and popped from the stack... so let's
make that 2% ;)

Bye,
Skybuck.
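[Editor's sketch] A hedged sketch (not from the thread) of the whole-array test Skybuck mentions, assuming the MirrorTable and InitMirrorTable from the earlier sketch:

  { Hedged sketch: mirror the bits of every byte in a buffer via the lookup table. }
  procedure MirrorBuffer(var Buf: array of Byte);
  var
    I: Integer;
  begin
    for I := Low(Buf) to High(Buf) do
      Buf[I] := MirrorTable[Buf[I]];
  end;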