Prev: Apple II Debugger
Next: TMA Assembler?
From: Skybuck on 22 Aug 2006 07:07 Phil Carmody schreef: > "Terry Russell" <trochilus(a)RoEpMtOuVsEnet.com.au> writes: > > 256 byte lookup table > > fill algorithm...30-400 clocks per byte > > Woh! you need a better fill algorithm. > > Can I offer you a 16-byte lookup table, and some back-to-basics > shift-by-4 and add? > > Or the old paired LFSR technique. I say 'old', but as far as > I know no-one's ever used it. When searching the literature for > one possible application, I was unable to find any reference to > it amongst dozens of techniques in hundreds of papers. Spin > 2 in opposite directions, et voila. x86 Parity flag useful. Bring it on dude. Show us some pseudo code, or real code. Bye, Skybuck.
From: [Jongware] on 22 Aug 2006 08:00 "Skybuck" <skybuck2000(a)hotmail.com> wrote in message news:1156244725.780172.107680(a)m73g2000cwd.googlegroups.com... > > [Jongware] declared that: >> This is a winner every time. > > Are you sure ? > > I need proof :) > >> >> .mirrortable db 0, 80h, 40h, 0c0h, 20h, 0a0h, 60h, 0e0h, 10h, 90h >> ... (etc. -- rest of look-up table omitted) >> >> -- and whereever you need the mirrored byte of AL >> xor ebx,ebx >> mov bl,al >> mov al, ds:[mirrortable+ebx] > > Delphi only uses 2 instructions for > > SwapBits := LookUp[Byte]; > > It generates something like: > > movzx ax, byte > movzx ax, [table+ax] > > or something like that. > > That's two instructions versus your three instructions. > > So I doubt your three instructions would be faster, unless it can be > paired ? No competition here -- I meant to clarify what the xlat opcode itself does. I was racking my brain for how to load the low byte into a 4-byte register. while I was pondering s.o. in the other thread supplied 'movzx', and I guess that's what most compilers would translate a byte-read to. My point on the xlat / read-from-table method is that it is hundreds of times faster than *calculating* the value every time you need it. In effect, it is so much faster you don't even have to resort to an assembler routine any more! But I take it the original discussion has digressed :) "Are you also divergent, friend?" -- Twelve Monkeys [jw]
From: Shark8 on 22 Aug 2006 10:50 Here you are Skybuck, my version on the bit-reversal. It's pure Delphi, and admitedly similar to some that have been submitted (but then there's only a limited number of ways to do a specific task correctly), but may be insightful. ============================================================= program BitReverse; {$APPTYPE CONSOLE} uses SysUtils; type TBitMask = set of 0..7; Procedure WriteByte( const B:Byte ); var Index : Byte; Bit : TBitMask absolute B; begin for Index:= 7 downto 0 do if (Index in Bit) then write( '1' ) else write( '0' ); Writeln; end; Function ReverseBits( Const B: Byte ): Byte; var Index : Byte; OriginalBits : TBitMask absolute B; ResultBits : TBitMask absolute Result; begin Result:= $FF; // Initialize Result to all present for Index:= 7 downto 0 do // Loop through all 8 bits if NOT (Index in OriginalBits) then // If a bit is NOT present... { ResultBits:= ResultBits - [7-Index]; // exclude it at the mirror pos. } Exclude( ResultBits, 7-Index ); // Same as above, but quicker. end; var TestVal : Byte = $AF; begin WriteByte( TestVal ); // Write origional value. WriteByte( ReverseBits(TestVal) ); // Write reversed value. Readln; // Insert pause for readability. end.
From: Andreas Kochenburger on 22 Aug 2006 11:22 On 22 Aug 2006 07:50:17 -0700, "Shark8" <OneWingedShark(a)gmail.com> wrote: >Here you are Skybuck, my version on the bit-reversal. It's pure Delphi, >and admitedly similar to some that have been submitted (but then >there's only a limited number of ways to do a specific task correctly), >but may be insightful. For fun here's a pretty fast algo for 32-bit in C which compiles easily into efficient machine code. unsigned int revere (register unsigned int x) { x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1)); x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2)); x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4)); x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8)); return((x >> 16) | (x << 16)); } A linear lookup-table lookup would be impossibly large. A comparison of a similar code generated by gcc versus Delphi, GnuPascal and FreePascal would be interesting. Who could do that? Andreas ------- 1 + 1 = 3, for large values of 1.
From: Skybuck on 22 Aug 2006 11:23
f0dder schreef: > Skybuck wrote: > > f0dder schreef: > > > >> A 256-byte LUT can easily fit in any CPU L1 cache. Typically a > >> cacheline is 32 bytes long, so you'd maximally get (256/32) 8 cache > >> misses before the entire LUT is cached. I'm pretty sure that this is > >> hard to beat no matter which fancy instruction set you're going to > >> use... > > > > Hello, > > > > Thank you for trying to explain it. > > > > I still do not quite understand it completely maybe you can clearify > > something: > > > > Does the above theory only apply to the xlat instruction or any > > instruction/memory access that uses the lookup table technique ? > > Any instruction that references memory will cause an entire cache line to be > read (or written?). Iirc it's been like this since at least the 486. Things > are a bit more complex than this, though, especially on recent > architectures... check the intel and AMD docs, and Agner Fog's optimization > manuals. What exactly is a cache line ? Is this like a "line of memory" in the neighbourhood of where the data from/to the memory address was read/written ? Bye, Skybuck. |