From: Piotr Wyderski on 13 Aug 2007 19:32

Terje Mathisen wrote:

> There is one very important situation that Alpha did consider:
>
> If you have a misaligned string of less than 8/16 (Alpha/SSE) bytes from
> the end of the last allocated page, then you'll get a page fault by trying
> to read past the end of the page.
>
> Alpha handled this by having an instruction which could convert the lower
> three bits of an address into a mask, said mask to be used together with
> an aligned load (which it also had: load from 0xnnnn3 and you would get
> the quadword starting at 0xnnnn0). This made sure that all load operations
> were aligned, and therefore couldn't cross a page boundary.
>
> With SSE you don't have a similarly fast way of turning a potentially
> misaligned load into an aligned load + mask, and since average C string
> lengths are just a few bytes (single digits, AFAIR from reading some paper
> 10+ years ago), the first-block handling is _crucial_.

Yes, that is exactly the reason I convert all misaligned string base
addresses into down-aligned ones. This way you'll never cross a page
boundary, as the alignment process simply clears the four least
significant bits of the string address and can use the fast movaps
instruction. All you need is to mask out the "false positives" located
before the actual string itself, and the mask computation can be done
by a simple piece of general-purpose code which runs in parallel with
the SSE2-based comparator (namely, it is expected to slide under the
pmovmskb latency).

Best regards
Piotr Wyderski
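[Editor's note: for concreteness, here is a minimal sketch of the
first-block handling Piotr describes above, written as C with SSE2
intrinsics. It is an illustration under the stated assumptions, not
Piotr's actual code; the function name is made up and __builtin_ctz is
the GCC/Clang bit-scan builtin.]

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_cmpeq_epi8, ... */
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: strlen() whose first block is loaded from a
       down-aligned address, so no load ever crosses a page boundary. */
    static size_t sse2_strlen(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();
        const char   *base = (const char *)((uintptr_t)s & ~(uintptr_t)15);
        unsigned      skip = (unsigned)(s - base);  /* bytes before the string */

        /* movdqa/movaps-style aligned load: safe even if s is near a page end. */
        __m128i  block = _mm_load_si128((const __m128i *)base);
        unsigned mask  = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));

        /* Mask out the "false positives" that lie before the string itself;
           this scalar step can overlap the pmovmskb latency. */
        mask &= 0xFFFFu << skip;
        if (mask)
            return (size_t)__builtin_ctz(mask) - skip;

        /* From here on every load is 16-byte aligned. */
        for (;;) {
            base += 16;
            block = _mm_load_si128((const __m128i *)base);
            mask  = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));
            if (mask)
                return (size_t)(base - s) + (size_t)__builtin_ctz(mask);
        }
    }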
From: Piotr Wyderski on 13 Aug 2007 19:59

Terje Mathisen wrote:

> They don't need anything like 256 comparators though:

How is that possible? The documentation explicitly says that every item
of SIMD vector A is compared with every item of SIMD vector B and then
the results are merged accordingly. Since there are up to 16 entries in
a vector, it means that 256 byte-wide comparisons must be performed.
Complete unrolling in space requires 256 byte-wide comparators (or 16
128-bit-wide ones, if you prefer), which is a considerable piece of
silicon. Complete unrolling in time requires just one 128-bit comparator
and 16 passes, which is SLOW.

> What it does look like is an opcode that can only ever be used as part of
> library code, or in the form of specific canned operations that are output
> from a compiler.

But if an instruction is slow, then it is useless (provided that there
is a faster replacement). I am pretty sure that a carefully tuned
variant of pcmpeqb + ptest will give you a much faster
strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM? No,
thanks... :-)

> I bet you could use this opcode plus a store and have a two-instruction
> Turing-equivalent machine, using self-modification for all interesting
> stuff. :-)

Frankly, I had a very similar impression, and the term "overdesigned"
was the first (well, the first non-insulting...) word that came to
mind. :-)))

Best regards
Piotr Wyderski
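[Editor's note: a hedged sketch of the pcmpeqb + ptest idiom Piotr
refers to (ptest is SSE4.1, introduced alongside the SSE4 string
instructions): the hot loop stays in the SIMD domain and only drops to
pmovmskb once a zero byte is known to be present. The function name,
the aligned-start assumption, and the use of __builtin_ctz are the
editor's, not Piotr's tuned code.]

    #include <smmintrin.h>   /* SSE4.1 for _mm_testz_si128 (ptest) */
    #include <stddef.h>

    /* Illustrative only: find the terminating zero byte of a string
       assumed to start 16-byte aligned. */
    static size_t sse41_strlen_aligned(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();
        const char   *p    = s;

        for (;;) {
            __m128i block = _mm_load_si128((const __m128i *)p);
            __m128i eq    = _mm_cmpeq_epi8(block, zero);        /* pcmpeqb  */

            /* ptest: ZF is set iff eq is all-zero, i.e. no zero byte here. */
            if (!_mm_testz_si128(eq, eq)) {
                unsigned mask = (unsigned)_mm_movemask_epi8(eq); /* pmovmskb */
                return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
            }
            p += 16;
        }
    }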
From: Terje Mathisen on 14 Aug 2007 12:46

Piotr Wyderski wrote:

> Terje Mathisen wrote:
>
>> They don't need anything like 256 comparators though:
>
> How is that possible? The documentation explicitly says
> that every SIMD vector A item is compared with every
> SIMD vector B item and then the results are merged

I _really_ don't think that's what it means!

In fact, I'm 99% positive this instruction does NOT entail any form of
barrel shift/full crossbar to actually compare all bytes against all
bytes.

What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] < b[x])
parallel byte compares, with the immediate mask determining how these
intermediate results should be used/combined.

First, the W bit enables merging of pairs of initial results, trivial
except for the less-than (<) case, which has to combine the byte == and
< tests:

  xw[n] < yw[n]  =  (xb[2n+1] < yb[2n+1]) ||
                    ((xb[2n+1] == yb[2n+1]) && (xb[2n] < yb[2n]));

> accordingly. Since there are up to 16 entries in a vector,
> it means that 256 byte-wide comparisons must be performed.
> Complete unrolling in space requires 256 byte-wide comparators
> (or 16 128-bit-wide, if you prefer), which is a considerable
> piece of silicon. Complete unrolling in time requires just
> one 128-bit comparator and 16 passes, which is SLOW.

Forget about that, it simply cannot be the case.

>> What it does look like is an opcode that can only ever be used as part
>> of library code, or in the form of specific canned operations that are
>> output from a compiler.
>
> But if an instruction is slow, then it is useless (provided that
> there is a faster replacement). I am pretty sure that a carefully
> tuned variant of pcmpeqb + ptest will give you much faster
> strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM?
> No, thanks... :-)

Assuming it works as I (now) expect, it should have latency comparable
with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
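[Editor's note: a scalar rendering of the pair-merging rule Terje gives
above, purely for illustration; the names are the editor's, not vendor
pseudocode.]

    #include <stdbool.h>
    #include <stdint.h>

    /* Word-level "less than" built from the per-byte == and < results:
       the high byte decides unless the two high bytes are equal. */
    static bool word_lt_from_bytes(const uint8_t xb[2], const uint8_t yb[2])
    {
        bool hi_lt = xb[1] <  yb[1];
        bool hi_eq = xb[1] == yb[1];
        bool lo_lt = xb[0] <  yb[0];

        return hi_lt || (hi_eq && lo_lt);
    }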
From: John Mashey on 14 Aug 2007 13:54

On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...(a)hda.hydro.com> wrote:

> Piotr Wyderski wrote:
> > Terje Mathisen wrote:
>
> >> They don't need anything like 256 comparators though:

BUT, the very FIRST thing to do is to profile a wide range of programs,
see how much time is consumed by the str* functions, and decide whether
this is even worth arguing about [it might, or might not be; the last
time I did this was long ago, but at the time bcopy/memcpy was the only
high-runner].

The second thing, for code one owns, is to take a look and make sure
there aren't silly things like

  for (i = 0; i <= strlen(s); i++) {stuff}

since many compilers won't move the strlen(s) out of the loop.

Needless to say, profiling Dhrystone doesn't count :-)
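[Editor's note: a minimal before/after of the pattern John mentions; the
loop body and variable names are hypothetical. Re-evaluating strlen() in
the loop condition makes the loop quadratic unless the compiler can
prove s is not modified, so the length is hoisted out by hand.]

    #include <string.h>
    #include <stddef.h>

    /* Silly form: strlen(s) is re-evaluated on every iteration, O(n^2). */
    void count_slow(const char *s, int *hist)
    {
        for (size_t i = 0; i <= strlen(s); i++)
            hist[(unsigned char)s[i]]++;
    }

    /* Hoisted form: the length is computed once, O(n). */
    void count_fast(const char *s, int *hist)
    {
        size_t len = strlen(s);
        for (size_t i = 0; i <= len; i++)
            hist[(unsigned char)s[i]]++;
    }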
From: Piotr Wyderski on 14 Aug 2007 18:19
John Mashey wrote:

> BUT, the very FIRST thing to do is to profile a wide range of
> programs, see how much time is consumed by str* functions, and decide
> whether this is even worth arguing about [it might, or might not be;
> the last time I did this was long ago, but at the time, bcopy/memcpy
> was the only high-runner.]

In my case SIMD does help a lot, therefore I don't care about "a wide
range of programs". ;-) My programs should provide the highest
performance available; the remaining ones can be slow (in most cases)
or even should be slow (our competitors)...

> The second thing, for code one owns, is to take a look and make sure
> there aren't silly things like for (i = 0; i <= strlen(s); i++)

Indeed, but this kind of code transformation is the very first
optimization I do.

Best regards
Piotr Wyderski