From: Tim McCaffrey on 14 Aug 2007 20:00

In article <gs29p4-k7n.ln1(a)osl016lin.hda.hydro.com>,
terje.mathisen(a)hda.hydro.com says...
>
> Piotr Wyderski wrote:
>> Terje Mathisen wrote:
>>
>>> They don't need anything like 256 comparators though:
>>
>> How is that possible? The documentation explicitly says
>> that every SIMD vector A item is compared with every
>> SIMD vector B item and then the results are merged
>
> I _really_ don't think that's what it means!
>
> In fact, I'm 99% positive this instruction does NOT entail any form of
> barrel shift/full crossbar to actually compare all bytes against all
> bytes.
>
> What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x],
> a[x] < b[x]) parallel byte compares, with the immediate mask
> determining how these intermediate results should be used/combined.

The Intel manual really does say something different:

http://softwarecommunity.intel.com/UserFiles/en-us/File/Intel%20SSE4%20Programming%20Reference-v1.002.pdf

> First, the W bit enables merging of pairs of initial results, trivial
> except for the less-than case, which has to combine the byte == and <
> tests:
>
>    xw[n] < yw[n] = (xb[2n+1] < yb[2n+1]) ||
>                    ((xb[2n+1] == yb[2n+1]) && (xb[2n] < yb[2n]));
>
>> accordingly. Since there are up to 16 entries in a vector,
>> it means that 256 byte-wide comparisons must be performed.
>> Complete unrolling in space requires 256 byte-wide comparators
>> (or 16 128-bit-wide, if you prefer), which is a considerable
>> piece of silicon. Complete unrolling in time requires just
>> one 128-bit comparator and 16 passes, which is SLOW.
>
> Forget about that, it simply cannot be the case.
>
>>> What it does look like is an opcode that can only ever be used as part
>>> of library code, or in the form of specific canned operations that are
>>> output from a compiler.
>>
>> But if an instruction is slow, then it is useless (provided that
>> there is a faster replacement).
>> I am pretty sure that a carefully
>> tuned variant of pcmpeqb + ptest will give you much faster
>> strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM?
>> No, thanks... :-)
>
> Assuming it works as I (now) expect, it should have latency comparable
> with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.

One way to implement the "equal any" mode would be to use 256 byte
compares (another would be to implement the loop from the pseudo-code).
However, you could also build a 256-entry, 1-bit-wide lookup table from
one xmm register, and index into it with each byte of the other xmm
register.

Interesting instruction. I bet the range mode could also help with
audio/video compression.

Also note that these instructions do not have an alignment restriction.

- Tim
From: Terje Mathisen on 15 Aug 2007 06:27

John Mashey wrote:
> On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...(a)hda.hydro.com>
> wrote:
>> Piotr Wyderski wrote:
>>> Terje Mathisen wrote:
>>>> They don't need anything like 256 comparators though:
>
> BUT, the very FIRST thing to do is to profile a wide range of
> programs, see how much time is consumed by str* functions, and decide
> whether this is even worth arguing about [it might, or might not be;
> the last time I did this was long ago, but at the time, bcopy/memcpy
> was the only high-runner.]

This is of course the reason only REP MOVS has been targeted until now.

> The second thing, for code one owns, is to take a look and make sure
> there aren't silly things like for (i = 0; i <= strlen(s); i++)
> {stuff}, since many compilers won't move the strlen(s) out of the
> loop.

A bad programmer can write bad code in any language; the loop above
doesn't even come close. What about re-initializing an entire 3D
workspace, with all objects/lists/surfaces etc., between every frame?
Even a "RealityEngine N+1" would have at least some problems at that
point. :-(

> Needless to say, profiling Dhrystone doesn't count :-)

No, there was no need to tell me that.

Anyway, with the extreme flexibility of those new opcodes, and assuming
they do run in "CMP reg,reg" time, it more or less becomes a
limited-functionality sub-processor.

Parallel state machines?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on 15 Aug 2007 06:44

Tim McCaffrey wrote:
> In article <gs29p4-k7n.ln1(a)osl016lin.hda.hydro.com>,
> terje.mathisen(a)hda.hydro.com says...
>> Piotr Wyderski wrote:
>>> Terje Mathisen wrote:
>>>
>>>> They don't need anything like 256 comparators though:
>>> How is that possible? The documentation explicitly says
>>> that every SIMD vector A item is compared with every
>>> SIMD vector B item and then the results are merged
>> I _really_ don't think that's what it means!
>>
>> In fact, I'm 99% positive this instruction does NOT entail any form of
>> barrel shift/full crossbar to actually compare all bytes against all
>> bytes.
>>
>> What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x],
>> a[x] < b[x]) parallel byte compares, with the immediate mask
>> determining how these intermediate results should be used/combined.
>
> The Intel manual really does say something different.
>
> http://softwarecommunity.intel.com/UserFiles/en-us/File/Intel%20SSE4%20Programming%20Reference-v1.002.pdf

That was the manual I was reading, but not the initial part (2.3.1),
where it does state "up to 256 compares".

WOW! What can I say? Mea culpa. :-(

This is such a break with all the CPU architecture design I've
witnessed over the last 20+ years; this is even more of an
"everything_including_the_kitchen_sink" opcode.

It must use a really big hunk of transistors: it is a true coprocessor,
tightly integrated, with implicit registers/flags for input/output.

The way they've remapped all the flag register fields to allow a single
PCMP?STR* instruction to return all sorts of information at once is
quite amazing. It really looks like an attempt to make 8 and 16-bit
searching _very_ fast. I.e. this looks like FAST's original idea, where
my old "Algorithms and Data Structures" class worked for years to have
students construct and implement a 16-wide parallel search engine.

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Tim McCaffrey on 15 Aug 2007 15:44

In article <622bp4-3o4.ln1(a)osl016lin.hda.hydro.com>,
terje.mathisen(a)hda.hydro.com says...
>
> This is such a break with all the CPU architecture design I've
> witnessed over the last 20+ years; this is even more of an
> "everything_including_the_kitchen_sink" opcode.
>
> It must use a really big hunk of transistors: it is a true coprocessor,
> tightly integrated, with implicit registers/flags for input/output.
>
> The way they've remapped all the flag register fields to allow a single
> PCMP?STR* instruction to return all sorts of information at once is
> quite amazing. It really looks like an attempt to make 8 and 16-bit
> searching _very_ fast. I.e. this looks like FAST's original idea, where
> my old "Algorithms and Data Structures" class worked for years to have
> students construct and implement a 16-wide parallel search engine.

I can't put my finger on why I feel this way, but I'm willing to make a
(small) bet that this is what Mr. Glew has been working on for the last
few years. It just seems like his "style".

I notice it breaks a lot of the PPro rules as well: it uses a GPR, one
XMM register, one (unrestricted-alignment) memory access (or another
XMM register), and produces both a GPR and a flags result, all in one
instruction. It must cause all kinds of havoc synchronizing the
execution pipes.

- Tim
From: Terje Mathisen on 16 Aug 2007 02:18
Tim McCaffrey wrote:
> I can't put my finger on why I feel this way, but I'm willing to make a
> (small) bet that this is what Mr. Glew has been working on for the last
> few years.

I'd take that (very small) bet: I'll buy you a beer the next (first?)
time we meet if you're right. (If you're not, forget about it, I don't
like beer. :-)

> It just seems like his "style".

Andy, if this is indeed you, then I'd like (at some point in time) to
get the full story about what important workload could conceivably get
sufficient speedup from this.

> I notice it breaks a lot of the PPro rules as well: it uses a GPR, one
> XMM register,

Which is why I don't entirely like to think about Andy being behind it.

> one (unrestricted-alignment) memory access (or another XMM register),
> and produces both a GPR and a flags result, all in one instruction. It
> must cause all kinds of havoc synchronizing the execution pipes.

Right. This cannot really be a part of the usual pipes at all; it seems
like a huge hunk of semi-independent hardware, but with a private pipe
in/out of the register array.

The main hiccup wouldn't be the need to read potentially unaligned
input data, but the way it has to return information to three different
locations, all implied: XMM0 for the SIMD data, ECX for the index
found, and the flags for all sorts of branchable indications.

OTOH, we've always had a few special-case instructions with extra
outputs: MUL and DIV are the most obvious, returning two integer
registers plus flags. Writing to an SSE register in parallel with a
normal integer return might actually be easier than writing to two
integer registers at once?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"