From: Tim McCaffrey on 14 Aug 2007 20:00

In article <gs29p4-k7n.ln1(a)osl016lin.hda.hydro.com>,
terje.mathisen(a)hda.hydro.com says...
>
> Piotr Wyderski wrote:
>> Terje Mathisen wrote:
>>
>>> They don't need anything like 256 comparators though:
>>
>> How is that possible? The documentation explicitly says
>> that every SIMD vector A item is compared with every
>> SIMD vector B item and then the results are merged
>
> I _really_ don't think that's what it means!
>
> In fact, I'm 99% positive this instruction does NOT entail any form of
> barrel shift/full crossbar to actually compare all bytes against all
> bytes.
>
> What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x],
> a[x] < b[x]) parallel byte compares, with the immediate mask
> determining how these intermediate results should be used/combined.

The Intel manual really does say something different:

http://softwarecommunity.intel.com/UserFiles/en-us/File/Intel%20SSE4%20Programming%20Reference-v1.002.pdf

> First, the W bit enables merging of pairs of initial results, trivial
> except for the less-than case, which has to combine the byte == and <
> tests:
>
>    xw[n] < yw[n] = (xb[2n+1] < yb[2n+1]) ||
>                    ((xb[2n+1] == yb[2n+1]) && (xb[2n] < yb[2n]));
>
>> accordingly. Since there are up to 16 entries in a vector,
>> it means that 256 byte-wide comparisons must be performed.
>> Complete unrolling in space requires 256 byte-wide comparators
>> (or 16 128-bit-wide, if you prefer), which is a considerable
>> piece of silicon. Complete unrolling in time requires just
>> one 128-bit comparator and 16 passes, which is SLOW.
>
> Forget about that, it simply cannot be the case.
>
>>> What it does look like is an opcode that can only ever be used as part
>>> of library code, or in the form of specific canned operations that are
>>> output from a compiler.
>>
>> But if an instruction is slow, then it is useless (provided that
>> there is a faster replacement).
>> I am pretty sure that a carefully
>> tuned variant of pcmpeqb + ptest will give you much faster
>> strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM?
>> No, thanks... :-)
>
> Assuming it works as I (now) expect, it should have latency comparable
> with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.

One way to implement the "equal any" mode would be to use 256 byte
compares (another would be to implement the loop from the pseudo-code).
However, you could also build a 256-entry, 1-bit-wide lookup table from
one xmm register, and index into it with each byte of the other xmm
register.

Interesting instruction. I bet the range mode could also help with
audio/video compression.

Also note that these instructions do not have an alignment restriction.

- Tim
From: Terje Mathisen on 15 Aug 2007 06:27

John Mashey wrote:
> On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...(a)hda.hydro.com>
> wrote:
>> Piotr Wyderski wrote:
>>> Terje Mathisen wrote:
>>>> They don't need anything like 256 comparators though:
>
> BUT, the very FIRST thing to do is to profile a wide range of
> programs, see how much time is consumed by str* functions, and decide
> whether this is even worth arguing about [it might, or might not be;
> the last time I did this was long ago, but at the time, bcopy/memcpy
> was the only high-runner.]

This is of course the reason only REP MOVS has been targeted until now.

> The second thing, for code one owns, is to take a look and make sure
> there aren't silly things like for (i = 0; i <= strlen(s); i++)
> {stuff}, since many compilers won't move the strlen(s) out of the
> loop.

A bad programmer can write bad code in any language; the loop above
doesn't even come close. What about re-initializing an entire 3D
workspace, with all objects/lists/surfaces etc., between every frame?
Even a "RealityEngine N+1" would have at least some problems at that
point. :-(

> Needless to say, profiling Dhrystone doesn't count :-)

No, there was no need to tell me that.

Anyway, with the extreme flexibility of those new opcodes, and assuming
they do run in "CMP reg,reg" time, it more or less becomes a
limited-functionality sub-processor.

Parallel state machines?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on 15 Aug 2007 06:44

Tim McCaffrey wrote:
> In article <gs29p4-k7n.ln1(a)osl016lin.hda.hydro.com>,
> terje.mathisen(a)hda.hydro.com says...
>> Piotr Wyderski wrote:
>>> Terje Mathisen wrote:
>>>
>>>> They don't need anything like 256 comparators though:
>>> How is that possible? The documentation explicitly says
>>> that every SIMD vector A item is compared with every
>>> SIMD vector B item and then the results are merged
>> I _really_ don't think that's what it means!
>>
>> In fact, I'm 99% positive this instruction does NOT entail any form of
>> barrel shift/full crossbar to actually compare all bytes against all
>> bytes.
>>
>> What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x],
>> a[x] < b[x]) parallel byte compares, with the immediate mask
>> determining how these intermediate results should be used/combined.
>
> The Intel manual really does say something different.
>
> http://softwarecommunity.intel.com/UserFiles/en-us/File/Intel%20SSE4%20Programming%20Reference-v1.002.pdf

That was the manual I was reading, but not the initial part (2.3.1),
where it does state "up to 256 compares".

WOW! What can I say? Mea culpa. :-(

This is such a break with all the CPU architecture design I've
witnessed over the last 20+ years; this is even more of an
"everything_including_the_kitchen_sink" opcode.

It must use a really big hunk of transistors: it is a true coprocessor,
tightly integrated, with implicit registers/flags for input/output.

The way they've remapped all the flag register fields to allow a single
PCMP?STR* instruction to return all sorts of information at once is
quite amazing. It really looks like an attempt to make 8 and 16-bit
searching _very_ fast. I.e. this looks like FAST's original idea, where
my old "Algorithms and Data Structures" class worked for years to have
students construct and implement a 16-wide parallel search engine.

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Tim McCaffrey on 15 Aug 2007 15:44

In article <622bp4-3o4.ln1(a)osl016lin.hda.hydro.com>,
terje.mathisen(a)hda.hydro.com says...
>
> This is such a break with all the CPU architecture design I've
> witnessed over the last 20+ years; this is even more of an
> "everything_including_the_kitchen_sink" opcode.
>
> It must use a really big hunk of transistors: it is a true coprocessor,
> tightly integrated, with implicit registers/flags for input/output.
>
> The way they've remapped all the flag register fields to allow a single
> PCMP?STR* instruction to return all sorts of information at once is
> quite amazing. It really looks like an attempt to make 8 and 16-bit
> searching _very_ fast. I.e. this looks like FAST's original idea, where
> my old "Algorithms and Data Structures" class worked for years to have
> students construct and implement a 16-wide parallel search engine.

I can't put my finger on why I feel this way, but I'm willing to make a
(small) bet that this is what Mr. Glew has been working on for the last
few years. It just seems like his "style".

I notice it breaks a lot of the PPro rules as well: it uses a GPR, one
XMM register, one (unrestricted-alignment) memory access (or another
XMM register), and produces both a GPR and a flags result, all in one
instruction. It must cause all kinds of havoc synchronizing the
execution pipes.

- Tim
From: Terje Mathisen on 16 Aug 2007 02:18
Tim McCaffrey wrote:
> I can't put my finger on why I feel this way, but I'm willing to make a
> (small) bet that this is what Mr. Glew has been working on for the last
> few years.

I'd take that (very small) bet: I'll buy you a beer the next (first?)
time we meet if you're right. (If you're not, forget about it, I don't
like beer. :-)

> It just seems like his "style".

Andy, if this is indeed you, then I'd like (at some point in time) to
get the full story about what important workload could conceivably get
sufficient speedup from this.

> I notice it breaks a lot of the PPro rules as well: it uses a GPR, one
> XMM register,

Which is why I don't entirely like to think about Andy being behind it.

> one (unrestricted-alignment) memory access (or another XMM register),
> and produces both a GPR and a flags result, all in one instruction. It
> must cause all kinds of havoc synchronizing the execution pipes.

Right. This cannot really be a part of the usual pipes at all; it seems
like a huge hunk of semi-independent hardware, but with a private pipe
in/out of the register array.

The main hiccup wouldn't be the need to read potentially unaligned
input data, but the way it has to return information to three different
locations, all implied: XMM0 for the SIMD data, ECX for the index
found, and the flags for all sorts of branchable indications.

OTOH, we've always had a few special-case instructions with extra
outputs: MUL and DIV are the most obvious, returning two integer
registers plus flags. Writing to an SSE register in parallel with a
normal integer return might actually be easier than writing to two
integer registers at once?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"