From: robertwessel2 on 8 Mar 2007 21:32
On Mar 8, 8:14 pm, "Guga" <Guga...(a)gmail.com> wrote:
> On Mar 8, 5:53 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com> wrote:
> > On Mar 8, 7:25 pm, "Guga" <Guga...(a)gmail.com> wrote:
> > > On Mar 8, 5:12 pm, "randyh...(a)earthlink.net" <randyh...(a)earthlink.net> wrote:
> > > > On Mar 8, 4:42 pm, "Guga" <Guga...(a)gmail.com> wrote:
> > > > > On Mar 8, 4:06 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com> wrote:
> > > > > > On Mar 8, 5:35 pm, "Guga" <Guga...(a)gmail.com> wrote:
> > > > > > > Hi guys
> > > > > > >
> > > > > > > i'm looking for some table or any documentation that contains the
> > > > > > > clock cycles (and instruction length) of the mnemonics related to
> > > > > > > Packed Data, like: ADDPD, ADDPS, CVTTPD2DQ etc.
> > > > > > >
> > > > > > > Does someone have a link containing that kind of information ?
> > > > > >
> > > > > > The "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
> > > > > > has some of that. Appendix C includes a lot of latency and throughput
> > > > > > information.
> > > > > >
> > > > > > http://www.intel.com/design/processor/manuals/248966.pdf
> > > > >
> > > > > Tks Robert, this seems to be what i was looking for.. It has the
> > > > > latency for different processors (Core Duo, Pentium M, etc.)
> > > > >
> > > > > one question.. this document contains a table of "THROUGHPUT"..
> > > > > but.. i'm unfamiliar with this word. What does "THROUGHPUT" mean in
> > > > > english ?
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Guga
> > > >
> > > > Okay, now I've read the document. I was pretty much correct. Here's
> > > > how Intel defined throughput:
> > > >
> > > > Throughput - The number of clock cycles required to wait before the
> > > > issue ports are free to accept the same instruction again.
> > > >
> > > > IOW, it's basically the number of instructions (of the same
> > > > instruction) that can execute per given time (though they've described
> > > > this as a period, it's still the same thing).
> > > > Cheers,
> > > > Randy Hyde
> > >
> > > Ok.. so, can we say that latency+throughput is the number of clock
> > > cycles that those instructions take to work ? Example, the document
> > > says this (pg 443):
> > >
> > > CVTTPD2DQ xmm, xmm latency = 10, throughput = 2 (for 0F3n CPUIds)
> > >
> > > So, the total amount of cycles this mnemonic takes is 12 ?
> >
> > No. It means that if you issue a CVTTPD2DQ with the appropriate
> > functional unit ready for more work, it'll finish in 10 clocks. You
> > can issue additional CVTTPD2DQs every two clocks (the throughput)
> > without stalling things, and the results will pop out the other end
> > every two clocks, but delayed (the latency) 10 clocks from when they
> > entered the functional unit. Obviously being able to pipeline five
> > CVTTPD2DQs requires that they have no dependencies which will cause
> > them to stall, and that nothing else in the instruction stream causes
> > any stalls or prevents the appropriate reorderings and whatnot.
> >
> > So you could issue one CVTTPD2DQ and it'll finish in 10 clocks. Or
> > you could issue 10 (assuming nothing stalls), and they'll finish in 28
> > clocks. Or if you stream 1000, they'll finish in 2008 clocks (closely
> > approaching the maximum possible throughput).
>
> Tks.. but i lost the math logic.. Why 28 ? How did you reach the
> conclusion that issuing 10 CVTTPD2DQ they will finish in 28 clocks ?

Assuming you're issuing them continuously (IOW, every two clocks), the
first CVTTPD2DQ will finish after the tenth clock (having been issued
on the first clock), the second after the 12th (issued on the third),
the third after the 14th clock (issued on the fifth), and (skipping
numbers four through nine) the tenth will finish after the 28th clock
(having been issued on the 19th clock).
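The issue/finish schedule described above can be tabulated with a short script. This is only a sketch under the thread's assumptions (latency 10, throughput 2, clocks counted from 1, nothing stalls); the function name `schedule` and its parameters are made up for illustration and are not taken from the Intel manual.

```python
# Toy timeline for back-to-back independent instructions on one unit.
# LATENCY and THROUGHPUT are the CVTTPD2DQ figures quoted in the thread;
# schedule() is a hypothetical helper, not part of any real tool.
LATENCY = 10     # clocks from issue until the result is available
THROUGHPUT = 2   # clocks between successive issues of the same instruction


def schedule(n, latency=LATENCY, throughput=THROUGHPUT):
    """Return a list of (issue_clock, finish_clock), counting clocks from 1."""
    rows = []
    for i in range(n):
        issue = 1 + i * throughput      # 1, 3, 5, ...
        finish = issue + latency - 1    # finishes "after" this clock
        rows.append((issue, finish))
    return rows


for k, (issue, finish) in enumerate(schedule(10), start=1):
    print(f"#{k:2d}: issued on clock {issue:2d}, finishes after clock {finish:2d}")
```

The first, second, third and tenth rows reproduce the (1, 10), (3, 12), (5, 14) and (19, 28) pairs given in the post.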
From: Guga on 8 Mar 2007 22:16
On Mar 8, 6:32 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com> wrote:
> Assuming you're issuing them continuously (IOW, every two clocks), the
> first CVTTPD2DQ will finish after the tenth clock (having been issued
> on the first clock), the second after the 12th (issued on the third),
> the third after the 14th clock (issued on the fifth), and (skipping
> numbers four through nine) the tenth will finish after the 28th clock
> (having been issued on the 19th clock).

Tks.. robert.. i think i got it..

Assuming i'm using them continuously, i made a simple formula that
shows the number of clock cycles those instructions take when used
that way (continuously)

Clocks = Latency+(Throughput*N-1)

N = Number of instructions used (all of the same type), like the 1000
example you gave.
Latency = latency of the mnemonic
Throughput = throughput of the mnemonic
Clocks = total number of clocks for the sequence of mnemonics used
continuously

Is that it ?

Best Regards,

Guga
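A quick check of this formula, read literally with the usual operator precedence, against the figures worked out earlier in the thread (latency 10, throughput 2). The function name `clocks_as_written` is invented here purely for the check.

```python
# Literal reading of "Clocks = Latency+(Throughput*N-1)".
# clocks_as_written() is a hypothetical name used only for this check.
def clocks_as_written(n, latency, throughput):
    return latency + (throughput * n - 1)


print(clocks_as_written(1, 10, 2))    # 11, but one instruction takes 10 clocks
print(clocks_as_written(10, 10, 2))   # 29, but the worked example gave 28
```

Read this way the result is one clock too high for these numbers, so the "-1" presumably belongs inside the parentheses with N.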
From: Guga on 8 Mar 2007 22:19
On Mar 8, 6:32 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com> wrote:
> Assuming you're issuing them continuously (IOW, every two clocks), the
> first CVTTPD2DQ will finish after the tenth clock (having been issued
> on the first clock), the second after the 12th (issued on the third),
> the third after the 14th clock (issued on the fifth), and (skipping
> numbers four through nine) the tenth will finish after the 28th clock
> (having been issued on the 19th clock).

Tks.. robert.. i think i got it..

Assuming i'm using them continuously, i made a simple formula that
shows the number of clock cycles those instructions take when used
that way (continuously)

Clocks = Latency+Throughput*(N-1)

N = Number of instructions used (all of the same type), like the 1000
example you gave.
Latency = latency of the mnemonic
Throughput = throughput of the mnemonic
Clocks = total number of clocks for the sequence of mnemonics used
continuously

Is that it ?

Best Regards,

Guga
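The formula in this post can be checked against the three figures given earlier in the thread (1, 10 and 1000 instructions at latency 10, throughput 2). A minimal sketch; `pipelined_clocks` is an illustrative name, and the model assumes identical, independent instructions issued back to back with no stalls.

```python
# Clocks = Latency + Throughput*(N-1), for N >= 1 independent instructions.
# pipelined_clocks() is a hypothetical helper written for this check.
def pipelined_clocks(n, latency, throughput):
    if n < 1:
        raise ValueError("n must be at least 1")
    return latency + throughput * (n - 1)


for n in (1, 10, 1000):
    print(n, pipelined_clocks(n, latency=10, throughput=2))
# prints: 1 10, then 10 28, then 1000 2008
```

For large N the latency term becomes negligible and the total approaches Throughput*N, which is the "closely approaching the maximum possible throughput" remark about the 1000-instruction case.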
From: Guga on 8 Mar 2007 22:38
On Mar 8, 4:18 pm, //\\\\o//\\\\annabee <Wanna...(a)wannabee.org> wrote:
> On Fri, 09 Mar 2007 00:35:47 +0100, Guga <Guga...(a)gmail.com> wrote:
> > Hi guys
> >
> > i'm looking for some table or any documentation that contains the
> > clock cycles (and instruction length) of the mnemonics related to
> > Packed Data, like: ADDPD, ADDPS, CVTTPD2DQ etc.
> >
> > Does someone have a link containing that kind of information ?
> >
> > Best Regards,
> >
> > guga
>
> Hi Guga.
>
> Why isnt this useful?
>
> < http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_do.... >

Tks wanabee... i downloaded it.. it doesn't contain the latency or the
length in bytes, but it is useful.

I also found this
http://cdrom.amd.com/21860/updates/Optimization_Guide_Help/wwhelp/wwhimpl/common/html/wwhelp.htm?context=WebWorksHelpOptGuide&file=WebWorksHelpOptGuide-15-09.html

Best Regards,

Guga
From: robertwessel2 on 8 Mar 2007 22:50
On Mar 8, 9:19 pm, "Guga" <Guga...(a)gmail.com> wrote:
> Tks.. robert.. i think i got it..
>
> Assuming i'm using them continuously, i made a simple formula that
> shows the number of clock cycles those instructions take when used
> that way (continuously)
>
> Clocks = Latency+Throughput*(N-1)
>
> N = Number of instructions used (all of the same type), like the 1000
> example you gave.
> Latency = latency of the mnemonic
> Throughput = throughput of the mnemonic
> Clocks = total number of clocks for the sequence of mnemonics used
> continuously
>
> Is that it ?

Basically yes. The complications are that the published latencies are
often (usually) for a functional unit, and not a particular
instruction. For example, two instructions might have the same latency
and throughput, but will only execute on the same functional unit, in
which case the two have to share the available throughput. Also,
dependencies are an issue: for example, the sequence "add eax,ebx /
add edx,eax" can't execute in a single cycle (even though there is
more than one integer FU) because the second instruction cannot
execute until the result of the first one is available. Then there are
other dependencies and resource limitations - for example a CPU might
not be able to issue more than three instructions at once, which may
limit the total possible throughput. Memory accesses are another
complication.

Obviously that's all quite implementation dependent. It's a rather
complex field, and the entire Intel manual I referenced is a good
resource, and is fairly interesting reading (if you like that sort of
thing).
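To see how much the dependency caveat matters, here is a toy model, not a description of any real CPU: a single functional unit, in-order issue, and a result usable on the clock after it finishes. The function `simulate` and its input format are assumptions made up for this sketch; the latency and throughput are the CVTTPD2DQ figures from earlier in the thread.

```python
# Toy single-unit, in-order model (hypothetical; real CPUs are far more
# complex, as the post says). Each entry in `deps` is either None
# (independent) or the index of an earlier instruction whose result
# this one needs.
def simulate(deps, latency, throughput):
    """Return the clock after which the last result is available."""
    if not deps:
        return 0
    finish = []
    next_issue = 1                      # earliest clock the unit is free
    for dep in deps:
        ready = 1 if dep is None else finish[dep] + 1
        issue = max(next_issue, ready)  # wait for the unit and the operand
        finish.append(issue + latency - 1)
        next_issue = issue + throughput
    return max(finish)


independent = [None] * 10               # ten unrelated instructions
chain = [None] + list(range(9))         # each one needs the previous result

print(simulate(independent, latency=10, throughput=2))  # 28
print(simulate(chain, latency=10, throughput=2))        # 100
```

In this model the independent stream matches the formula (28 clocks), while the fully dependent chain cannot overlap at all and costs N times the latency. Shared functional units, issue-width limits and memory accesses are left out entirely.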