From: Bengt Larsson on 21 Dec 2009 05:01

"nedbrek" <nedbrek(a)yahoo.com> wrote:
>Hello all,
>
>"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
>news:a9kqi5tp99eana4uoc2r9d0l998gpuu21g(a)4ax.com...
>> Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>>
>>>I have an Atom, and I tested with a parallel make (of an editor,
>>>mg2a, in C). With all the files in memory, the make takes 14.4
>>>seconds. With make -j (make -j 3 or 4 seems the most efficient) it
>>>takes 10.7 seconds. That is an improvement of 30-35 percent.
>>
>> Actually that is a bit stupid, since it improves beyond 2 threads.
>> With two threads, I get 11.3 seconds, an improvement of 27%.
>
>I usually do a "make -j N", where N = cores * 1.5 or 2. Compiling often
>gets stuck on disk (even if the source is in memory, and the final
>output is in memory [ramdisk?], are all the temporary outputs in
>memory? What about statically linked libs?).

Well, everything is cached in memory, but there are no other special
arrangements. I should have something less disk-intensive, but I don't
have anything handy.
From: nedbrek on 21 Dec 2009 10:06

Hello all,

"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
news:kghui59h6ea26452kbhc8b69rpo3tm4rab(a)4ax.com...
> "nedbrek" <nedbrek(a)yahoo.com> wrote:
>
>>"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
>>news:a9kqi5tp99eana4uoc2r9d0l998gpuu21g(a)4ax.com...
>>> Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>>>
>>>>I have an Atom, and I tested with a parallel make (of an editor,
>>>>mg2a, in C). With all the files in memory, the make takes 14.4
>>>>seconds. With make -j (make -j 3 or 4 seems the most efficient) it
>>>>takes 10.7 seconds. That is an improvement of 30-35 percent.
>>>
>>> Actually that is a bit stupid, since it improves beyond 2 threads.
>>> With two threads, I get 11.3 seconds, an improvement of 27%.
>>
>>I usually do a "make -j N", where N = cores * 1.5 or 2. Compiling often
>>gets stuck on disk (even if the source is in memory, and the final
>>output is in memory [ramdisk?], are all the temporary outputs in
>>memory? What about statically linked libs?).
>
> Well, everything is cached in memory, but there are no other special
> arrangements. I should have something less disk-intensive, but I don't
> have anything handy.

I think that is a pretty good test. Parallel make is one of the few
"real life" type of benchmarks that people actually use. I was just
trying to explain why you'd get more speedup with more than 2 threads.

Ned
From: Bengt Larsson on 21 Dec 2009 12:53

"nedbrek" <nedbrek(a)yahoo.com> wrote:
>I think that is a pretty good test. Parallel make is one of the few
>"real life" type of benchmarks that people actually use. I was just
>trying to explain why you'd get more speedup with more than 2 threads.

Exactly. It's easy to make micro-benchmarks. I already made some, so I
can publish.

First: this is an Acer Aspire One, N270 Atom 1600 MHz, running Cygwin
under Windows XP. The compiler is gcc 4.3.2, with options -march=native
-mfpmath=sse -O2.

A simple multiply-add:

    int i;
    double sum;

    for (i = 0; i < limit; i++) {
        sum = sum*0.5 + 10.0;
    }

Single thread: 318 MFlops
Two threads: 2*314 = 628 MFlops
Improvement in throughput from two threads: 97%

----

Unrolled:

    for (i = 0; i < limit; i++) {
        sum1 = sum1*0.5 + 10.0;
        sum2 = sum2*0.5 + 10.0;
        sum3 = sum3*0.5 + 10.0;
        sum4 = sum4*0.5 + 10.0;
    }

Single thread: 976 MFlops
Two threads: 2*726 = 1452 MFlops
Improvement: 49%

----

Unrolled some more (to fill the SSE registers):

    for (i = 0; i < limit; i++) {
        sum1 = sum1*0.5 + 10.0;
        sum2 = sum2*0.5 + 10.0;
        sum3 = sum3*0.5 + 10.0;
        sum4 = sum4*0.5 + 10.0;
        sum5 = sum5*0.5 + 10.0;
        sum6 = sum6*0.5 + 10.0;
    }

Single thread: 1118 MFlops
Two threads: 2*793 = 1586 MFlops
Improvement: 42%

With two threads, this is quite close to 1600 MFlops, which would be
the maximum. The Atom can issue a double-precision floating-point
multiply only every two cycles. The adds either dual-issue or issue in
between.

----

Redo the last benchmark in single precision:

Single thread: 1888 MFlops
Two threads: 2*1052 = 2104 MFlops
Improvement: 11.4%

The Atom can issue a single-precision FP multiply every cycle, so that
limit goes away. This achieves more than 1 Flop/cycle in a single
thread. With two threads, it's 1.3 Flops/cycle.

----

Conclusion: if you use SSE, unless the code is extremely well
scheduled, you gain quite a lot from the second thread.
From: Bengt Larsson on 21 Dec 2009 13:28

And some more, classic FP math too:

Scalar SSE FP math:
     318   2*314    simple loop
     976   2*726    unrolled by 4
    1118   2*793    unrolled by 6
    1888   2*1052   unrolled by 6, single precision

Classic FP:
     318   2*314    simple loop
     309   2*277    unrolled by 4
     268   2*224    unrolled by 6
     268   2*224    unrolled by 6, single precision

Classic FP math doesn't like unrolling. I assume this is especially bad
on an in-order processor.
From: nmm1 on 21 Dec 2009 13:39

In article <f0bvi5lpsqpp9s7otcivtga3sqn43tknbu(a)4ax.com>,
Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>
>Conclusion: if you use SSE, unless the code is extremely well
>scheduled, you gain quite a lot from the second thread.

Sorry, but no. Your testing is fine, but that conclusion does not
follow. Even micro-benchmarks should bear some relationship to what
real code does. The days when testing the floating-point performance
alone indicated anything useful are long gone. Only the older of us
now remember when Whetstones were a useful comparison of relative
performance ....

Experience with most forms of threading, especially SMT, is that
whether it helps or not depends on memory accesses and not actual
calculation.

Regards,
Nick Maclaren.