From: Anton Ertl on 7 Nov 2009 12:17

We recently got a Zotac IONATX A board with a 1600MHz Atom N330 CPU, which supports SMT (or "hyperthreading" in Intel's marketingspeak). We tested it using our LaTeX benchmark <http://www.complang.tuwien.ac.at/anton/latex-bench/>. It runs in 2.3s-2.4s (in 32-bit mode), about the same speed as a 900MHz Athlon, a little faster than a 1066MHz PPC 7447A, and about 5 times slower than a 3GHz Core 2 Duo.

Then we tested the performance when other processes were running. With 4 hardware threads (2 cores with two threads each), we ran three processes doing "yes >/dev/null" and one process running our LaTeX benchmark. The results varied, but we saw user times of 5.5s and 6s for the LaTeX benchmark.

For comparison, we turned off hyperthreading in the BIOS and ran the same setup again (i.e., 3 yes processes and one latex process). This time we saw 2.3s-2.4s user time and 4.7s real time for the latex benchmark.

So, at least for this benchmark setup, hyperthreading is a significant loss on the Atom.

- anton
--
M. Anton Ertl                     Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Bernd Paysan on 7 Nov 2009 16:17

Anton Ertl wrote:
> So, at least for this benchmark setup, hyperthreading is a significant
> loss on the Atom.

Probably not a real surprise. The Atom is in-order, and SMT probably helps when you have many cache misses. Cache misses in the LaTeX benchmark should be rare.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Anton Ertl on 8 Nov 2009 13:29

Bernd Paysan <bernd.paysan(a)gmx.de> writes:
>Anton Ertl wrote:
>> So, at least for this benchmark setup, hyperthreading is a significant
>> loss on the Atom.
>
>Probably not a real surprise. The Atom is in-order, and SMT probably
>helps when you have many cache misses. Cache misses in the LaTeX
>benchmark should be rare.

They certainly are on the machines (IIRC Athlon 64 and Pentium 4) where I measured cache misses. Ideally SMT would also help when the functional units are not completely utilized even with loads hitting the D-cache (which is probably quite frequent on an in-order machine), but I don't know if that's the case for the Atom.

In any case, no speedup from SMT is one thing, but a significant slowdown is pretty disappointing. Unless you know that you run lots of code that benefits from SMT, it's probably better to disable SMT on the Atom.

- anton
--
M. Anton Ertl                     Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: nmm1 on 8 Nov 2009 13:57

In article <2009Nov8.192936(a)mips.complang.tuwien.ac.at>,
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>Bernd Paysan <bernd.paysan(a)gmx.de> writes:
>>Anton Ertl wrote:
>>> So, at least for this benchmark setup, hyperthreading is a significant
>>> loss on the Atom.
>>
>>Probably not a real surprise. The Atom is in-order, and SMT probably
>>helps when you have many cache misses. Cache misses in the LaTeX
>>benchmark should be rare.
>
>They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
>where I measured cache misses. Ideally SMT would also help when the
>functional units are not completely utilized even with loads hitting
>the D-cache (which is probably quite frequent on an in-order machine),
>but I don't know if that's the case for the Atom.
>
>In any case, no speedup from SMT is one thing, but a significant
>slowdown is pretty disappointing. Unless you know that you run lots
>of code that benefits from SMT, it's probably better to disable SMT on
>the Atom.

And not just on the Atom. I ran some tests on the Core i7, and got a degradation of throughput by using more threads. In my limited experience, that applies to virtually anything where the bottleneck is memory accesses. There MAY be some programs where SMT helps with cache misses, but I haven't seen them.

Where I think that it helps is with heterogeneous process mixtures; e.g. one is heavy on floating-point, another on memory accesses, and another on branching. I could be wrong, as that's based on as much guesswork as knowledge, but it matches what I know.

Regards,
Nick Maclaren.
From: "Andy "Krazy" Glew" on 8 Nov 2009 15:18
nmm1(a)cam.ac.uk wrote:
> In article <2009Nov8.192936(a)mips.complang.tuwien.ac.at>,
> Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>> Bernd Paysan <bernd.paysan(a)gmx.de> writes:
>>> Anton Ertl wrote:
>>>> So, at least for this benchmark setup, hyperthreading is a significant
>>>> loss on the Atom.
>>> Probably not a real surprise. The Atom is in-order, and SMT probably
>>> helps when you have many cache misses. Cache misses in the LaTeX
>>> benchmark should be rare.
>> They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
>> where I measured cache misses. Ideally SMT would also help when the
>> functional units are not completely utilized even with loads hitting
>> the D-cache (which is probably quite frequent on an in-order machine),
>> but I don't know if that's the case for the Atom.
>>
>> In any case, no speedup from SMT is one thing, but a significant
>> slowdown is pretty disappointing. Unless you know that you run lots
>> of code that benefits from SMT, it's probably better to disable SMT on
>> the Atom.
>
> And not just on the Atom. I ran some tests on the Core i7, and got
> a degradation of throughput by using more threads. My limited
> experience is that applies to virtually anything where the bottleneck
> is memory accesses. There MAY be some programs where SMT helps with
> cache misses, but I haven't seen them.
>
> Where I think that it helps is with heterogeneous process mixtures;
> e.g. one is heavy on floating-point, another on memory accesses, and
> another on branching. I could be wrong, as that's based on as much
> guesswork as knowledge, but it matches what I know.

This is interesting.

What Nick says about heterogeneous workloads is certainly true - e.g. a compute intensive non-cache missing thread to switch to when a memory intensive thread cache misses. (Or, rather, one that is always running, and which keeps running when the memory intensive thread cache misses.)
However, in theory two memory intensive threads should be able to coexist - each computing when the other is idle. E.g. two cache missing pointer chasing threads should be able to practically double throughput.

(I've usually been on the other side of this argument, since as comp.arch knows I am the leading exponent of single threaded MLP architectures. My opponents in industry would usually say "Can't you just get MLP from TLP?" and I would have to say "Yes, but...".)

That so many people find threading a lossage for memory intensive workloads (and it is not just these comp.arch posters - most people in the supercomputer community disable hyperthreading) implies

a) workloads that are already highly MLP, e.g. throughput limited workloads

b) lousy threading microarchitectures. Which is typical - so many Intel processors arbitrarily split the instruction window in half, giving half to the compute intensive thread which does not need the window, and only half to the cache missing thread which could use more.

c) contention between threads - e.g. thrashing useful D$ state out of the caches. It's ironic - take a long latency L3 cache miss to DRAM, and the chances of more such misses are increased, because the other threads, which may only be taking L1 misses to L2, are thrashing your state out of the caches. Positive feedback.