From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 29 Jul 2010 03:34

Paul A. Clayton wrote:
> On Jul 28, 5:32 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> [snip]
>> The wide memory bus is invariably faster, especially with a small
>> number of DIMMs.
>
> Wouldn't having twice as many potentially active DRAM banks (two
> independent channels vs. two DIMM channels merged to a single
> addressed channel) be a significant benefit for many multithreaded
> and some single-threaded applications where bank conflicts might be
> more common (especially with a "small number of DIMMs")?

It would, unless even the combined channel is still no wider than a
cache line: when a single channel can deliver 64 bits, a combined
channel 128 bits, and a cache line (at 512+ bits) is the smallest unit
of transfer, then you _want_ wider channels.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
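Terje's arithmetic here is easy to check. A minimal C sketch, assuming
the 64-byte line and the 64-bit vs. 128-bit channel widths named above;
the beat time is an illustrative placeholder, not a real DDR timing:

#include <stdio.h>

int main(void)
{
    const int line_bits = 512;       /* one 64-byte cache line */
    const double beat_ns = 1.25;     /* placeholder beat time, not a DDR spec */
    const int widths[] = { 64, 128 };    /* single vs. merged channel */

    for (int i = 0; i < 2; i++) {
        int beats = line_bits / widths[i];  /* beats to move one full line */
        printf("%3d-bit channel: %d beats, line complete after %5.2f ns\n",
               widths[i], beats, beats * beat_ns);
    }
    return 0;
}

Halving the beat count is exactly the win the merged channel buys once
the cache line is the minimum transfer unit.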
From: Benny Amorsen on 29 Jul 2010 07:21

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> Yes, but if you want to take a chance and skip the trailing checksum
> test, in order to forward packets as soon as you have the header, then
> you would have even more severe timing restrictions, right?

There are several layers of checksums at play here. If we stick to IP,
only the header has a checksum, and for IPv6 even that has been
removed. So there isn't really a chance to take, because you have the
checksum before you start receiving the payload (and the payload isn't
protected). There is a whole-packet checksum at the ethernet level (if
the physical layer happens to be ethernet, of course).

Switches used to do cut-through switching pretty much universally,
until gigabit switches arrived. Almost all gigabit switches are
store-and-forward, but somehow latency was rediscovered in 10 Gbps
switches, so quite a few of those are cut-through.

Unfortunately "cut-through routing" refers to something entirely
different from "cut-through switching". I haven't been able to find any
products claiming to do anything but store-and-forward routing.

/Benny
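For reference, the header-only checksum Benny describes is the standard
RFC 1071 ones'-complement sum; because it covers just the IPv4 header,
a cut-through device can verify it before any payload bytes arrive. A
minimal sketch:

#include <stdint.h>
#include <stddef.h>

/* Ones'-complement checksum (RFC 1071) over an IPv4 header.
   It covers only the header (whose length is a multiple of 4 bytes),
   so it can be verified before the payload has been received. */
uint16_t ip_header_checksum(const void *hdr, size_t len)
{
    const uint16_t *p = hdr;
    uint32_t sum = 0;

    for (size_t i = 0; i < len / 2; i++)
        sum += p[i];        /* 16-bit words, checksum field set to 0 */
    while (sum >> 16)       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}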
From: MitchAlsup on 29 Jul 2010 08:45

On Jul 28, 7:40 pm, "Paul A. Clayton" <paaronclay...(a)embarqmail.com> wrote:
> On Jul 28, 5:32 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> [snip]
>> The wide memory bus is invariably faster, especially with a small
>> number of DIMMs.
>
> Wouldn't having twice as many potentially active DRAM banks (two
> independent channels vs. two DIMM channels merged to a single
> addressed channel) be a significant benefit for many multithreaded
> and some single-threaded applications where bank conflicts might be
> more common (especially with a "small number of DIMMs")?

This was extensively simulated, and to our surprise: the first data
beat arrives at the same point in time in both the wide and the narrow
arrangements, but the last data beat arrives a lot later in the narrow
arrangement. And it is the last data beat that governs the sending of
the line through the crossbar.

This is not inherently necessary, but it is how the crossbar in Opteron
works. Once the memory controller starts sending a line, it cannot
switch to another line and use the bandwidth available in the fabric
router. So getting the whole line out into the fabric is the key, and
that is why the dual DIMM bus does not work as well as one would
expect.

It is perfectly reasonable to build a fabric where this property is not
a limiting factor, and then actually get better performance with more
banks.

Mitch
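Mitch's first-beat/last-beat observation can be put in numbers. The
sketch below assumes a fixed latency to the first beat and a
placeholder beat time; these are illustrative figures, not actual
Opteron timings:

#include <stdio.h>

int main(void)
{
    const double t_first = 50.0;   /* ns to the first beat: same either way */
    const double beat_ns = 1.25;   /* placeholder beat time */
    const int line_bits = 512;

    const int widths[] = { 128, 64 };  /* wide (merged) vs. narrow channel */
    for (int i = 0; i < 2; i++) {
        int beats = line_bits / widths[i];
        /* the line cannot enter the crossbar until its last beat lands */
        printf("%3d-bit: first beat %.1f ns, last beat %.1f ns\n",
               widths[i], t_first, t_first + (beats - 1) * beat_ns);
    }
    return 0;
}

The first-beat time is identical, but the narrow arrangement holds the
line (and hence the crossbar path) roughly twice as long.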
From: Thomas Womack on 30 Jul 2010 13:47

In article <5fb1774d-6056-4564-a6c8-4c9919a50cd7(a)j8g2000yqd.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
> People who need to get the physics right know how to do it:
>
> http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf

Are http://code.google.com/p/p3dfft/ and
http://www.sdsc.edu/us/resources/p3dfft/docs/TG08_DNS.pdf relevant in
this case? Large 3D FFTs decomposing over lots of processors into
(N/x)*(N/y)*N bricks: just over 100 seconds for 8192^3 on 2^15 CPUs
(2048 quad-quad-Opterons) at TACC. The TACC machine 'Ranger' is a load
of racks plus a monolithic 3456-port 40 Gbit Infiniband switch from
Sun, so it doesn't look that dissimilar to a national-labs machine.

I suppose the question is what counts as awful - it's 5% of peak, but
(figure 2 of TG08_DNS) it's 5% of peak from 2^9 to 2^15 CPUs, which
it's not ludicrous to call scalable.

Tom
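For concreteness, the (N/x)*(N/y)*N "pencil" decomposition Tom
describes gives each of the x*y ranks a brick like the one computed
below. This is a hypothetical sketch of the arithmetic only; P3DFFT's
actual interface differs, and the 256x128 process grid is an assumed
split of the 2^15 CPUs:

#include <stdio.h>

int main(void)
{
    /* 2D "pencil" decomposition: an N^3 grid over a px * py process
       grid; each rank owns an (N/px) x (N/py) x N brick, so the 1D
       FFTs along the third axis need no communication */
    const long N = 8192, px = 256, py = 128;    /* px*py = 2^15 ranks */
    long nx = N / px, ny = N / py;
    double gib = (double)nx * ny * N * 8.0 / (1 << 30); /* SP complex = 8 B */

    printf("brick: %ld x %ld x %ld, %.3f GiB per rank\n", nx, ny, N, gib);
    return 0;
}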
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 31 Jul 2010 02:22

Thomas Womack wrote:
> In article <5fb1774d-6056-4564-a6c8-4c9919a50cd7(a)j8g2000yqd.googlegroups.com>,
> Robert Myers <rbmyersusa(a)gmail.com> wrote:
>> People who need to get the physics right know how to do it:
>>
>> http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf
>
> Are http://code.google.com/p/p3dfft/ and
> http://www.sdsc.edu/us/resources/p3dfft/docs/TG08_DNS.pdf relevant in
> this case? Large 3D FFTs decomposing over lots of processors into
> (N/x)*(N/y)*N bricks: just over 100 seconds for 8192^3 on 2^15 CPUs
> (2048 quad-quad-Opterons) at TACC. The TACC machine 'Ranger' is a
> load of racks plus a monolithic 3456-port 40 Gbit Infiniband switch
> from Sun, so it doesn't look that dissimilar to a national-labs
> machine.
>
> I suppose the question is what counts as awful - it's 5% of peak, but
> (figure 2 of TG08_DNS) it's 5% of peak from 2^9 to 2^15 CPUs, which
> it's not ludicrous to call scalable.

No, not at all.

To me, the most interesting part of that paper was the way they had to
tune their 2D decomposition to the actual core layout, i.e. with 16
cores/node the most efficient setup was with 4xN "pencils", almost
certainly because those 16 cores come from 4 4-core CPUs.

The other highly significant piece of information was that they got
away with single-precision numbers! Using DP instead would double the
memory and communication sizes, and it would reduce FP throughput by an
order of magnitude on something like a Cell or most GPUs. OTOH, even
that would still be fast enough to keep up with the communication
network, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
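Terje's precision point is easy to quantify. A minimal sketch, assuming
the paper's 8192^3 complex grid:

#include <stdio.h>

int main(void)
{
    const double n = 8192.0;
    double elems = n * n * n;            /* complex grid points */
    double sp_tb = elems *  8.0 / 1e12;  /* 4+4 bytes per complex, in TB */
    double dp_tb = elems * 16.0 / 1e12;  /* 8+8 bytes per complex, in TB */

    /* each all-to-all transpose in the 3D FFT moves the whole array,
       so doubling the element size doubles the communication volume */
    printf("grid: %.1f TB in SP, %.1f TB in DP\n", sp_tb, dp_tb);
    return 0;
}

That is roughly 4.4 TB in single precision versus 8.8 TB in double, per
transpose, which is why getting away with SP matters so much here.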