From: Peter Olcott on 26 Mar 2010 11:29

I want to understand why I am not getting the memory speed that I am
expecting. I wrote a very memory-bandwidth-intensive C++ program, and it
reports that the memory speed I am getting is about 121 MB per second.

I ran MemTest86 and it showed:

    Intel Core i5-750, 2.67 GHz (quad core)
    32 KB L1     88,893 MB/sec
    256 KB L2    37,560 MB/sec
    8 MB L3      26,145 MB/sec
    8.0 GB RAM   11,852 MB/sec

Why is my memory-intensive process getting only such a tiny fraction of
the 11,852 MB/sec memory bandwidth?
From: Paul on 26 Mar 2010 12:48

Peter Olcott wrote:
> I wrote a very memory-bandwidth-intensive C++ program, and it reports
> that the memory speed I am getting is about 121 MB per second.
> [snip MemTest86 figures]
> Why is my memory-intensive process getting only such a tiny fraction
> of the 11,852 MB/sec memory bandwidth?

According to the "CPUID" tab here, the Core i5 has a cache line size of
64 bytes.

http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5%20I5-750%20BV80605001911AP%20(BX80605I5750%20-%20BXC80605I5750).html

As I understand it, the processor deals in cache-line-sized transactions.
Your memory is dual channel: each DIMM is 8 bytes wide, so two DIMMs are
16 bytes wide. A burst memory transfer would need to be 4 cycles' worth
to populate a cache line, and additional time is needed to prepare for
the next data burst from memory.

If we take 11,852 MB/sec divided by 64 bytes, that represents the number
of burst transfers we could do in one second: about 185 million per
second.

Now, imagine we do random accesses of one byte over the entire memory
space. This is a "cache busting" pattern. None of the data caches will be
effective, because each attempt to access a single byte evicts the least
recently used cache line and fills it with a cache-line-sized chunk from
main memory.

So you're doing a little worse than the "cache busting" rate, and for
that I don't have an explanation.

Both Intel and AMD should be able to provide you with architecture bibles
and programmer optimization hints. These can make a great difference to
program performance if followed.

http://www.intel.com/products/processor/manuals/

PDF page 249, Section 5.7.2 "Increasing bandwidth of Memory Fills":

http://www.intel.com/Assets/PDF/manual/248966.pdf

But if the things you do are inherently cache busters, such as event
simulation for digital simulators or the like, there may not be much of
anything you can do.

If you were writing your own version of Photoshop, you'd see instances of
the 11,852 MB/sec bandwidth when bringing in portions of a picture for
processing. And if the nearest neighbours needed for some
picture-processing algorithm are in cache, you'll see great processing
speeds there as well. Locality of access is important to performance. In
some cases you have to rewrite or reorder how you do things with memory
for best performance. If you're "cache busting", the processor can't fix
that for you.

Just a guess,
Paul
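A minimal sketch of the effect Paul describes, contrasting a sequential
pass with a dependent random walk over the same array. This is not the
program from the thread; the array size, the LCG fill, and the clock()
timing are illustrative assumptions.

    // Sequential access uses every byte of each 64-byte line and lets the
    // hardware prefetcher run ahead; the dependent random walk pays for a
    // full line fill to consume just 4 useful bytes per step.
    #include <cstdio>
    #include <ctime>
    #include <vector>

    int main()
    {
        const unsigned int size = 100 * 1000 * 1000;   // ~400 MB of uint32
        std::vector<unsigned int> data(size);

        unsigned int seed = 12345;                     // simple LCG fill,
        for (unsigned int i = 0; i < size; ++i) {      // values in [0, size)
            seed = seed * 1664525u + 1013904223u;
            data[i] = seed % size;
        }

        clock_t t0 = clock();
        unsigned long long sum = 0;
        for (unsigned int i = 0; i < size; ++i)        // sequential pass
            sum += data[i];
        double seq = double(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        unsigned int num = 0;
        for (unsigned int i = 0; i < size; ++i)        // random walk: each
            num = data[num];                           // load depends on the
        double rnd = double(clock() - t0) / CLOCKS_PER_SEC;  // previous one

        printf("sequential %.2f s, random %.2f s (sum=%llu, num=%u)\n",
               seq, rnd, sum, num);
        return 0;
    }

On the thread's own numbers, 11,852 MB/sec divided by 64 bytes per line
fill gives roughly 185 million dependent loads per second at best, i.e.
on the order of 5.4 ns per step of the random walk.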
From: Peter Olcott on 26 Mar 2010 13:41

"Paul" <nospam(a)needed.com> wrote in message
news:hoiog0$1jp$1(a)news.eternal-september.org...
> [snip explanation of cache lines and the "cache busting" access pattern]
>
> But if the things you do are inherently cache busters, such as event
> simulation for digital simulators or the like, there may not be much
> of anything you can do.

Except one thing: understand why I am getting only 2/3 of the
cache-busting rate instead of 100% of it. Your explanation was very
helpful; before it, I thought that I was getting only 1% of the
cache-busting rate.

I am accessing the data randomly across all of allocated memory, in
32-bit integer chunks. Allocated memory is between 400 MB and 1.5 GB.

    for (uint32 N = 0; N < Max; N++)
        num = Data[num];    // finite state machine

Data is initialized with:

    for (uint32 N = 0; N < Data.size(); N++)
        Data[N] = (rand() * rand()) % size;

> [snip Photoshop example and remarks on locality of access]
>
> Just a guess,
> Paul
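A self-contained reconstruction of these fragments might look as follows.
The timing harness and the MB/sec accounting are assumptions (the thread
does not show how the 121 MB figure was computed), and the rand()*rand()
product is widened to 64 bits here, since on platforms where RAND_MAX is
large the raw product can overflow a signed int.

    // Reconstruction only: fills the array with pseudo-random indices and
    // times the dependent random walk, then reports one plausible MB/sec
    // figure counting 4 useful bytes per access.
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>
    typedef unsigned int uint32;

    int main()
    {
        const uint32 size = 100 * 1000 * 1000;    // ~400 MB of uint32
        const uint32 Max  = size;                 // steps in the walk
        std::vector<uint32> Data(size);

        for (uint32 N = 0; N < Data.size(); N++)  // widen before multiplying
            Data[N] = (uint32)(((unsigned long long)rand() * rand()) % size);

        clock_t t0 = clock();
        uint32 num = 0;
        for (uint32 N = 0; N < Max; N++)
            num = Data[num];                      // finite state machine
        double secs = double(clock() - t0) / CLOCKS_PER_SEC;

        // Assumed accounting: 4 useful bytes per access, MB = 10^6 bytes.
        double mb_per_sec = (double)Max * sizeof(uint32) / (secs * 1e6);
        printf("%.2f s, %.1f MB/sec (num=%u)\n", secs, mb_per_sec, num);
        return 0;
    }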
From: Paul on 28 Mar 2010 07:45

Peter Olcott wrote:
> [snip earlier discussion of cache lines and the "cache busting" rate]
>
> I am accessing the data randomly across all of allocated memory, in
> 32-bit integer chunks. Allocated memory is between 400 MB and 1.5 GB.
>
>     for (uint32 N = 0; N < Max; N++)
>         num = Data[num];    // finite state machine
>
> Data is initialized with:
>
>     for (uint32 N = 0; N < Data.size(); N++)
>         Data[N] = (rand() * rand()) % size;

I took a look at the first source you released. I gave it a try in
Knoppix (a Linux LiveCD distro). That distro is a 32-bit version.

What I find bizarre about the behavior of the code is that when the size
is 100,000,000 32-bit numbers, the time to do the "random walk" is:

    20.05 seconds
    20.03 seconds
    20.05 seconds
    20.04 seconds

The random walk completes in precisely the same time, every time. Now, if
I switch to 200,000,000 32-bit numbers and recompile, I get:

    63.23 seconds (should be 40.1 seconds or so)
    65.29 seconds
    63.70 seconds
    61.97 seconds

The variance is larger, and as you've already noted, the larger array
runs slower. Doubling the array should have taken twice the time, or
about 40.1 seconds.

My hypothesis is that this is related to the TLB and page tables. The
kernel does seem to have some kind of option to allow reservation of huge
pages (but as far as I know, they're not really that big):

http://lwn.net/Articles/375098/

My processor has up to 32 TLB entries of 4 MB size, which is still not
enough to prevent TLB misses. I expect the page tables are backed by the
L1 and L2 caches. The x86 uses hardware page-table walks, and my guess
would be that if a TLB miss and its associated table walk hit in the L1
or L2 cache, that might prevent a total disaster. So while 32 entries
doesn't sound like a lot, maybe the TLB still gets the benefit of being
refilled from one of the caches, rather than by a more expensive table
walk in main memory.

I wanted to experiment some more, but I usually restrict myself to LiveCD
environments; I'd have to install some version of Linux if I wanted to
rebuild the kernel and add features. I tried to download a 64-bit version
of Ubuntu (on the chance that the wider range of TLB page sizes might do
some good), but when I went to use Synaptic Package Manager to install
g++ to compile the code, the repository wasn't wired up correctly. So
that's all the testing I've managed to complete so far - just on a 32-bit
OS, and not with any added kernel features.

So whatever slows it down, there is some variance in the execution time
as a result. The variance with the smaller array is amazingly small.

Paul
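A sketch of the huge-page idea Paul raises, assuming a Linux kernel new
enough to support MAP_HUGETLB (2.6.32 or later) and huge pages reserved
in advance, e.g. via /proc/sys/vm/nr_hugepages. None of this is from the
thread itself.

    // Backs the working array with huge pages so that one TLB entry covers
    // 2 MB or 4 MB of data instead of 4 KB. The mmap fails with ENOMEM
    // unless enough huge pages were reserved beforehand, for example:
    //     echo 512 > /proc/sys/vm/nr_hugepages
    #include <cstdio>
    #include <sys/mman.h>

    int main()
    {
        // Length should be a multiple of the huge page size.
        size_t bytes = 800UL * 1024 * 1024;   // room for 200M uint32s

        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");      // likely: no pages reserved
            return 1;
        }

        unsigned int *Data = (unsigned int *)p;
        // ... fill Data and run the random walk as in the earlier posts ...
        (void)Data;

        munmap(p, bytes);
        return 0;
    }

The arithmetic behind the hypothesis: a 400 MB array spans about 102,400
pages at 4 KB, so a TLB with a few dozen to a few hundred entries thrashes
on a random walk, while the same array needs only 100 entries at 4 MB per
page.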