From: Peter Olcott on 21 Mar 2010 14:19

I have an application that uses enormous amounts of RAM in a very memory-bandwidth-intensive way. I recently upgraded my hardware to a machine with 600% faster RAM and 32-fold more L3 cache. This L3 cache is also twice as fast as the prior machine's cache. When I benchmarked my application across the two machines, I gained an 800% improvement in wall clock time. The new machine's CPU is only 11% faster than the prior machine's. Both processes were tested on a single CPU.

I am thinking that all of the above would tend to show that my process is very memory-bandwidth intensive, and thus could not benefit from multiple threads on the same machine because the bottleneck is memory bandwidth rather than CPU cycles. Is this analysis correct?
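[A crude way to test the claim that a process is memory-bandwidth bound is to time a sequential sweep over a buffer much larger than the L3 cache and compute the effective bandwidth. A minimal sketch, assuming the Windows high-resolution timer APIs; the 1 GB buffer size is illustrative and should be shrunk if the allocation fails:]

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
        const SIZE_T n = (1u << 30) / sizeof(DWORD);   /* 1 GB of DWORDs */
        DWORD *data = (DWORD*)malloc(n * sizeof(DWORD));
        if (!data) return 1;                           /* shrink n if this fails */
        for (SIZE_T i = 0; i < n; i++)
            data[i] = (DWORD)i;                        /* touch every page once */

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);

        volatile DWORD sink = 0;                       /* defeat dead-code elimination */
        for (SIZE_T i = 0; i < n; i++)
            sink += data[i];                           /* sequential read sweep */

        QueryPerformanceCounter(&t1);
        double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        printf("%.1f MB/s effective read bandwidth\n",
               (double)(n * sizeof(DWORD)) / secs / (1024.0 * 1024.0));
        free(data);
        return 0;
    }

[If the reported figure approaches the machine's rated memory bandwidth, a workload with a similar access pattern is memory-bound rather than CPU-bound.]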
From: Hector Santos on 21 Mar 2010 15:41

Geez, and here I was hoping you would get your "second opinion" from a more appropriate forum like:

   microsoft.public.win32.programmer.kernel

or one of the performance forums.

Peter Olcott wrote:

> I have an application that uses enormous amounts of RAM in a
> very memory-bandwidth-intensive way.

How do you do this?

How much memory is the process loading?

Show code that shows how intensive this is. Is it blocking memory access?

> I recently upgraded my hardware to a machine with 600% faster RAM
> and 32-fold more L3 cache. This L3 cache is also twice as fast as
> the prior machine's cache.

What kind of CPU? Intel, AMD?

If Intel, what kind of INTEL chips are you using?

> When I benchmarked my application across the two machines, I gained
> an 800% improvement in wall clock time. The new machine's CPU is
> only 11% faster than the prior machine's. Both processes were tested
> on a single CPU.

Does this make sense to anyone? Two physical machines?

> I am thinking that all of the above would tend to show that my
> process is very memory-bandwidth intensive, and thus could not
> benefit from multiple threads on the same machine because the
> bottleneck is memory bandwidth rather than CPU cycles. Is this
> analysis correct?

No. But if you believe your application has reached its optimal design point and cannot be improved any further, then you probably wasted money on upgrading your machine, which will provide you no scalability benefits.

At best, it will allow you to do your email and web browsing and multi-task to other things while your application is chugging along at 100%.

--
HLS
From: Joseph M. Newcomer on 21 Mar 2010 16:25

Note that in the i7 architecture the L3 cache is shared across all CPUs, so you are less likely to be hit by raw memory bandwidth (which, compared to a CPU, is dead-slow), and so the answer as to whether multiple threads will work effectively can only be determined by measurement of a multithreaded app. Because your logic seems to indicate that raw memory speed is the limiting factor, and you have not accounted for the effects of a shared L3 cache, any opinion you offer on what is going to happen is meaningless. In fact, any opinion about performance is by definition meaningless; only actual measurements represent facts ("If you can't express it in numbers, it ain't science, it's opinion" -- Robert A. Heinlein).

More below...

On Sun, 21 Mar 2010 13:19:34 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>I have an application that uses enormous amounts of RAM in a
>very memory-bandwidth-intensive way. I recently upgraded my
>hardware to a machine with 600% faster RAM and 32-fold more
>L3 cache. This L3 cache is also twice as fast as the prior
>machine's cache. When I benchmarked my application across the
>two machines, I gained an 800% improvement in wall clock
>time. The new machine's CPU is only 11% faster than the prior
>machine's. Both processes were tested on a single CPU.

****
The question is whether you are measuring multiple threads in a single executable image across multiple cores, or multiple executable images on a single core. Not sure how you know that both processes were tested on a single CPU, since you don't mention how you accomplished this (there are several techniques, but it is important to know which one you used, since each has its own implications for predicting the overall behavior of a system).
****

>I am thinking that all of the above would tend to show that
>my process is very memory-bandwidth intensive, and thus
>could not benefit from multiple threads on the same machine
>because the bottleneck is memory bandwidth rather than CPU
>cycles. Is this analysis correct?

****
Nonsense! You have no idea what is going on here! The shared L3 cache could completely wipe out the memory performance issue, reducing your problem to a cache-performance issue. Since you have not conducted the experiment in multiple threading, you have no data to indicate one way or the other what is going on, and it is the particular memory access patterns of YOUR app that matter; therefore, nobody can offer a meaningful estimate based on your L1/L2/L3 cache accesses, whatever they may be.
joe
****

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
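[One of the several techniques Joe alludes to for forcing a test onto a single CPU is a process affinity mask. A minimal sketch, assuming Windows; the mask value 0x1 (CPU 0 only) is illustrative:]

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        /* Restrict every thread of this process to CPU 0 so that a
           "single CPU" benchmark really does run on a single CPU. */
        if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1))
        {
            printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }
        /* ... run the benchmark here ... */
        return 0;
    }

[Alternatives include SetThreadAffinityMask for one thread at a time, or the "start /affinity 1" command-line switch on Vista and later; each constrains scheduling differently, which is why it matters which one was used.]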
From: Joseph M. Newcomer on 21 Mar 2010 16:38

More below...

On Sun, 21 Mar 2010 15:41:39 -0400, Hector Santos <sant9442(a)nospam.gmail.com> wrote:

>Geez, and here I was hoping you would get your "second opinion" from a
>more appropriate forum like:
>
> microsoft.public.win32.programmer.kernel
>
>or one of the performance forums.
>
>Peter Olcott wrote:
>
>> I have an application that uses enormous amounts of RAM in a
>> very memory-bandwidth-intensive way.
>
>How do you do this?
>
>How much memory is the process loading?
>
>Show code that shows how intensive this is. Is it blocking memory access?
>
>> I recently upgraded my hardware to a machine with 600% faster RAM
>> and 32-fold more L3 cache. This L3 cache is also twice as fast as
>> the prior machine's cache.
>
>What kind of CPU? Intel, AMD?

****
Actually, he said it is an i7 architecture some hundreds of messages ago....
****

>If Intel, what kind of INTEL chips are you using?
>
>> When I benchmarked my application across the two machines, I gained
>> an 800% improvement in wall clock time. The new machine's CPU is
>> only 11% faster than the prior

****
Based on what metric? Certainly I hope you are not using clock speed, which is known to be irrelevant to performance. Did you look at the size of the i-pipe microinstruction cache on the two architectures? Did you look at the amount of concurrency in the execution engine (CPUs since 1991 have NOT executed instructions sequentially; they just maintain the illusion that they are)? What about the new branch predictor in the i7 architecture? CPU clock time is only comparable within a chipset family. It bears no relationship to another chipset family, particularly an older model, since most of the improvements come in the instruction and data pipelines, cache management (why do you think there is now an L3 cache in the i7s?) and other microaspects of the architecture. And if you used a "benchmark" program to ascertain this nominal 11% improvement, do you know what instruction sequence was being executed when it made the measurement? Probably not, but it turns out that's the level that matters. So how did you arrive at this magical number of 11%?

Note also that raw memory speed doesn't matter too much on real problems; cache management is the killer of performance, and the wrong sequence of address accesses will thrash your cache; and if you are modifying data it hurts even worse (a cache line has to be written back before it can be reused). Caching read-only pages works well, and if you mark your data pages as "read only" after reading them in, you can improve performance. But you are quoting performance numbers here without giving any explanation of why you think they matter.
joe
****

>> machine's. Both processes were tested on a single CPU.
>
>Does this make sense to anyone? Two physical machines?
>
>> I am thinking that all of the above would tend to show that my
>> process is very memory-bandwidth intensive, and thus could not
>> benefit from multiple threads on the same machine because the
>> bottleneck is memory bandwidth rather than CPU cycles. Is this
>> analysis correct?

****
Precisely because the bottleneck appears to be memory performance, and precisely because you have an L3 cache shared across all the chips, you are offering meaningless opinion here. The ONLY way to figure out what is going to happen is to try real experiments! And measure what they do.
No amount of guesswork is going to tell you anything relevant, and you are guessing when it is clear you have NO IDEA what the implications of the i7 technology are. They are NOT just "faster memory" or an "11% faster CPU" (whatever THAT means!). I downloaded the Intel docs and read them while I was working on my new multithreading course, and the i7 is more than a clock speed and a memory speed.
joe
****

>No. But if you believe your application has reached its optimal design
>point and cannot be improved any further, then you probably wasted
>money on upgrading your machine, which will provide you no scalability
>benefits.
>
>At best, it will allow you to do your email and web browsing and
>multi-task to other things while your application is chugging along at
>100%.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
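[A minimal sketch of the read-only-pages idea Joe raises above: fill a buffer once, then use VirtualProtect to mark it PAGE_READONLY before the processing passes. The buffer size is illustrative, and whether this actually helps would itself have to be measured:]

    #include <windows.h>

    int main()
    {
        SIZE_T bytes = 1u << 28;  /* 256 MB, illustrative */
        DWORD *data = (DWORD*)VirtualAlloc(NULL, bytes,
                                           MEM_RESERVE | MEM_COMMIT,
                                           PAGE_READWRITE);
        if (!data) return 1;

        /* load phase: write every element exactly once */
        for (SIZE_T i = 0; i < bytes / sizeof(DWORD); i++)
            data[i] = (DWORD)i;

        /* from here on the pages are read-only; a stray write faults
           instead of dirtying a cache line that must be written back */
        DWORD oldProtect;
        if (!VirtualProtect(data, bytes, PAGE_READONLY, &oldProtect))
            return 1;

        /* ... read-only processing passes go here ... */

        VirtualFree(data, 0, MEM_RELEASE);
        return 0;
    }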
From: Hector Santos on 21 Mar 2010 20:07
Peter Olcott wrote:

> I have an application that uses enormous amounts of RAM in a
> very memory-bandwidth-intensive way. I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machine's cache. When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock
> time. The new machine's CPU is only 11% faster than the prior
> machine's. Both processes were tested on a single CPU.
>
> I am thinking that all of the above would tend to show that
> my process is very memory-bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

As stated numerous times, your thinking is wrong. I don't fault you, because you don't have the experience here, but you should not be ignoring what EXPERTS are telling you - especially if you have never written multi-threaded applications.

The attached C/C++ simulation (testpeter2t.cpp) illustrates how your single-main-thread process, with its HUGE redundant memory access requirement, is not optimized for a multi-core/processor machine or for any kind of scalability and performance efficiency.

Compile the attached application. TestPeter2T.CPP will allow you to test:

   Test #1 - a single main-thread process
   Test #2 - a multi-threaded (2) process

To run the single-thread process, just run the EXE with no switches. Here is TEST #1:

   V:\wc5beta> testpeter2t
   - size   : 357913941
   - memory : 1431655764 (1398101K)
   - repeat : 10
   ---------------------------------------
   Time: 12297 | Elapsed: 0 | Len: 0
   ---------------------------------------
   Total Client Time: 12297

The source code is set to allocate a DWORD array with a total memory block of 1.4 GB. I have a 2 GB XP Dual Core Intel box, so the single thread should use 50% CPU.

Now, this single-process test provides the natural-quantum scenario with a ProcessData() function:

   void ProcessData()
   {
       KIND num;
       for (int r = 0; r < repeat; r++)
           for (DWORD i = 0; i < size; i++)
               num = data[i];
   }

By "natural quantum" I mean there are NO "man-made" interrupts, sleeps or yields; the OS will preempt this as naturally as it can, every quantum.

If you run TWO single-process instances like so:

   start testpeter2T
   start testpeter2T

then on my machine BOTH processes are seriously degraded because of the HUGE virtual memory and paging requirements. The page faults were really HIGH and it just never completed; I didn't wish to wait, because it TOO obviously was not optimized for multiple instances. The memory load requirement was too high here.

Now comes test #2 with threads. Run the EXE with the /t switch and it will start TWO threads. Here are the results:

   - size   : 357913941
   - memory : 1431655764 (1398101K)
   - repeat : 10
   * Starting threads
   - Creating thread 0
   - Creating thread 1
   * Resuming threads
   - Resuming thread# 0 [000007DC] in 41 msecs.
   - Resuming thread# 1 [000007F4] in 467 msecs.
   * Wait For Thread Completion
   * Done
   ---------------------------------------
   0 | Time: 13500 | Elapsed: 0 | Len: 0
   1 | Time: 13016 | Elapsed: 0 | Len: 0
   ---------------------------------------
   Total Time: 26516

BEHOLD!! Scalability using a SHARED MEMORY ACCESS threaded design.
I am going to recompile the code for 4 threads by changing:

   #define NUM_THREADS 4   // # of threads

Let's try it:

   V:\wc5beta>testpeter2t /t
   - size   : 357913941
   - memory : 1431655764 (1398101K)
   - repeat : 10
   * Starting threads
   - Creating thread 0
   - Creating thread 1
   - Creating thread 2
   - Creating thread 3
   * Resuming threads
   - Resuming thread# 0 [000007DC] in 41 msecs.
   - Resuming thread# 1 [000007F4] in 467 msecs.
   - Resuming thread# 2 [000007D8] in 334 msecs.
   - Resuming thread# 3 [000007D4] in 500 msecs.
   * Wait For Thread Completion
   * Done
   ---------------------------------------
   0 | Time: 26078 | Elapsed: 0 | Len: 0
   1 | Time: 25250 | Elapsed: 0 | Len: 0
   2 | Time: 25250 | Elapsed: 0 | Len: 0
   3 | Time: 24906 | Elapsed: 0 | Len: 0
   ---------------------------------------
   Total Time: 101484

So the summary so far, per thread:

   1 thread  - ~12 seconds
   2 threads - ~13 seconds each
   4 threads - ~25 seconds each

This is where you begin to look at various designs to improve things. There are many ideas, but it requires a look at your actual work load. We didn't use a MEMORY-MAPPED FILE, and that MIGHT help. I should try that, but first let's try a 3-thread run:

   #define NUM_THREADS 3   // # of threads

Recompile and run testpeter2t /t:

   - size   : 357913941
   - memory : 1431655764 (1398101K)
   - repeat : 10
   * Starting threads
   - Creating thread 0
   - Creating thread 1
   - Creating thread 2
   * Resuming threads
   - Resuming thread# 0 [000007DC] in 41 msecs.
   - Resuming thread# 1 [000007F4] in 467 msecs.
   - Resuming thread# 2 [000007D8] in 334 msecs.
   * Wait For Thread Completion
   * Done
   ---------------------------------------
   0 | Time: 19453 | Elapsed: 0 | Len: 0
   1 | Time: 13890 | Elapsed: 0 | Len: 0
   2 | Time: 18688 | Elapsed: 0 | Len: 0
   ---------------------------------------
   Total Time: 52031

How interesting to see one thread get a near-best-case result! You could normalize all this and probably come up with a formula to estimate what the performance will be with N requests. But this is where WORKER POOLS and IOCP come into play, and if you are using NUMA, the Windows NUMA API will help there too!

All in all, Peter, this proves how multiple threads using shared memory are FAR superior to your misconceived idea that your application cannot be redesigned for a multi-core/processor machine. I am willing to bet this simulator is far more stressful than your own DFA/OCR application in its work load: ProcessData() here is doing NO WORK at all except accessing memory. You will not be doing that, so the ODDS are very high that you will run much more efficiently than this simulator.

I want to hear you say "Oh My!" <g>

--
HLS
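[The testpeter2t.cpp attachment is not reproduced in this archive. A minimal reconstruction of the harness Hector describes - threads created suspended, resumed with staggered delays, all sweeping one shared array - might look like the following; the names, delays, and timing calls are assumptions, not the original source:]

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NUM_THREADS 2                    /* # of threads */
    typedef DWORD KIND;

    static const DWORD size   = 357913941;   /* ~1.4 GB of DWORDs, as posted */
    static const int   repeat = 10;
    static KIND *data = NULL;                /* shared array, read-only below */

    DWORD WINAPI ThreadProc(LPVOID arg)
    {
        DWORD t0 = GetTickCount();
        volatile KIND num;                   /* volatile defeats optimization */
        for (int r = 0; r < repeat; r++)
            for (DWORD i = 0; i < size; i++)
                num = data[i];               /* pure shared-memory reads */
        printf("%d | Time: %lu\n", (int)(INT_PTR)arg, GetTickCount() - t0);
        return 0;
    }

    int main()
    {
        /* a box with enough free RAM is assumed; shrink size otherwise */
        data = (KIND*)malloc((SIZE_T)size * sizeof(KIND));
        if (!data) return 1;
        memset(data, 0, (SIZE_T)size * sizeof(KIND));

        HANDLE h[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++)
            h[i] = CreateThread(NULL, 0, ThreadProc,
                                (LPVOID)(INT_PTR)i, CREATE_SUSPENDED, NULL);

        for (int i = 0; i < NUM_THREADS; i++)
        {
            Sleep(rand() % 500);             /* staggered resume, as posted */
            ResumeThread(h[i]);
        }

        WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
        for (int i = 0; i < NUM_THREADS; i++) CloseHandle(h[i]);
        free(data);
        return 0;
    }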
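[Hector notes above that a memory-mapped file MIGHT help the two-process case, since the OS would then keep one physical copy of the data instead of a separate 1.4 GB working set per process. A minimal sketch of a pagefile-backed shared section; the section name and size are illustrative:]

    #include <windows.h>

    int main()
    {
        const DWORD bytes = 100 * 1024 * 1024;  /* 100 MB, illustrative */

        /* the first process creates the named section and fills it... */
        HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, bytes,
                                        "PeterSharedData");
        if (!hMap) return 1;

        DWORD *view = (DWORD*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS,
                                            0, 0, bytes);
        if (!view) return 1;
        for (DWORD i = 0; i < bytes / sizeof(DWORD); i++)
            view[i] = i;

        /* ...other processes would call
           OpenFileMapping(FILE_MAP_READ, FALSE, "PeterSharedData")
           and map it read-only, sharing the same physical pages */

        UnmapViewOfFile(view);
        CloseHandle(hMap);
        return 0;
    }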