From: Hector Santos on 22 Mar 2010 14:30

Joseph M. Newcomer wrote:

>> (1) People in a more specialized group are coming to the
>> same conclusions that I have derived.
> ****
> How? I have no idea how to predict L3 cache performance on an i7 system, and I don't
> believe they do, either. No theoretical model exists that is going to predict actual
> behavior, short of a detailed simulation, and I talked to Intel and they are not releasing
> performance statistics, period, so there is no way short of running the experiment to
> obtain a meaningful result.
> ****

Have you seen the posted C/C++ simulator and proof that shows how using multiple threads and shared data trumps his single-main-thread process theory?

>> (2) When a process requires essentially random (mostly
>> unpredictable) access to far more memory than can possibly
>> fit into the largest cache, then actual memory access time
>> becomes a much more significant factor in determining actual
>> response time.
> ****
> What is your cache collision ratio, actually? Do you really understand the L3 cache
> replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
> surprised you have this information, which Intel considers Corporate Confidential)
> ****

Well, the thing is, joe, this chip cache is something he will be using. His application will be using the cache the OS maintains. He is thinking about stuff that he shouldn't be worrying about. He thinks his CODE deals directly with the chip caches.

--
HLS
From: Joseph M. Newcomer on 22 Mar 2010 14:32

See below...
On Mon, 22 Mar 2010 10:31:17 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
>news:%23Q4$1KdyKHA.404(a)TK2MSFTNGP02.phx.gbl...
>> Joseph M. Newcomer wrote:
>>
>>
>>> Note also if you use a memory-mapped file and two
>>> processes share the same mapping object
>>> there is only one copy of the data in memory! This has
>>> not previously come up in
>>> discussions, but could be critical to your performance of
>>> multiple processes.
>>> joe
>>
>>
>> He has been told that MMF can help him.
>>
>> --
>> HLS
>
>Since my process (currently) requires unpredictable access
>to far more memory than can fit into the largest cache, I
>see no possible way that adding 1000-fold slower disk access
>could possibly speed things up. This seems absurd to me.

****
He has NO CLUE as to what a "memory-mapped file" actually is. This last comment indicates
total and complete cluelessness, plus a startling inability to understand that we are
making USEFUL suggestions because WE KNOW what is going on and he has no idea.

Like you, I'm giving up. There is only so long you can beat someone over the head with
good ideas which they reject because they have no idea what you are talking about, but
won't expend any energy to learn about, or ask questions about.

Since he doesn't understand what shared sections are, or what they buy, and that a MMF is
the way to get shared sections, I'm dropping out of this discussion. He has found a set of
"experts" who agree with him (your example apparently doesn't convey the problem
correctly), thinks memory-mapped files limit access to disk speed (not even understanding
they are FASTER than ReadFile!), and has failed utterly to understand even the most basic
concepts of an operating system. He treats it like an automatic transmission, something
you can use without knowing or caring how it works, when what he is really doing is trying
to build a competition racing machine while saying "all that stuff about the engine is
irrelevant". Anyone who does competition racing (like my next-door neighbor did for years)
knows why all this stuff is critical. If he were a racer, and we told him about
power-shifting (shifting a manual transmission without involving the clutch), he'd tell us
he didn't need to understand that.

Sad, really.
joe
****

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
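[A minimal sketch of the shared-section mechanism Joe describes, assuming a page-file-backed section; the object name and size are illustrative, not from anyone's actual code. Two processes that run this same code end up with one physical copy of the data, each mapped into its own address space:]

    // Sketch: two processes sharing one copy of data through a named
    // file mapping backed by the page file. Name and size are hypothetical.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const DWORD size = 64 * 1024;   // 64 KB for illustration

        // The first process to run this creates the section; later callers
        // open the existing one (GetLastError() == ERROR_ALREADY_EXISTS).
        // INVALID_HANDLE_VALUE means "backed by the page file", not a disk file.
        HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                         PAGE_READWRITE, 0, size,
                                         "Local\\SharedSectionDemo");
        if (!hMap) return 1;

        // Each process gets its own virtual address, but the physical
        // pages (and anything the OS has cached) are shared.
        char* view = (char*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size);
        if (!view) { CloseHandle(hMap); return 1; }

        printf("first byte: %d\n", view[0]);
        view[0] = 42;                   // immediately visible to the other process

        UnmapViewOfFile(view);
        CloseHandle(hMap);
        return 0;
    }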
From: Hector Santos on 22 Mar 2010 14:57

Joseph M. Newcomer wrote:

>>>
>>> He has been told that MMF can help him.
>>>
>>> --
>>> HLS
>> Since my process (currently) requires unpredictable access
>> to far more memory than can fit into the largest cache, I
>> see no possible way that adding 1000-fold slower disk access
>> could possibly speed things up. This seems absurd to me.
> ****
> He has NO CLUE as to what a "memory-mapped file" actually is. This last comment indicates
> total and complete cluelessness, plus a startling inability to understand that we are
> making USEFUL suggestions because WE KNOW what is going on and he has no idea.

What he doesn't realize is that his 4GB loading is already virtualized. He believes that all of that is in pure RAM. The page faults prove that point, but he doesn't understand what that means. He doesn't realize that his PC is technically a VIRTUAL MACHINE! He doesn't understand the INTEL memory segmentation framework. Maybe he thinks it's DOS? That is why I said if he wants PURE RAM operations, he might be better off with a 16-bit DPMI DOS program, or moving over to a MOTOROLA chip that offers a linear memory model - if that is still true today.

> Like you, I'm giving up.

There are two parts:

First, I'm actually exploring scaling methods with the simulator I wrote for him. I have a version where I am exploring NUMA that will leverage 2003+ Windows technology. I am going to pencil in getting a test computer with an Intel XEON that offers NUMA.

Second, I want to get some good will out of this if I can convince this guy that he needs to change his application to perform better, or at least understand that his old memory-usage paradigm for processes does not apply under Windows.

The only reason I can suspect for his ignorance is that he is not a programmer, or at the very least has a very primitive level of programming knowledge. A real Windows programmer would understand these basic principles, or at least explore what the experts are saying. He is not even exploring anything!

> I'm dropping out of this discussion.

I should too.

--
HLS
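[Since Hector pencils in NUMA experiments, here is a minimal sketch of node-aware allocation on Windows; note GetNumaHighestNodeNumber dates from Server 2003, while VirtualAllocExNuma used here requires Vista/Server 2008 or later. The size and node choice are illustrative assumptions, not anything from his simulator:]

    // Sketch: query NUMA topology and place memory on a specific node.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        ULONG highestNode = 0;
        if (!GetNumaHighestNodeNumber(&highestNode)) return 1;
        printf("highest NUMA node: %lu\n", highestNode);

        // Ask for 1 MB preferentially placed on node 0 (hypothetical choice).
        SIZE_T size = 1 << 20;
        void* p = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     0 /* preferred node */);
        if (!p) return 1;

        // Touch the memory so the pages are actually faulted in on that node.
        ZeroMemory(p, size);

        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }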
From: Hector Santos on 22 Mar 2010 14:58

Hector Santos wrote:

>
> Well, the thing is, joe, this chip cache is something he will be
> using.

I meant "is NOT something..."

--
HLS
From: Peter Olcott on 22 Mar 2010 15:31
Perhaps you did not understand what I said. The essential process inherently requires unpredictable access to memory, such that spatial or temporal locality of reference in the cache rarely occurs.

"Hector Santos" <sant9442(a)gmail.com> wrote in message
news:e2aedb82-c9ad-44b3-8513-defe82cd876c(a)c16g2000yqd.googlegroups.com...
On Mar 22, 11:02 am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> (2) When a process requires essentially random (mostly
> unpredictable) access to far more memory than can possibly
> fit into the largest cache, then actual memory access time
> becomes a much more significant factor in determining
> actual
> response time.

As a follow up, in the simulator ProcessData() function:

    void ProcessData()
    {
        KIND num;
        for (DWORD r = 0; r < nRepeat; r++) {
            Sleep(1);
            for (DWORD i = 0; i < size; i++) {
                //num = data[i];      // array
                num = fmdata[i];      // file mapping array view
            }
        }
    }

This is serialized access to the data. It's not random. When you have multiple threads, you approach an empirical boundary condition where multiple accessors request the same memory. On the one hand (the Peter viewpoint) you have contention issues, hence slowdowns. On the other hand, you have a CACHING effect, where the reading done by one thread benefits all the others.

Now, we can alter this ProcessData() by adding random-access logic:

    void ProcessData()
    {
        KIND num;
        for (DWORD r = 0; r < nRepeat; r++) {
            Sleep(1);
            for (DWORD i = 0; i < size; i++) {
                // note: rand() returns at most RAND_MAX (32767 in MSVC),
                // so j only samples the low indexes of the array
                DWORD j = (rand() % size);
                //num = data[j];      // array
                num = fmdata[j];      // file mapping array view
            }
        }
    }

One would expect higher pressure to move virtual memory into the process working set in random fashion. But in reality, that randomness may not apply as much pressure as you expect. Let's test this randomness.

First, a test with serialized access, two threads, and a 1.5GB file map:

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2
- size         : 3000000
- memory       : 1536000000 (1500000K)
- repeat       : 2
- Memory Load  : 22%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 743 msecs.
- Resuming thread# 1 in 868 msecs.
* Wait For Thread Completion
- Memory Load: 95%
* Done
---------------------------------------
0 | Time: 5734 | Elapsed: 0
1 | Time: 4906 | Elapsed: 0
---------------------------------------
Total Time: 10640

Notice the MEMORY LOAD climbed to 95%; that's because the entire spectrum of the data was read in.

Now let's try unpredictable random access. I added a /j switch to enable the random indexing.

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2 /j
- size         : 3000000
- memory       : 1536000000 (1500000K)
- repeat       : 2
- Memory Load  : 22%
- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 in 116 msecs.
- Resuming thread# 1 in 522 msecs.
* Wait For Thread Completion
- Memory Load: 23%
* Done
---------------------------------------
0 | Time: 4250 | Elapsed: 0
1 | Time: 4078 | Elapsed: 0
---------------------------------------
Total Time: 8328

BEHOLD, it is even faster because of the randomness. The memory load didn't climb because the entire 1.5GB never had to be faulted into the process working set.

So once again, your engineering (and lack thereof) philosophy is completely off base. You are under-utilizing the power of your machine.

--
HLS
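[For completeness, a hypothetical sketch of how the fmdata view the simulator reads might be set up; the file path, the KIND typedef, and the globals are assumptions for illustration, not Hector's actual code. It also shows why the random-access run stays near 23% memory load: a mapping reads nothing up front, and pages enter the working set only when touched:]

    // Hypothetical setup for the fmdata array read by ProcessData() above.
    #include <windows.h>

    typedef DWORD KIND;        // stand-in for the simulator's element type

    static KIND* fmdata = NULL;
    static DWORD size   = 0;   // number of KIND elements in the mapped file

    BOOL OpenDataMap(const char* path)
    {
        HANDLE hFile = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                   OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return FALSE;

        size = GetFileSize(hFile, NULL) / sizeof(KIND);

        // Read-only mapping of the whole file. No data is read here; pages
        // are faulted into the working set on demand as fmdata[] is touched.
        HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!hMap) { CloseHandle(hFile); return FALSE; }

        fmdata = (KIND*)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
        return fmdata != NULL;
    }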