From: Woody on 22 Mar 2010 03:49

On Mar 21, 11:19 am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way.

Unfortunately, you cannot think of a single "memory bandwidth" as being the limiting factor. The cache behavior is the biggest determinant of speed, because reading/writing is much faster to cache than to main memory (that's why cache is there).

To truly optimize an application, and to answer your question about more threads, you must consider the low-level details of memory usage, such as: What is the size of the cache lines? How is memory interleaved? What is the size of the translation look-aside buffer? How is cache shared among the cores? Is there one memory controller per processor (if you have multiple processors), or per core?

There are tools such as AMD CodeAnalyst (free) or Intel VTune ($$$) that measure these things. Once you know where the bottlenecks really are, you can go to work rearranging your code to keep all the computer's resources busy. You will need to run all the tests on your actual app, or something close to it, for meaningful results.

BTW, the same details that determine memory speed also make the comparison of CPU speed meaningless.
From: Peter Olcott on 22 Mar 2010 07:28

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:%23IrS0xYyKHA.2012(a)TK2MSFTNGP04.phx.gbl...

> Here is the result using a 1.5GB readonly memory mapped file. I started
> with 1 single process thread, then switched to 2 threads, then 4, 6, 8,
> 10 and 12 threads. Notice how the processing time for the earlier
> threads started high but decreased with the later threads. This was the
> caching effect of the readonly memory file. Also note the Global Memory
> Status *MEMORY LOAD* percentage. For my machine, it is at 19% at steady
> state, but as expected it shoots up when dealing with this large memory
> mapped file. I probably can fine-tune the map views better, but they are
> set as read only. Well, I'll leave the OP to figure out memory map
> coding for his patented DFA meta file process.
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> ---------------------------------------
> Time: 2984 | Elapsed: 0
> ---------------------------------------
> Total Client Time: 2984
>
> V:\wc5beta>testpeter3t /s:3000000 /t:2 /r:1
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> * Wait For Thread Completion
> - Memory Load: 96%
> * Done
> ---------------------------------------
> 0 | Time: 5407 | Elapsed: 0
> 1 | Time: 4938 | Elapsed: 0
> ---------------------------------------
> Total Time: 10345
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:4
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0 | Time: 6313 | Elapsed: 0
> 1 | Time: 5844 | Elapsed: 0
> 2 | Time: 5500 | Elapsed: 0
> 3 | Time: 5000 | Elapsed: 0
> ---------------------------------------
> Total Time: 22657
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:6
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0 | Time: 6359 | Elapsed: 0
> 1 | Time: 5891 | Elapsed: 0
> 2 | Time: 5547 | Elapsed: 0
> 3 | Time: 5047 | Elapsed: 0
> 4 | Time: 4875 | Elapsed: 0
> 5 | Time: 4141 | Elapsed: 0
> ---------------------------------------
> Total Time: 31860
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:8
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 16
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> - Creating thread 6
> - Creating thread 7
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> - Resuming thread# 6 in 478 msecs.
> - Resuming thread# 7 in 358 msecs.
> * Wait For Thread Completion
> - Memory Load: 96%
> * Done
> ---------------------------------------
> 0 | Time: 6203 | Elapsed: 0
> 1 | Time: 5734 | Elapsed: 0
> 2 | Time: 5391 | Elapsed: 0
> 3 | Time: 4891 | Elapsed: 0
> 4 | Time: 4719 | Elapsed: 0
> 5 | Time: 3984 | Elapsed: 0
> 6 | Time: 3500 | Elapsed: 0
> 7 | Time: 3125 | Elapsed: 0
> ---------------------------------------
> Total Time: 37547
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:10
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> - Creating thread 6
> - Creating thread 7
> - Creating thread 8
> - Creating thread 9
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> - Resuming thread# 6 in 478 msecs.
> - Resuming thread# 7 in 358 msecs.
> - Resuming thread# 8 in 962 msecs.
> - Resuming thread# 9 in 464 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0 | Time: 7234 | Elapsed: 0
> 1 | Time: 6766 | Elapsed: 0
> 2 | Time: 6422 | Elapsed: 0
> 3 | Time: 5922 | Elapsed: 0
> 4 | Time: 5750 | Elapsed: 0
> 5 | Time: 5016 | Elapsed: 0
> 6 | Time: 4531 | Elapsed: 0
> 7 | Time: 4125 | Elapsed: 0
> 8 | Time: 3203 | Elapsed: 0
> 9 | Time: 2703 | Elapsed: 0
> ---------------------------------------
> Total Time: 51672
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:12
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 16
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> - Creating thread 6
> - Creating thread 7
> - Creating thread 8
> - Creating thread 9
> - Creating thread 10
> - Creating thread 11
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> - Resuming thread# 6 in 478 msecs.
> - Resuming thread# 7 in 358 msecs.
> - Resuming thread# 8 in 962 msecs.
> - Resuming thread# 9 in 464 msecs.
> - Resuming thread# 10 in 705 msecs.
> - Resuming thread# 11 in 145 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0  | Time: 7984 | Elapsed: 0
> 1  | Time: 7515 | Elapsed: 0
> 2  | Time: 7188 | Elapsed: 0
> 3  | Time: 6672 | Elapsed: 0
> 4  | Time: 6500 | Elapsed: 0
> 5  | Time: 5781 | Elapsed: 0
> 6  | Time: 5250 | Elapsed: 0
> 7  | Time: 4953 | Elapsed: 0
> 8  | Time: 3953 | Elapsed: 0
> 9  | Time: 3484 | Elapsed: 0
> 10 | Time: 2750 | Elapsed: 0
> 11 | Time: 2547 | Elapsed: 0
> ---------------------------------------
> Total Time: 64577
>
> --
> HLS

OK, and where is the summary conclusion?
Also, by using a memory mapped file, your process would have entirely different behavior than mine.

I know that it is possible that you could have been right all along about this, and I could be wrong. I know this because of a term that I coined: [Ignorance Squared]. [Ignorance Squared] is the process by which a lack of understanding is perceived, by the one who lacks this understanding, as disagreement. Whereas the one who has understanding knows that the ignorant person lacks understanding, the ignorant person lacks this insight, and is thus ignorant even of his own ignorance; hence the term [Ignorance Squared].

Now that I have a way to empirically validate your theories against mine (a way that I dreamed up last night while sleeping), I will do this.
From: Peter Olcott on 22 Mar 2010 07:33

It is very hard to reply to messages with quoting turned off; please turn quoting on. Also, please tell me how quoting gets turned off.

When a process requires continual, essentially random access to data that is very much larger than the largest cache, then I think that memory bandwidth could be a limiting factor in performance.

"Woody" <ols6000(a)sbcglobal.net> wrote in message news:7ff25c57-b2a7-4b31-b3df-bebcf34ead80(a)d37g2000yqn.googlegroups.com...

> On Mar 21, 11:19 am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> > I have an application that uses enormous amounts of RAM in a
> > very memory bandwidth intensive way.
>
> Unfortunately, you cannot think of a single "memory bandwidth" as being
> the limiting factor. The cache behavior is the biggest determinant of
> speed, because reading/writing is much faster to cache than to main
> memory (that's why cache is there).
> [remainder of Woody's message, quoted in full above, snipped]
From: Joseph M. Newcomer on 22 Mar 2010 10:31

See below...

On Sun, 21 Mar 2010 21:06:20 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:vmvcq55tuhj1lunc6qcdi9uejup4jg1i4e(a)4ax.com...
>> Note in the i7 architecture the L3 cache is shared across all CPUs, so
>> you are less likely to be hit by raw memory bandwidth (which compared
>> to a CPU is dead-slow), and the answer as to whether multiple threads
>> will work effectively can only be determined by measurement of a
>> multithreaded app.
>>
>> Because your logic seems to indicate that raw memory speed is the
>> limiting factor, and you have not accounted for the effects of a
>> shared L3 cache, any opinion you offer on what is going to happen is
>> meaningless. In fact, any opinion about performance is by definition
>> meaningless; only actual measurements represent facts ("If you can't
>> express it in numbers, it ain't science, it's opinion" -- Robert A.
>> Heinlein)
>
>(1) Machine A performs process B in X minutes.
>(2) Machine C performs process B in X/8 minutes (800% faster).
>(3) The only difference between machine A and machine C is that machine
>C has much faster access to RAM (by whatever means).
>(4) Therefore, process B is memory bandwidth bound.

****
Fred can dig a ditch 10 feet long in 1 hour. Charlie can dig a ditch 10 feet long in 20 minutes. Therefore, Charlie is faster than Fred by a factor of 3.

How long does it take Fred and Charlie working together to dig a ditch 10 feet long? (Hint: any mathematical answer you come up with is wrong, because Fred and Charlie (a) hate each other, and so Charlie tosses his dirt into the place Fred has to dig, or (b) are good buddies and stop for a beer halfway through the digging, or (c) Charlie tells Fred he can do it faster by himself, and Fred just sits there while Charlie does all the work and finishes in 20 minutes, after which they go out for a beer. Fred buys.)

You have made an obvious error here in thinking that if one thread takes 1/k the time and the only difference is memory bandwidth, then two threads necessarily scale LINEARLY. Duh! IT IS NOT THE SAME WHEN CACHES ARE INVOLVED! YOU HAVE NO DATA! You are jumping to an unwarranted conclusion based on what, as best I can tell, is a coincidence. And even if it were true, caches give nonlinear effects, so you are not even making sense when you make these assertions! You have proven a case for value N, but you have immediately assumed that if you prove the case for N, you have proven it for case N+1, which is NOT how inductive proofs work! Since you were so hung up on geometric proofs, can you explain how, when doing an inductive proof, proving the case for 1 element tells you what the result is for N+1 for an arbitrary value N? Hell, it doesn't even tell you the result for N=1, but you have immediately assumed that it is a valid proof for all values of N!

YOU HAVE NO DATA! You are making a flawed assumption of linearity that has no basis!

Going back to your fixation on proof: in a nonlinear system without a closed-form analytic solution, demonstrate to me that your only possible solution is based on a linear assumption. You are ignoring all forms of reality here. You are asserting without basis that the system is linear (it is known that systems with caches are nonlinear in memory performance). So you are contradicting known reality without any evidence to support your "axiom". It ain't an axiom, it's a wild-assed guess.

Until you can demonstrate with actual measured performance that your system exhibits COMPLETELY linear behavior in an L3 cache system, there is no reason to listen to any of this nonsense you keep espousing as if it were "fact". You have ONE fact, and that is not enough to raise your hypothesis to the level of "axiom".

All you have proven is that a single thread is limited by memory bandwidth.
You have no reason to infer that two threads will not BOTH run faster because of the L3 cache effects. And you have ignored L1/L2 cache effects. You have a trivial example from which NOTHING can be inferred about multithreaded performance. You have consistently confused multiprocess programming with multithreading and arrived at erroneous conclusions based on flawed experiments.

Note also that if you use a memory-mapped file and two processes share the same mapping object, there is only one copy of the data in memory! This has not previously come up in discussions, but could be critical to the performance of your multiple processes.
				joe
****
>
>> More below...
>> On Sun, 21 Mar 2010 13:19:34 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>I have an application that uses enormous amounts of RAM in a very
>>>memory bandwidth intensive way. I recently upgraded my hardware to a
>>>machine with 600% faster RAM and 32-fold more L3 cache. This L3 cache
>>>is also twice as fast as the prior machine's cache. When I benchmarked
>>>my application across the two machines, I gained an 800% improvement
>>>in wall clock time. The new machine's CPU is only 11% faster than the
>>>prior machine. Both processes were tested on a single CPU.
>> ***
>> The question is whether you are measuring multiple threads in a single
>> executable image across multiple cores, or multiple executable images
>> on a single core. Not sure how you know that both processes were
>> tested on a single CPU, since you don't mention how you accomplished
>> this (there are several techniques, but it is important to know which
>> one you used, since each has its own implications for predicting the
>> overall behavior of a system).
>> ****
>>>
>>>I am thinking that all of the above would tend to show that my process
>>>is very memory bandwidth intensive, and thus could not benefit from
>>>multiple threads on the same machine because the bottleneck is memory
>>>bandwidth rather than CPU cycles. Is this analysis correct?
>> ****
>> Nonsense! You have no idea what is going on here! The shared L3 cache
>> could completely wipe out the memory performance issue, reducing your
>> problem to a cache-performance issue. Since you have not conducted the
>> experiment in multiple threading, you have no data to indicate one way
>> or the other what is going on, and it is the particular memory access
>> patterns of YOUR app that matter; therefore, nobody can offer a
>> meaningful estimate based on your L1/L2/L3 cache accesses, whatever
>> they may be.
>> joe
>> ****
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 22 Mar 2010 11:02
"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message news:ioueq5hdsf5ut5pha6ttt88e1ghl4q9l1m(a)4ax.com...

> [...]
>
> You have made an obvious failure here in thinking that if one thread
> takes 1/k the time and the only difference is memory bandwidth, that
> two threads are necessarily LINEAR. Duh! IT IS NOT THE SAME WHEN CACHES
> ARE INVOLVED! YOU HAVE NO DATA! You are jumping to an unwarranted
> conclusion based on what I can at best tell is a coincidence.

(1) People in a more specialized group are coming to the same conclusions that I have derived.

(2) When a process requires essentially random (mostly unpredictable) access to far more memory than can possibly fit into the largest cache, then actual memory access time becomes a much more significant factor in determining actual response time.

> [remainder of Joe's message, quoted in full above, snipped]