From: Joseph M. Newcomer on 20 Mar 2010 19:18

See below...
On Sat, 20 Mar 2010 13:02:02 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:OshXHLFyKHA.5576(a)TK2MSFTNGP05.phx.gbl...
>> Hector Santos wrote:
>>
>>>> <not yelling, emphasizing>
>>>> MEMORY BANDWIDTH SPEED IS THE BOTTLENECK
>>>> </not yelling, emphasizing>
>>
>>> <BIG>
>>> YOU DON'T KNOW WHAT YOU ARE DOING!
>>> </BIG>
>>>
>>> You don't have a freaking CRAY application need! Plus if you said your process time is 100ms or less, then YOU DON'T KNOW what you are talking about if you say you can't handle more than one thread.
>>>
>>> It means YOU PROGRAMMED YOUR SOFTWARE WRONG!
>>
>> Look, you can't take a single-thread process that demands 4GB of meta processing, believe that this is optimized for a WINTEL QUAD machine to run as single-thread process instances, and then use it as a BASELINE for any other WEB-SERVICE design. It's foolish.
>
>Do you want me to paypal you fifty dollars? All that I need is some way to get your paypal email address. You can email me at PeteOlcott(a)gmail.com. Only send me your paypal address, because I never check this mailbox. If you do send me your paypal address, please tell me so I can check this email box that I never otherwise check.
>
>> You have to redesign your OCR software to make it thread-ready and use sharable data so that it is only LOADED once and USED many times.
>>
>> If you have thousands of font glyph files, then you can use a memory-mapped class array of shared data. I guarantee you that will allow you to run multiple threads.
>
>I am still convinced that multiple threads for my OCR process is a bad idea. I think that the only reason you are not seeing this is that you don't understand my technology well enough. I also don't think that there exists any possible redesign that would not reduce performance. The design is fundamentally based on leveraging large amounts of RAM to increase speed. Because I am specifically leveraging RAM to increase speed, the architecture is necessarily memory bandwidth intensive.***
****
Why? What evidence do you have to suggest this would be a "bad idea"? It would allow you to have more than one recognition going on concurrently in the same image, and if you believe the whole image is going to remain resident, then the second thread would cause no page faults and would therefore effectively be "free". If you are running multicore, then you should be able to get throughput equal to the number of cores, which means concurrent requests would fall within the magical 500ms limit, which you thought was so critical last week, so critical it was non-negotiable. I guess it wasn't, since you clearly don't care about performance this week. Notice that multithreading doesn't require additional memory bandwidth: you most likely are going to be running on multiple cores with multiple caches, and if you aren't, it isn't going to require any more memory bandwidth on a single core, because the cache is probably going to smooth this out.
joe
****
>
>> But if you insist it can only be a FIFO single-thread processor, well, you are really wasting people's time here, because everything else you want to do contradicts your limitations. You want to put a web server INTO your OCR, when in reality you need to put your OCR into your WEB SERVER.
>>
>> --
>> HLS
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on 20 Mar 2010 19:25

See below...
On Sat, 20 Mar 2010 15:12:43 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>>>(5) OCR processing time is less than 100 milliseconds.
>> ****
>> As you point out, if there are 10 requests pending, this means 1 sec, which violates your 500ms goal. But if you had a concurrent server, it would be (nrequests * 100)/ncores milliseconds, so for a quad-core server with 10 pending requests and a 4-thread pool you would have only 250ms, within your goals. If response time is so critical, why are you not running multithreaded already?
>> ****
>
>If I get more than an average of one request per second, I will get another server. My process does not improve with multiple cores; it does get faster with faster memory. 800% faster memory at the same core speed provided an 800% increase in speed. Two processes on a quad core resulted in half the speed for each.
****
Well, if you have not designed your code to run multithreaded, how do you KNOW it won't run faster if you add more cores? If it is single-threaded, it will run at EXACTLY the same speed on an 8-core system as it does on a uniprocessor, because you have no concurrency. But if you want to process 8 requests, you currently require ~800ms, which violates your apparently non-negotiable 500ms limit, whereas if you run multithreaded, an 8-core system could handle all 8 of them concurrently, meaning your total processing time on 8 concurrent requests is ~100ms. Or have I missed something here, and was the 500ms limit abandoned?

Seriously, how hard can it be to convert code that requires no locking to multithreaded?
joe
****
>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 20 Mar 2010 23:27

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message news:felaq598kd3crokhet06bdmvl27bot4al6(a)4ax.com...
> See below...
> On Sat, 20 Mar 2010 13:02:02 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>
>>"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:OshXHLFyKHA.5576(a)TK2MSFTNGP05.phx.gbl...
>>> Hector Santos wrote:
>>>
>>>>> <not yelling, emphasizing>
>>>>> MEMORY BANDWIDTH SPEED IS THE BOTTLENECK
>>>>> </not yelling, emphasizing>
>>>
>>>> <BIG>
>>>> YOU DON'T KNOW WHAT YOU ARE DOING!
>>>> </BIG>
>>>>
>>>> You don't have a freaking CRAY application need! Plus if you said your process time is 100ms or less, then YOU DON'T KNOW what you are talking about if you say you can't handle more than one thread.
>>>>
>>>> It means YOU PROGRAMMED YOUR SOFTWARE WRONG!
>>>
>>> Look, you can't take a single-thread process that demands 4GB of meta processing, believe that this is optimized for a WINTEL QUAD machine to run as single-thread process instances, and then use it as a BASELINE for any other WEB-SERVICE design. It's foolish.
>>
>>Do you want me to paypal you fifty dollars? All that I need is some way to get your paypal email address. You can email me at PeteOlcott(a)gmail.com. Only send me your paypal address, because I never check this mailbox. If you do send me your paypal address, please tell me so I can check this email box that I never otherwise check.
>>
>>> You have to redesign your OCR software to make it thread-ready and use sharable data so that it is only LOADED once and USED many times.
>>>
>>> If you have thousands of font glyph files, then you can use a memory-mapped class array of shared data. I guarantee you that will allow you to run multiple threads.
>>
>>I am still convinced that multiple threads for my OCR process is a bad idea. I think that the only reason you are not seeing this is that you don't understand my technology well enough. I also don't think that there exists any possible redesign that would not reduce performance. The design is fundamentally based on leveraging large amounts of RAM to increase speed. Because I am specifically leveraging RAM to increase speed, the architecture is necessarily memory bandwidth intensive.***
> ****
> Why? What evidence do you have to suggest this would be a "bad idea"? It would allow you to have more than one recognition going on concurrently, in the same image, and if you

(I have already said these things several times before.)

Empirical:
(1) I tried it and it doesn't work; it cuts the performance of each process by at least half.
(2) The fact that I achieved an 800% performance improvement between one machine and another, where the primary difference was 800% faster RAM, shows that my process must be taking essentially all of the memory bandwidth.

Analytical:
If my process is already taking ALL of the memory bandwidth, then adding another thread of execution cannot possibly help, because the process is memory bandwidth bound, not CPU bound.

> believe the whole image is going to remain resident, then the second thread would cause no page faults and therefore effectively be "free". If you are running multicore, then you

But each process would still have to take turns accessing the memory bus. The memory bus has a finite maximum access speed. If one process is already using ALL of this up, then another process or thread cannot possibly help.

> should be able to get throughput equal to the number of cores, which means concurrent

Only for CPU-bound processes, not for memory-access-bound processes.

> requests would fall within the magical 500ms limit, which you thought was so critical last week, so critical it was non-negotiable. I guess it wasn't, since you clearly don't care about performance this week. Notice that multithreading doesn't require additional memory bandwidth, because you most likely are going to be running on multiple cores, with multiple caches, and if you aren't, it isn't going to require any more memory bandwidth on a single core because the cache is probably going to smooth this out.
> joe
> ****

Nope, not in my case. In my case I must have access to a much larger DFA than will possibly fit into cache. With the redesign there are often times that a DFA will fit into cache. With this new design I may have 1,000 (or more) DFAs all loaded at once, thus still requiring fast RAM access. Some of these DFAs will not fit into cache. I have not tested the new design yet. In the case of the new design it might be possible to gain from multiple cores.

Much more interesting to me than this is testing your theory about cache hit ratio. If you are right, then my basic design will gain a huge amount of performance. The cache hit ratio could improve from 5% to 95%.

>>
>>> But if you insist it can only be a FIFO single-thread processor, well, you are really wasting people's time here, because everything else you want to do contradicts your limitations. You want to put a web server INTO your OCR, when in reality you need to put your OCR into your WEB SERVER.
>>>
>>> --
>>> HLS
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 20 Mar 2010 23:30

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message news:52maq5h8eg3490hrv06l066r76fp02fo0u(a)4ax.com...
> See below...
> On Sat, 20 Mar 2010 15:12:43 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>>>>(5) OCR processing time is less than 100 milliseconds.
>>> ****
>>> As you point out, if there are 10 requests pending, this means 1 sec, which violates your 500ms goal. But if you had a concurrent server, it would be (nrequests * 100)/ncores milliseconds, so for a quad-core server with 10 pending requests and a 4-thread pool you would have only 250ms, within your goals. If response time is so critical, why are you not running multithreaded already?
>>> ****
>>
>>If I get more than an average of one request per second, I will get another server. My process does not improve with multiple cores; it does get faster with faster memory. 800% faster memory at the same core speed provided an 800% increase in speed. Two processes on a quad core resulted in half the speed for each.
> ****
> Well, if you have not designed your code to run multithreaded, how do you KNOW it won't run faster if you add more cores? If it is single-threaded, it will run at EXACTLY the same speed on an 8-core system as it does on a uniprocessor, because you have no concurrency. But if you want to process 8 requests, you currently require ~800ms, which violates your apparently non-negotiable 500ms limit, whereas if you run multithreaded, an 8-core system could handle all 8 of them concurrently, meaning your total processing time on 8 concurrent requests is ~100ms. Or have I missed something here, and was the 500ms limit abandoned?
>
> Seriously, how hard can it be to convert code that requires no locking to multithreaded?
> joe

How can I empirically test exactly how much of the total memory bandwidth my process is taking up?

Would you agree that if the empirical test shows that my single process is taking up 100% of the memory bandwidth, then multiple cores or multiple threads could not help increase speed?

> ****
>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Hector Santos on 21 Mar 2010 00:38
Peter Olcott wrote:

> Would you agree that if the empirical test shows that my single process is taking up 100% of the memory bandwidth, that multiple cores or multiple threads could not help increase speed?

You are thinking about this all wrong. You have quantum-based context switching you can't stop, even for a single-thread process. In other words, you will never have 100% full exclusive control of MEMORY ACCESS - never. If you did, nothing else would run.

What I am saying is this: suppose you have 10 lines of DFA C code; the compiler creates OP CODES for those 10 lines. Each OP CODE has a fixed frequency cycle. When the accumulated frequency reaches a QUANTUM (~15ms), you get a context switch - in other words, your code is preempted (stopped), swapped out, and Windows gives all other threads a chance to run. That gives other threads in your process, if it were multi-threaded, a chance to do the same type of memory access work. Since it is READ ONLY, there is no contention. If your preempted thread had BLOCKED it, then you would have contention or even a deadlock - but you are not doing that. You are reading only READ ONLY memory - which has a maximum access rate.

Now comes a MULTI-CORE machine, and you have two or more threads. The speed gain is that there is NO CONTEXT SWITCHING - you still have the same memory access, but it would be no slower than on a single CPU. Your speed comes from less context switching. Understand?

In short:

   single cpu:      speed lost due to context switching
   multi cpu/core:  less context switching, more resident time

You cannot think in terms of a single-thread process, because there is no advantage for it on a multi-core/cpu machine. The INTEL multi-core chips have advanced technology to help multi-threaded applications. Single-thread processes cannot benefit on a multi-core machine. They must be designed for threads to see any benefits.

If you want to read up on it, check out the Intel technical documents, like this one:

http://download.intel.com/technology/architecture/sma.pdf

Specifically read about SMA, "Smart Memory Access".

The bottom line is really simple: you have a single process with a huge memory load. Each additional instance redundantly creates another huge memory load, and that alone will cause serious SYSTEM-WIDE performance degradation, with huge page faulting and context switching delays. You will never get any improvement until you change your memory usage to intelligent sharable data and use threads. When done correctly, you will gain the benefits provided by the OS and the machine.

You really need to look at this as a whole: 1 process with X threads vs X single-thread processes. You need to trust us that these are NOT the same when the DATA is HUGE! In the threaded model, it is shared. In the non-threaded model, it is duplicated for each instance - and that will murder you! If you had NO HUGE data requirement, then they would be more nearly equal, because then it's just CODE.

Now, it is conceivable that for your specific application, you might find that X can be 5-10 threads before you see a performance issue that isn't to your liking. Show me how you are using std::vector with your files, and I will create a simulator for you to PROVE to you how your thinking is all wrong. This simulator will allow you to fine-tune it and determine the boundary conditions for your performance.

While you have 20,000 hours into this WITHOUT even exploring high-end thread designs, I have 6 years in Intel RMX (http://en.wikipedia.org/wiki/RMX_(operating_system)), which was considered one of the early Intel "multi-thread" frameworks of the kind we have today, and it gave me an early, natural understanding when NT 3.1 arrived (17 years ago?). I have done exclusively high-end multi-threaded commercial server products since then. Count the hours! I can assure you, your single-process thinking is wrong.

--
HLS