From: Hector Santos on 23 Mar 2010 14:54

Peter O, we went through this - it is only by "chance" that you are not seeing faults.

Of course, good engineering and good hardware mold the chance. You have plenty of memory, no other memory-hogging process is in play, and in this test you kept it at a 1.5GB working set, which is good. With the QUAD you can expect to run four threads at 1.5GB each, and now you have a 6GB demand. You should then see natural performance issues - but hey, the machine can hide them too, because it's fast.

But in general, this is where the MEMORY MAP comes in. Even if you ran a single process, you could create a shared MMF DLL that each EXE will share. That keeps the footprint at 1.5GB instead of 6GB, and now you may be able to run even more threads. That is what the simulation helps you determine - and remember, the simulation gives you boundary conditions; you know it won't really work exactly like that in reality.

A shared MMF DLL or region, with one process PER web site, is how browsers like CHROME and IE 9.0 are getting better performance and reliability per web-site TAB. So this direction is not an odd one to take if your OCR is very work-expensive.

Peter Olcott wrote:
> "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message news:Ouu4JOryKHA.1236(a)TK2MSFTNGP06.phx.gbl...
>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:kMednXfoL9CQaTXWnZ2dnUVZ_tmdnZ2d(a)giganews.com...
>>> "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message news:efX%238fqyKHA.5360(a)TK2MSFTNGP06.phx.gbl...
>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:AeidnYxrl7T0vzXWnZ2dnUVZ_judnZ2d(a)giganews.com...
>>>>> I don't want to hear about memory mapped files because I don't want to hear about optimizing virtual memory usage because I don't want to hear about virtual memory until it is proven beyond all possible doubt that my process does not (and cannot be made to be) resident in actual RAM all the time.
>>>> From my understanding of your "test" (simply viewing the number of page faults reported by Task Manager) you can only conclude that there have not been any significant page faults since your application loaded the data, not that your application and data have remained in main memory. If you actually attempt to access all of your code and data and there are no page faults, I would be very surprised. In fact, knowing what I do about the cache management in Windows 7, I'm very surprised that you are not seeing any page faults at all unless you have disabled the caching service.
>>>>
>>>>> Since a test showed that my process did remain in actual RAM for at least twelve hours,
>>>> No. That is not what your simple test showed unless your actual test differed significantly from what you expressed here.
>>>>
>>>> -Pete
>>>>
>>> (1) I loaded my process
>>> (2) I loaded my process data
>>> (3) I waited twelve hours
>>> (4) I executed my process using its loaded data, and there were no page faults reported by the process monitor
>>> (5) Therefore my process data remained entirely resident in actual RAM for at least twelve hours.
>>
>> What program is "process monitor"? Are you referring to the Sysinternals tool or are you referring to Task Manager or Resource Monitor?
>>
>> -Pete
>
> Task Manager
> Process Tab
> View
> Select Columns
> Page Faults

--
HLS
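To make the shared-section idea concrete, here is a minimal Win32 sketch: the first process creates a named, pagefile-backed mapping and loads the OCR data into it; every other process opens the same name and shares the same physical pages, so the 1.5GB exists once instead of once per EXE. The section name, size, and loader call are illustrative assumptions, not code from either poster; a 64-bit process (or smaller mapped views) would be needed for a region this large.

    // Minimal sketch of the shared-MMF idea: several EXEs map one named
    // section instead of each loading its own 1.5GB copy.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const char    *kName = "Local\\OcrSharedData";   // hypothetical name
        const DWORD64  kSize = 1536ULL * 1024 * 1024;    // ~1.5GB region

        HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                         PAGE_READWRITE,
                                         (DWORD)(kSize >> 32), (DWORD)kSize,
                                         kName);
        if (!hMap) {
            printf("CreateFileMapping failed: %lu\n", GetLastError());
            return 1;
        }
        // ERROR_ALREADY_EXISTS tells us another process created it first.
        BOOL creator = (GetLastError() != ERROR_ALREADY_EXISTS);

        void *view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
        if (!view) {
            printf("MapViewOfFile failed: %lu\n", GetLastError());
            return 1;
        }

        if (creator) {
            // Only the first process pays the cost of loading the data.
            // BuildOcrTables(view, kSize);   // hypothetical loader
        }

        // ... all processes/threads now read the same RAM-resident tables ...

        UnmapViewOfFile(view);
        CloseHandle(hMap);
        return 0;
    }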
From: Peter Olcott on 23 Mar 2010 14:54

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:epsZzaryKHA.5360(a)TK2MSFTNGP06.phx.gbl...
> Peter Olcott wrote:
>
>> I still think that the FIFO queue is a good idea. Now I will have multiple requests and on multi-core machines multiple servers.
>
> IMO, it's just that it's an odd approach to load balancing. You are integrating software components, like a web server with a multi-thread-ready listening server, and you are hampering it with single-thread-only FIFO queuing. It introduces other design considerations. Namely, you will need to consider a store-and-forward concept for your requests and delayed responses. But if your request processing is very fast, maybe you don't need to worry about it.

I will probably be implementing the FIFO queue using MySQL. I want customers to have strict first-in, first-out priority; the only thing that will change this is that multiple servers are now feasible. If it costs me an extra 1/2% of amortized response rate, then this is OK. Because multiple servers are now available, I will drop the idea of more than one queue priority.

I know that database access is slow, but this will be amortized over a much longer processing time. This solution should be simple and portable. Also, the database probably has a lot of caching going on, so it probably won't even cost me the typical disk access time of 5 ms.

> In practice the "FIFO" would be at the socket or listening level, with concepts dealing with load balancing by restricting and balancing your connections with worker pools, or simply letting them wait, knowing that processing won't take too long. Some servers have guidelines for waiting limits. For the WEB, I don't recall coming across any specific guideline other than a practical one per implementation. The point is you don't want the customers waiting too long - but what is "too long"?

This sounds like a more efficient approach, but it may lack portability, and it may take longer to get operational. I will already have some sort of SQL learning curve to implement my authentication database. This sounds like an extra learning curve over and above everything else. There are only so many 1000-page books that I can read in a finite amount of time.

>> What is your best suggestion for how I can implement the FIFO queue?
>> (1) I want it to be very fast
>> (2) I want it to be portable across Unix / Linux / Windows, and maybe even Mac OS X
>> (3) I want it to be as robust and fault tolerant as possible.
>
> Any good collection class will do as long as you wrap it with synchronization. Example:
>
> typedef struct _tagTSlaveData {
>    ... data per request .........
> } TSlaveData;
>
> class CBucket : public std::list<TSlaveData>
> {
> public:
>    CBucket() { InitializeCriticalSection(&cs); }
>    ~CBucket() { DeleteCriticalSection(&cs); }
>
>    void Add( const TSlaveData &o )
>    {
>       EnterCriticalSection(&cs);
>       insert(end(), o );
>       LeaveCriticalSection(&cs);
>    }
>
>    BOOL Fetch(TSlaveData &o)
>    {
>       EnterCriticalSection(&cs);
>       BOOL res = !empty();
>       if (res) {
>          o = front();
>          pop_front();
>       }
>       LeaveCriticalSection(&cs);
>       return res;
>    }
> private:
>    CRITICAL_SECTION cs;
> } Bucket;
>
> --
> HLS
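A sketch of the MySQL-backed FIFO described above, using the MySQL C API with an InnoDB table; the table name, columns, and buffer sizes are illustrative assumptions. The transactional SELECT ... FOR UPDATE / DELETE pair is what preserves strict first-in, first-out order even when several servers drain one queue:

    // Sketch of a MySQL-backed FIFO. Assumed schema (illustrative):
    //   CREATE TABLE job_queue (id BIGINT AUTO_INCREMENT PRIMARY KEY,
    //                           payload VARCHAR(512)) ENGINE=InnoDB;
    #include <mysql.h>
    #include <stdio.h>
    #include <string.h>

    static bool Enqueue(MYSQL *db, const char *payload)
    {
        char sql[640];
        // Real code would escape or parameterize payload; sketch only.
        snprintf(sql, sizeof(sql),
                 "INSERT INTO job_queue (payload) VALUES ('%s')", payload);
        return mysql_query(db, sql) == 0;
    }

    static bool Dequeue(MYSQL *db, char *payload, size_t cap)
    {
        bool got = false;
        mysql_query(db, "START TRANSACTION");
        // FOR UPDATE row-locks the oldest job so a second server cannot
        // fetch the same one; it waits until this transaction commits.
        if (mysql_query(db,
            "SELECT id, payload FROM job_queue "
            "ORDER BY id LIMIT 1 FOR UPDATE") == 0)
        {
            MYSQL_RES *res = mysql_store_result(db);
            MYSQL_ROW  row = res ? mysql_fetch_row(res) : NULL;
            if (row && row[0] && row[1]) {
                char del[64];
                strncpy(payload, row[1], cap - 1);
                payload[cap - 1] = '\0';
                snprintf(del, sizeof(del),
                         "DELETE FROM job_queue WHERE id=%s", row[0]);
                mysql_query(db, del);
                got = true;
            }
            if (res) mysql_free_result(res);
        }
        mysql_query(db, got ? "COMMIT" : "ROLLBACK");
        return got;
    }

Enqueue() would be called from the web-server side and Dequeue() from the OCR worker side; with InnoDB row locks, two servers dequeuing concurrently get different rows or block briefly, never the same job.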
From: Peter Olcott on 23 Mar 2010 15:11

Ah, so this is the code that you were suggesting? I won't be able to implement multi-threading until volume grows beyond what a single-core processor can accomplish. I was simply going to use MySQL for the inter-process communication, building and maintaining my FIFO queue.

One other thing that you may be unaware of: std::vector generally beats std::list even for list-based algorithms, including such things as inserting in the middle of the list. The reason for this may be that the expensive memory allocation cost is amortized over more elements with a std::vector, more than enough to pay for the cost of reshuffling a few items. This would probably not work for very long lists. Also, if there were some sort of std::list::reserve(), that would mitigate this cost.

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:O5O%23XiryKHA.5936(a)TK2MSFTNGP04.phx.gbl...
> Example usage of the class below; I added an Add() override to make it easier to add elements for the specific TSlaveData fields:
>
> #include <windows.h>
> #include <conio.h>
> #include <stdio.h>
> #include <list>
> #include <string>
> #include <iostream>
>
> using namespace std;
>
> const DWORD MAX_JOBS = 10;
>
> typedef struct _tagTSlaveData {
>    DWORD jid;          // job number
>    char szUser[256];
>    char szPwd[256];
>    char szHost[256];
> } TSlaveData;
>
> class CBucket : public std::list<TSlaveData>
> {
> public:
>    CBucket() { InitializeCriticalSection(&cs); }
>    ~CBucket() { DeleteCriticalSection(&cs); }
>
>    void Add( const TSlaveData &o )
>    {
>       EnterCriticalSection(&cs);
>       insert(end(), o );
>       LeaveCriticalSection(&cs);
>    }
>
>    void Add(const DWORD jid,
>             const char *user,
>             const char *pwd,
>             const char *host)
>    {
>       TSlaveData sd = {0};
>       sd.jid = jid;
>       strncpy(sd.szUser, user, sizeof(sd.szUser));
>       strncpy(sd.szPwd, pwd, sizeof(sd.szPwd));
>       strncpy(sd.szHost, host, sizeof(sd.szHost));
>       Add(sd);
>    }
>
>    BOOL Fetch(TSlaveData &o)
>    {
>       EnterCriticalSection(&cs);
>       BOOL res = !empty();
>       if (res) {
>          o = front();
>          pop_front();
>       }
>       LeaveCriticalSection(&cs);
>       return res;
>    }
> private:
>    CRITICAL_SECTION cs;
> } Bucket;
>
> void FillBucket()
> {
>    for (DWORD i = 0; i < MAX_JOBS; i++)
>    {
>       Bucket.Add(i, "user", "password", "host");
>    }
> }
>
> //----------------------------------------------------------------
> // Main Thread
> //----------------------------------------------------------------
>
> int main(int argc, char *argv[])
> {
>    FillBucket();
>    printf("Bucket Size: %u\n", (unsigned)Bucket.size());
>    TSlaveData o = {0};
>    while (Bucket.Fetch(o)) {
>       printf("%3lu | %s\n", o.jid, o.szUser);
>    }
>    return 0;
> }
>
> Your mongoose/OCR thingie: mongoose will Bucket.Add() and each spawned OCR thread will do a Bucket.Fetch().
>
> Do it right, and it ROCKS!
>
> --
> HLS
>
> Hector Santos wrote:
>
>> Peter Olcott wrote:
>>
>>> I still think that the FIFO queue is a good idea. Now I will have multiple requests and on multi-core machines multiple servers.
>>
>> IMO, it's just that it's an odd approach to load balancing. You are integrating software components, like a web server with a multi-thread-ready listening server, and you are hampering it with single-thread-only FIFO queuing. It introduces other design considerations. Namely, you will need to consider a store-and-forward concept for your requests and delayed responses. But if your request processing is very fast, maybe you don't need to worry about it.
>> In practice the "FIFO" would be at the socket or listening level, with concepts dealing with load balancing by restricting and balancing your connections with worker pools, or simply letting them wait, knowing that processing won't take too long. Some servers have guidelines for waiting limits. For the WEB, I don't recall coming across any specific guideline other than a practical one per implementation. The point is you don't want the customers waiting too long - but what is "too long"?
>>
>>> What is your best suggestion for how I can implement the FIFO queue?
>>> (1) I want it to be very fast
>>> (2) I want it to be portable across Unix / Linux / Windows, and maybe even Mac OS X
>>> (3) I want it to be as robust and fault tolerant as possible.
>>
>> Any good collection class will do as long as you wrap it with synchronization. Example:
>>
>> [CBucket example snipped - identical to the code quoted in the previous message]
>
> --
> HLS
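For the portability requirement quoted above (Unix / Linux / Windows, maybe Mac OS X), the same bucket idea can be sketched against POSIX threads instead of Win32 critical sections; this assumes pthreads are available (on Windows that would mean pthreads-win32 or an equivalent port), and it adds a condition variable so idle workers block rather than poll:

    // Portable variant of Hector's bucket: std::deque guarded by a pthread
    // mutex, with a condition variable so idle workers sleep until work
    // arrives. Sketch only, under the assumptions stated above.
    #include <pthread.h>
    #include <deque>

    template <typename T>
    class FifoQueue
    {
    public:
        FifoQueue()
        {
            pthread_mutex_init(&mtx_, NULL);
            pthread_cond_init(&cv_, NULL);
        }
        ~FifoQueue()
        {
            pthread_cond_destroy(&cv_);
            pthread_mutex_destroy(&mtx_);
        }

        void Add(const T &o)
        {
            pthread_mutex_lock(&mtx_);
            q_.push_back(o);               // strict FIFO: append at the tail
            pthread_cond_signal(&cv_);     // wake one waiting worker
            pthread_mutex_unlock(&mtx_);
        }

        // Blocks until an item is available, then pops the oldest one.
        void Fetch(T &o)
        {
            pthread_mutex_lock(&mtx_);
            while (q_.empty())
                pthread_cond_wait(&cv_, &mtx_);
            o = q_.front();
            q_.pop_front();
            pthread_mutex_unlock(&mtx_);
        }

    private:
        std::deque<T>   q_;
        pthread_mutex_t mtx_;
        pthread_cond_t  cv_;
    };

Usage mirrors Hector's Bucket: mongoose threads call Add(), OCR workers call Fetch(). std::deque is used rather than std::list or std::vector: it allocates in chunks, which captures much of the allocation amortization Peter notes at the top of this post while keeping O(1) pops from the front (and there is indeed no std::list::reserve()).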
From: Hector Santos on 23 Mar 2010 15:19

OK Peter, Good Luck!

Peter Olcott wrote:
> Ah, so this is the code that you were suggesting? I won't be able to implement multi-threading until volume grows beyond what a single-core processor can accomplish. I was simply going to use MySQL for the inter-process communication, building and maintaining my FIFO queue.
>
> One other thing that you may be unaware of: std::vector generally beats std::list even for list-based algorithms, including such things as inserting in the middle of the list. The reason for this may be that the expensive memory allocation cost is amortized over more elements with a std::vector, more than enough to pay for the cost of reshuffling a few items. This would probably not work for very long lists. Also, if there were some sort of std::list::reserve(), that would mitigate this cost.
>
> [remainder of the quoted message, including the CBucket example, snipped]
--
HLS
From: Peter Olcott on 23 Mar 2010 15:19
"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:uJX%23LpryKHA.5288(a)TK2MSFTNGP05.phx.gbl... > Peter O, we went thru this - it is only by "chance" that > you are not seeing faults. > > Of course, good engineering and good hardware molds the > chance. You So its not really chance. > got plenty of memory, no other hogging process is in play > and in this test, you kept it at 1.5GB working set space > which is good. With the QUAD, you can suspect to do four > threads with 1.5GB each and now you have a 6GB demand. > So you should see natural performance issues but hey, the > machine can hide it to because its fast. > > But in general, this is where the MEMORY MAP comes in. > Even if you did a single process, you can create a shared > MMF dll that each EXE will share. That will keep it at > 1.5GB instead of 6GB and now you may The four instances of 1.5 GB were only for testing purposes, intentionally trying to exceed the memory bandwidth. If I was going to do it for real it would be one large single std::vector of std::vectors shared across multiple threads. I really don't want to deal with Virtual Memory issues at all. I want to do everything that I reasonably can to make them all moot. This may eventually require a real-time OS. > be able to work in even more threads. That is what the > simulation helps you determine and remember the simulation > helps gives you boundary conditions where you know it > won't really work like that in reality. > > The shared mmf dll or region with a per process PER web > site is how the browsrs like CHROME and IE 9.0 is going > for better web browser performance and reliability per web > site TAB. So this direction is not odd to do of your > OCR is very work expensive. > > > Peter Olcott wrote: > >> "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in >> message news:Ouu4JOryKHA.1236(a)TK2MSFTNGP06.phx.gbl... >>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message >>> news:kMednXfoL9CQaTXWnZ2dnUVZ_tmdnZ2d(a)giganews.com... >>>> "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in >>>> message news:efX%238fqyKHA.5360(a)TK2MSFTNGP06.phx.gbl... >>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in >>>>> message >>>>> news:AeidnYxrl7T0vzXWnZ2dnUVZ_judnZ2d(a)giganews.com... >>>>>> I don't want to hear about memory mapped files >>>>>> because I don't want to hear about optimizing virtual >>>>>> memory usage because I don't want to hear about >>>>>> virtual memory until it is proven beyond all possible >>>>>> doubt that my process does not (and can not be made >>>>>> to be) resident in actual RAM all the time. >>>>> From my understanding of your "test" (simply viewing >>>>> the number of page faults reported by task manager) >>>>> you can only conclude that there have not been any >>>>> significant page faults since your application loaded >>>>> the data, not that your application and data have >>>>> remined in main memory. If you actually attempt to >>>>> access all of your code and data and there are no page >>>>> faults, I would be very surprised. In fact, knowing >>>>> what I do about the cache management in Windows 7, I'm >>>>> very surprised that you are not seeing any page faults >>>>> at all unless you have disabled the caching service. >>>>> >>>>>> Since a test showed that my process did remain in >>>>>> actual RAM for at least twelve hours, >>>>> No. That is not what your simple test showed unless >>>>> your actual test differed significantly from what you >>>>> expressed here. 
>
> --
> HLS
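The page-fault dispute running through this exchange can also be settled programmatically rather than by watching a Task Manager column: PSAPI exposes the same per-process fault counter, and VirtualLock is the documented way to actually pin pages in RAM instead of inferring residency from an idle twelve-hour run. A sketch, with the buffer and its size as illustrative stand-ins for the OCR data:

    // Sketch: read the process page-fault counter directly (the number Task
    // Manager shows) and pin a buffer in physical RAM with VirtualLock.
    #include <windows.h>
    #include <psapi.h>   // link with psapi.lib
    #include <stdio.h>

    int main()
    {
        SIZE_T size = 64 * 1024 * 1024;   // stand-in for the OCR data size
        void *data = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                                  PAGE_READWRITE);

        // The default working-set quota is small; raise it before
        // VirtualLock, or locking a large buffer will fail.
        SetProcessWorkingSetSize(GetCurrentProcess(),
                                 size + (16 << 20), size + (32 << 20));

        if (!VirtualLock(data, size))
            printf("VirtualLock failed: %lu\n", GetLastError());

        PROCESS_MEMORY_COUNTERS pmc;
        pmc.cb = sizeof(pmc);
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
            printf("Page faults so far: %lu\n", pmc.PageFaultCount);

        // ... touch every page of data here and re-read PageFaultCount;
        // a delta of zero means the data really was resident, which is
        // the access test Pete Delgado says the twelve-hour run never did.

        VirtualUnlock(data, size);
        VirtualFree(data, 0, MEM_RELEASE);
        return 0;
    }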