From: Hector Santos on 23 Mar 2010 01:54

Peter Olcott wrote:
> "Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
> news:etOekekyKHA.5036(a)TK2MSFTNGP02.phx.gbl...
>> Hmmmmm, you mean two threads in one process?
>>
>> What is this:
>>
>>    num = Data[num]
>>
>> Do you mean:
>>
>>    num = Data[i];
>
> No I mean it just like it is. I init all of memory with
> random numbers and then access the memory locations
> referenced by these random numbers in a tight loop. This
> attempts to force memory bandwidth use to its limit. Even
> with four cores I do not reach the limit.

Ok.

> What are the heuristics for making a process thread safe?
> (1) Keep all data in locals to the best extent possible.
> (2) Eliminate the need for global data that must be written
> to, if possible.
> (3) Global data that must only be read from is OK.
> (4) Only use thread-safe libraries.
>
> I think if I can follow all those rules, then the much more
> complex rules aren't even needed.
>
> Did I miss anything?

That's correct. If the global data is READ ONLY and never changes, you
will be OK. You can write to GLOBAL data only if you synchronize the
resource/object, whatever it is. You can use what are called
Reader/Writer locks to do this very efficiently.

You will probably need to pass or send results back to the calling
thread or main thread, or maybe display something. Depending on what
you need here, there are good solutions. Consider the method I used
with TThreadData as the thread proc parameter. That can be global, to
easily pass results from the thread.

Some other things off the top of my head:

- Never try to use time to synchronize "things" or behavior. Use
  kernel objects (mutexes, etc.).
- Keep the local block data small. If it must be large, use the heap.
- Make sure the thread does not do a lot of context switching where
  there is little work done.

--
HLS
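The Reader/Writer lock approach Hector mentions can be sketched with the
Win32 slim reader/writer lock (SRWLOCK). This is only a minimal
illustration, not code from the thread; the global table and the
LookupValue/StoreValue helpers are made-up names standing in for
whatever shared data the application actually holds.

    #include <windows.h>
    #include <map>

    // Read-mostly global data protected by a slim reader/writer lock.
    std::map<int, int> g_table;        // illustrative shared data
    SRWLOCK g_lock = SRWLOCK_INIT;     // statically initialized R/W lock

    // Many threads may hold the shared (read) lock at the same time.
    int LookupValue(int key)
    {
        AcquireSRWLockShared(&g_lock);
        std::map<int, int>::const_iterator it = g_table.find(key);
        int value = (it != g_table.end()) ? it->second : -1;
        ReleaseSRWLockShared(&g_lock);
        return value;
    }

    // A writer takes the lock exclusively, blocking readers only while
    // the update is in progress.
    void StoreValue(int key, int value)
    {
        AcquireSRWLockExclusive(&g_lock);
        g_table[key] = value;
        ReleaseSRWLockExclusive(&g_lock);
    }

SRWLOCK requires Vista or later; on older Windows, or for portability,
a plain mutex or a pthread rwlock would play the same role.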
From: Hector Santos on 23 Mar 2010 03:40

Peter Olcott wrote:
> "Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
> news:etOekekyKHA.5036(a)TK2MSFTNGP02.phx.gbl...
>> Hmmmmm, you mean two threads in one process?
>>
>> What is this:
>>
>>    num = Data[num]
>>
>> Do you mean:
>>
>>    num = Data[i];
>
> No I mean it just like it is. I init all of memory with
> random numbers and then access the memory locations
> referenced by these random numbers in a tight loop. This
> attempts to force memory bandwidth use to its limit. Even
> with four cores I do not reach the limit.

OK, but I guess I don't see this:

    uint32 num;
    for (uint32 r = 0; r < repeat; r++)
        for (uint32 i = 0; i < size; i++)
            num = Data[num];

num is not initialized, but we can assume it is zero to start. If you
have 10 random numbers filled in:

    2 1 0 5 0 9 0 0 4 5

then the iterations are:

    num     num = Data[num]
    -----   ---------------
    0       2
    2       0
    0       2

and so on. You are potentially hitting only two spots. The stress
points come from having a large range and doing far jumps, back and
forth, that go beyond the working set.

Read the MSDN information on GetProcessWorkingSetSize() and
SetProcessWorkingSetSize(). Increasing the minimum will bring in more
data from VM; however, the OS does not guarantee it. It works on a
first come, first served basis. Since you would not have a 2nd
INSTANCE, you won't have any competition, so it might work very nicely
for you.

This pressure is what the Memory Load % will show. If you walk the
array serially from 0 to size, you will see that memory load
percentage grow. When it is jumping randomly, it will be lower because
we are not demanding data from VM. You are the judge of what better
emulates your memory access, but for a stress simulation you need to
have it access the entire range. In production you will not be
stressing memory this hard, so if you fine-tune under this stress,
your program will ROCK!

PS: I noticed the rand() % size is too short; rand() is limited to
RAND_MAX, which is 32K. Change that to:

    (rand() * rand()) % size

to get a random range from 0 to size-1. I think that's right; maybe
Joe can give us a good random generator here, but the above does seem
to provide practical, decent randomness for this task.

--
HLS
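A hedged sketch of how those two working-set calls might be used. The
256 MB / 512 MB figures are purely illustrative, and, as noted above,
the OS treats the requested minimum as a hint rather than a guarantee.

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        SIZE_T minSize = 0, maxSize = 0;
        HANDLE hProcess = GetCurrentProcess();

        // Query the current working-set limits for this process.
        if (GetProcessWorkingSetSize(hProcess, &minSize, &maxSize))
            printf("working set: min=%lu bytes, max=%lu bytes\n",
                   (unsigned long)minSize, (unsigned long)maxSize);

        // Request a larger working set (illustrative: 256 MB min, 512 MB max).
        // The call can fail, and the memory manager may trim the set later.
        if (!SetProcessWorkingSetSize(hProcess,
                                      256u * 1024 * 1024,
                                      512u * 1024 * 1024))
            printf("SetProcessWorkingSetSize failed, error %lu\n",
                   GetLastError());

        return 0;
    }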
From: Hector Santos on 23 Mar 2010 04:55

Hector Santos wrote:
> PS: I noticed the rand() % size is too short; rand() is limited to
> RAND_MAX, which is 32K. Change that to:
>
>    (rand() * rand()) % size
>
> to get a random range from 0 to size-1. I think that's right; maybe
> Joe can give us a good random generator here, but the above does seem
> to provide practical, decent randomness for this task.

Peter, using the above RNG seems to be a better test since it hits a
wider spectrum. With the earlier one, it was only hitting ranges up to
32K.

I also noticed that when the 32K RNG was used, a dynamic array was 1 to
6 times faster than using std::vector. But when using the above RNG,
they were both about the same. That is interesting.

--
HLS
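A minimal sketch of that kind of comparison: fill the buffer with the
widened (rand() * rand()) % size indices and time the pointer-chase.
The size and repeat values are illustrative, and the std::vector
variant shown here would simply be swapped for a raw new[] buffer to
reproduce the dynamic-array side of the test.

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>

    typedef unsigned int uint32;

    int main()
    {
        const uint32 size   = 100000000;  // illustrative: ~400 MB of uint32
        const uint32 repeat = 10;

        std::vector<uint32> Data(size);
        srand((unsigned)time(0));

        // rand() alone only reaches RAND_MAX (32767 with MSVC), so two
        // calls are multiplied to cover the whole 0..size-1 range.
        for (uint32 i = 0; i < size; i++)
            Data[i] = (uint32)(((unsigned long long)rand() * rand()) % size);

        clock_t start = clock();
        uint32 num = 0;
        for (uint32 r = 0; r < repeat; r++)
            for (uint32 i = 0; i < size; i++)
                num = Data[num];              // random pointer-chase
        double secs = double(clock() - start) / CLOCKS_PER_SEC;

        // Print num so the compiler cannot optimize the loop away.
        printf("num=%u  elapsed=%.2f s\n", num, secs);
        return 0;
    }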
From: Peter Olcott on 23 Mar 2010 10:16

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:u8xPnamyKHA.2552(a)TK2MSFTNGP04.phx.gbl...
> Hector Santos wrote:
>
>> PS: I noticed the rand() % size is too short; rand() is limited to
>> RAND_MAX, which is 32K. Change that to:
>>
>>    (rand() * rand()) % size
>>
>> to get a random range from 0 to size-1. I think that's right; maybe
>> Joe can give us a good random generator here, but the above does
>> seem to provide practical, decent randomness for this task.
>
> Peter, using the above RNG seems to be a better test since it hits a
> wider spectrum. With the earlier one, it was only hitting ranges up
> to 32K.
>
> I also noticed that when the 32K RNG was used, a dynamic array was
> 1 to 6 times faster than using std::vector. But when using the above
> RNG, they were both about the same. That is interesting.
>
> --
> HLS

I made this adaptation and it slowed down by about 500%, due to a much
smaller cache hit ratio. It still scaled up to four cores with 1.5 GB
each, and four concurrent processes only took about 50% more time than
a single process.

I will probably engineer my new technology to be able to handle
multiple threads, if all that I have to do is implement the heuristics
that I mentioned. Since my first server will only have a single core,
on that server it will only have a single thread.

I still think that the FIFO queue is a good idea. Now I will have
multiple requests, and on multi-core machines, multiple servers. What
is your best suggestion for how I can implement the FIFO queue?

(1) I want it to be very fast.
(2) I want it to be portable across Unix / Linux / Windows, and maybe
    even Mac OS X.
(3) I want it to be as robust and fault tolerant as possible.

It may simply provide 32-bit hexadecimal integer names of input PNG
files; these names roll over to the next incremental number.
Alternatively, they may be HTTP connection numbers. I have to learn
about HTTP before I will know what I am doing here. Since all
customers (including free trial customers) will have accounts with
valid email addresses, I can always email the results if the
connection is lost.
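One way to sketch the portable FIFO queue Peter asks about is a
mutex-plus-condition-variable wrapper around std::queue. This assumes a
C++11 compiler (std::mutex, std::condition_variable), and the Job
struct with its 32-bit requestId is just an illustrative stand-in for
whatever the queue would actually carry.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>

    // Illustrative job record: a 32-bit request/file name.
    struct Job {
        uint32_t requestId;
    };

    class FifoQueue {
    public:
        // Producer side: push a job and wake one waiting consumer.
        void Push(const Job& job) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                queue_.push(job);
            }
            cond_.notify_one();
        }

        // Consumer side: block until a job is available, then pop it.
        Job Pop() {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this] { return !queue_.empty(); });
            Job job = queue_.front();
            queue_.pop();
            return job;
        }

    private:
        std::queue<Job> queue_;
        std::mutex mutex_;
        std::condition_variable cond_;
    };

This covers the fast and portable requirements; surviving a crash would
additionally mean persisting the queue contents (for example to a file
or a database), which the sketch does not attempt.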
From: Pete Delgado on 23 Mar 2010 12:43
"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:AeidnYxrl7T0vzXWnZ2dnUVZ_judnZ2d(a)giganews.com... > > I don't want to hear about memory mapped files because I don't want to > hear about optimizing virtual memory usage because I don't want to hear > about virtual memory until it is proven beyond all possible doubt that my > process does not (and can not be made to be) resident in actual RAM all > the time. From my understanding of your "test" (simply viewing the number of page faults reported by task manager) you can only conclude that there have not been any significant page faults since your application loaded the data, not that your application and data have remined in main memory. If you actually attempt to access all of your code and data and there are no page faults, I would be very surprised. In fact, knowing what I do about the cache management in Windows 7, I'm very surprised that you are not seeing any page faults at all unless you have disabled the caching service. > > Since a test showed that my process did remain in actual RAM for at least > twelve hours, No. That is not what your simple test showed unless your actual test differed significantly from what you expressed here. -Pete |