From: Woody on 22 Mar 2010 03:49

On Mar 21, 11:19 am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way.

Unfortunately, you cannot think of a single "memory bandwidth" as being the limiting factor. The cache behavior is the biggest determinant of speed, because reading/writing is much faster to cache than to main memory (that's why cache is there).

To truly optimize an application, and to answer your question about more threads, you must consider the low-level details of memory usage, such as: What is the size of the cache lines? How is memory interleaved? What is the size of the translation look-aside buffer? How is cache shared among the cores? Is there one memory controller per processor (if you have multiple processors), or per core?

There are tools such as AMD CodeAnalyst (free) or Intel VTune ($$$) that measure these things. Once you know where the bottlenecks really are, you can go to work rearranging your code to keep all the computer's resources busy. You will need to run all the tests on your actual app, or something close to it, for meaningful results.

BTW, the same details that determine memory speed also make the comparison of CPU speed meaningless.
From: Peter Olcott on 22 Mar 2010 07:28

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:%23IrS0xYyKHA.2012(a)TK2MSFTNGP04.phx.gbl...

> Here is the result using a 1.5GB readonly memory mapped file. I started
> with 1 single process thread, then switched to 2 threads, then 4, 6, 8,
> 10 and 12 threads. Notice how the processing time for the earlier
> threads started high but decreased with the later threads. This was the
> caching effect of the readonly memory file. Also note the Global Memory
> Status *MEMORY LOAD* percentage. For my machine, it is at 19% at steady
> state, but as expected it shoots up when dealing with this large memory
> mapped file. I probably can fine-tune the map views better, but they are
> set as read only. Well, I'll leave the OP to figure out memory map
> coding for his patented DFA meta file process.
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> ---------------------------------------
> Time: 2984 | Elapsed: 0
> ---------------------------------------
> Total Client Time: 2984
>
> V:\wc5beta>testpeter3t /s:3000000 /t:2 /r:1
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> * Wait For Thread Completion
> - Memory Load: 96%
> * Done
> ---------------------------------------
> 0 | Time: 5407 | Elapsed: 0
> 1 | Time: 4938 | Elapsed: 0
> ---------------------------------------
> Total Time: 10345
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:4
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0 | Time: 6313 | Elapsed: 0
> 1 | Time: 5844 | Elapsed: 0
> 2 | Time: 5500 | Elapsed: 0
> 3 | Time: 5000 | Elapsed: 0
> ---------------------------------------
> Total Time: 22657
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:6
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0 | Time: 6359 | Elapsed: 0
> 1 | Time: 5891 | Elapsed: 0
> 2 | Time: 5547 | Elapsed: 0
> 3 | Time: 5047 | Elapsed: 0
> 4 | Time: 4875 | Elapsed: 0
> 5 | Time: 4141 | Elapsed: 0
> ---------------------------------------
> Total Time: 31860
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:8
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 16
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> - Creating thread 6
> - Creating thread 7
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> - Resuming thread# 6 in 478 msecs.
> - Resuming thread# 7 in 358 msecs.
> * Wait For Thread Completion
> - Memory Load: 96%
> * Done
> ---------------------------------------
> 0 | Time: 6203 | Elapsed: 0
> 1 | Time: 5734 | Elapsed: 0
> 2 | Time: 5391 | Elapsed: 0
> 3 | Time: 4891 | Elapsed: 0
> 4 | Time: 4719 | Elapsed: 0
> 5 | Time: 3984 | Elapsed: 0
> 6 | Time: 3500 | Elapsed: 0
> 7 | Time: 3125 | Elapsed: 0
> ---------------------------------------
> Total Time: 37547
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:10
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 0
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> - Creating thread 6
> - Creating thread 7
> - Creating thread 8
> - Creating thread 9
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> - Resuming thread# 6 in 478 msecs.
> - Resuming thread# 7 in 358 msecs.
> - Resuming thread# 8 in 962 msecs.
> - Resuming thread# 9 in 464 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0 | Time: 7234 | Elapsed: 0
> 1 | Time: 6766 | Elapsed: 0
> 2 | Time: 6422 | Elapsed: 0
> 3 | Time: 5922 | Elapsed: 0
> 4 | Time: 5750 | Elapsed: 0
> 5 | Time: 5016 | Elapsed: 0
> 6 | Time: 4531 | Elapsed: 0
> 7 | Time: 4125 | Elapsed: 0
> 8 | Time: 3203 | Elapsed: 0
> 9 | Time: 2703 | Elapsed: 0
> ---------------------------------------
> Total Time: 51672
>
> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:12
> - size   : 3000000
> - memory : 1536000000 (1500000K)
> - repeat : 1
> - Memory Load : 25%
> - Allocating Data .... 16
> * Starting threads
> - Creating thread 0
> - Creating thread 1
> - Creating thread 2
> - Creating thread 3
> - Creating thread 4
> - Creating thread 5
> - Creating thread 6
> - Creating thread 7
> - Creating thread 8
> - Creating thread 9
> - Creating thread 10
> - Creating thread 11
> * Resuming threads
> - Resuming thread# 0 in 41 msecs.
> - Resuming thread# 1 in 467 msecs.
> - Resuming thread# 2 in 334 msecs.
> - Resuming thread# 3 in 500 msecs.
> - Resuming thread# 4 in 169 msecs.
> - Resuming thread# 5 in 724 msecs.
> - Resuming thread# 6 in 478 msecs.
> - Resuming thread# 7 in 358 msecs.
> - Resuming thread# 8 in 962 msecs.
> - Resuming thread# 9 in 464 msecs.
> - Resuming thread# 10 in 705 msecs.
> - Resuming thread# 11 in 145 msecs.
> * Wait For Thread Completion
> - Memory Load: 97%
> * Done
> ---------------------------------------
> 0  | Time: 7984 | Elapsed: 0
> 1  | Time: 7515 | Elapsed: 0
> 2  | Time: 7188 | Elapsed: 0
> 3  | Time: 6672 | Elapsed: 0
> 4  | Time: 6500 | Elapsed: 0
> 5  | Time: 5781 | Elapsed: 0
> 6  | Time: 5250 | Elapsed: 0
> 7  | Time: 4953 | Elapsed: 0
> 8  | Time: 3953 | Elapsed: 0
> 9  | Time: 3484 | Elapsed: 0
> 10 | Time: 2750 | Elapsed: 0
> 11 | Time: 2547 | Elapsed: 0
> ---------------------------------------
> Total Time: 64577
>
> --
> HLS

OK, and where is the summary conclusion?
Also, by using a memory mapped file, your process would have entirely different behavior than mine.

I know that it is possible that you could have been right all along about this, and I could be wrong. I know this because of a term that I coined: [Ignorance Squared]. [Ignorance Squared] is the process by which a lack of understanding is perceived, by the one who lacks this understanding, as disagreement. Whereas the one who has understanding knows that the ignorant person lacks understanding, the ignorant person lacks this insight, and is thus ignorant even of his own ignorance; hence the term [Ignorance Squared].

Now that I have a way to empirically validate your theories against mine (a way that I dreamed up last night while sleeping), I will do this.
From: Peter Olcott on 22 Mar 2010 07:33

It is very hard to reply to messages with quoting turned off; please turn quoting on. Also, please tell me how quoting gets turned off.

When a process requires continual, essentially random access to data that is very much larger than the largest cache, then I think that memory bandwidth could be a limiting factor in performance.

"Woody" <ols6000(a)sbcglobal.net> wrote in message news:7ff25c57-b2a7-4b31-b3df-bebcf34ead80(a)d37g2000yqn.googlegroups.com...

> On Mar 21, 11:19 am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> > I have an application that uses enormous amounts of RAM in a
> > very memory bandwidth intensive way.
>
> Unfortunately, you cannot think of a single "memory bandwidth" as being
> the limiting factor. The cache behavior is the biggest determinant of
> speed, because reading/writing is much faster to cache than to main
> memory (that's why cache is there).
> [remainder of Woody's message, quoted in full above, snipped]
From: Joseph M. Newcomer on 22 Mar 2010 10:31

See below...

On Sun, 21 Mar 2010 21:06:20 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:vmvcq55tuhj1lunc6qcdi9uejup4jg1i4e(a)4ax.com...
>> Note in the i7 architecture the L3 cache is shared across all CPUs, so
>> you are less likely to be hit by raw memory bandwidth (which compared
>> to a CPU is dead-slow), and the answer as to whether multiple threads
>> will work effectively can only be determined by measurement of a
>> multithreaded app.
>>
>> Because your logic seems to indicate that raw memory speed is the
>> limiting factor, and you have not accounted for the effects of a
>> shared L3 cache, any opinion you offer on what is going to happen is
>> meaningless. In fact, any opinion about performance is by definition
>> meaningless; only actual measurements represent facts ("If you can't
>> express it in numbers, it ain't science, it's opinion" -- Robert A.
>> Heinlein)
>
>(1) Machine A performs process B in X minutes.
>(2) Machine C performs process B in X/8 minutes (800% faster).
>(3) The only difference between machine A and machine C is that machine
>C has much faster access to RAM (by whatever means).
>(4) Therefore, process B is memory bandwidth bound.

****
Fred can dig a ditch 10 feet long in 1 hour. Charlie can dig a ditch 10 feet long in 20 minutes. Therefore, Charlie is faster than Fred by a factor of 3.

How long does it take Fred and Charlie working together to dig a ditch 10 feet long? (Hint: any mathematical answer you come up with is wrong, because Fred and Charlie (a) hate each other, and so Charlie tosses his dirt into the place Fred has to dig, or (b) are good buddies and stop for a beer halfway through the digging, or (c) Charlie tells Fred he can do it faster by himself, and Fred just sits there while Charlie does all the work and finishes in 20 minutes, after which they go out for a beer. Fred buys.)

You have made an obvious error here in thinking that if one thread takes 1/k the time and the only difference is memory bandwidth, then two threads necessarily scale LINEARLY. Duh! IT IS NOT THE SAME WHEN CACHES ARE INVOLVED! YOU HAVE NO DATA! You are jumping to an unwarranted conclusion based on what, as best I can tell, is a coincidence. And even if it were true, caches give nonlinear effects, so you are not even making sense when you make these assertions! You have proven a case for value N, but you have immediately assumed that if you prove the case for N, you have proven it for case N+1, which is NOT how inductive proofs work! Since you were so hung up on geometric proofs, can you explain how, when doing an inductive proof, proving the case for 1 element tells you what the result is for N+1 for an arbitrary value N? Hell, it doesn't even tell you the result for N=1, but you have immediately assumed that it is a valid proof for all values of N!

YOU HAVE NO DATA! You are making a flawed assumption of linearity that has no basis!

Going back to your fixation on proof: in a nonlinear system without a closed-form analytic solution, demonstrate to me that your only possible solution is based on a linear assumption. You are ignoring all forms of reality here. You are asserting without basis that the system is linear (it is known that systems with caches are nonlinear in memory performance). So you are contradicting known reality without any evidence to support your "axiom". It ain't an axiom, it's a wild-assed guess.

Until you can demonstrate with actual measured performance that your system exhibits COMPLETELY linear behavior in an L3 cache system, there is no reason to listen to any of this nonsense you keep espousing as if it were "fact". You have ONE fact, and that is not enough to raise your hypothesis to the level of "axiom".

All you have proven is that a single thread is limited by memory bandwidth.
You have no reason to infer that two threads will not BOTH run faster because of the L3 cache effects. And you have ignored L1/L2 cache effects. You have a trivial example from which NOTHING can be inferred about multithreaded performance. You have consistently confused multiprocess programming with multithreading and arrived at erroneous conclusions based on flawed experiments.

Note also that if you use a memory-mapped file and two processes share the same mapping object, there is only one copy of the data in memory! This has not previously come up in discussions, but could be critical to the performance of your multiple processes.
				joe
****
>
>> More below...
>> On Sun, 21 Mar 2010 13:19:34 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:
>>
>>>I have an application that uses enormous amounts of RAM in a very
>>>memory bandwidth intensive way. I recently upgraded my hardware to a
>>>machine with 600% faster RAM and 32-fold more L3 cache. This L3 cache
>>>is also twice as fast as the prior machine's cache. When I benchmarked
>>>my application across the two machines, I gained an 800% improvement
>>>in wall clock time. The new machine's CPU is only 11% faster than the
>>>prior machine. Both processes were tested on a single CPU.
>> ***
>> The question is whether you are measuring multiple threads in a single
>> executable image across multiple cores, or multiple executable images
>> on a single core. Not sure how you know that both processes were
>> tested on a single CPU, since you don't mention how you accomplished
>> this (there are several techniques, but it is important to know which
>> one you used, since each has its own implications for predicting the
>> overall behavior of a system).
>> ****
>>>
>>>I am thinking that all of the above would tend to show that my process
>>>is very memory bandwidth intensive, and thus could not benefit from
>>>multiple threads on the same machine because the bottleneck is memory
>>>bandwidth rather than CPU cycles. Is this analysis correct?
>> ****
>> Nonsense! You have no idea what is going on here! The shared L3 cache
>> could completely wipe out the memory performance issue, reducing your
>> problem to a cache-performance issue. Since you have not conducted the
>> experiment in multiple threading, you have no data to indicate one way
>> or the other what is going on, and it is the particular memory access
>> patterns of YOUR app that matter; therefore, nobody can offer a
>> meaningful estimate based on your L1/L2/L3 cache accesses, whatever
>> they may be.
>> joe
>> ****
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 22 Mar 2010 11:02
"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message news:ioueq5hdsf5ut5pha6ttt88e1ghl4q9l1m(a)4ax.com...

> [...]
>
> You have made an obvious failure here in thinking that if one thread
> takes 1/k the time and the only difference is memory bandwidth, that
> two threads are necessarily LINEAR. Duh! IT IS NOT THE SAME WHEN CACHES
> ARE INVOLVED! YOU HAVE NO DATA! You are jumping to an unwarranted
> conclusion based on what I can at best tell is a coincidence.

(1) People in a more specialized group are coming to the same conclusions that I have derived.

(2) When a process requires essentially random (mostly unpredictable) access to far more memory than can possibly fit into the largest cache, then actual memory access time becomes a much more significant factor in determining actual response time.

> [remainder of Joe's message, quoted in full above, snipped]