Processors stall on OLTP workloads about half the time--almost no matter what you do
From: Anne & Lynn Wheeler on 22 Apr 2010 13:40

Robert Myers <rbmyersusa(a)gmail.com> writes:
> I had thought the idea of having lots of threads was precisely to get
> the memory requests out. You start a thread, get some memory requests
> out, and let it stall, because it's going to stall, anyway.
>
> Cache size and bandwidth and memory bandwidth are another matter.

in mid-70s, there was a multithreaded project for the 370/195 (that never shipped). The 370/195 had a 64-instruction pipeline, but no branch prediction or speculative execution ... so common branches stalled the pipeline. Highly tuned codes with some kinds of looping branches within the pipeline could have peak thruput of 10mips ... however, branch stalls in most code tended to hold thruput to five mips.

the objective of the emulated two-processor (double registers, instruction address, etc ... but no additional pipeline or execution units) was to compensate for branch stalls (i.e. instructions, operations, resources in the pipeline would have a one-bit flag for the instruction stream they were associated with). Having a pair of instruction streams running normal code (each peaking at 5mip/sec thruput) ... then had a chance of effectively utilizing/saturating the available 195 resources (10mip aggregate).

however, retrofitting virtual memory to the 370/195 was effectively impossible ... which possibly accounted for it never getting out (the original 370/195 tweaked the 360/195 with the original newly announced 370 features ... but that was before virtual memory was announced). even retrofitting virtual memory to the 370/165 was a very difficult task ... and that difficulty accounted for dropping a lot of features from the original 370 virtual memory architecture.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
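The stall-hiding idea above can be sketched with a toy issue-slot simulation. This is a hypothetical model, not 370/195 timings: each stream issues one instruction per cycle but stalls for STALL cycles after every BRANCH_EVERY instructions, and the constants are chosen only to reproduce the 5-vs-10 mips ratio described in the post. With one stream the pipeline sits idle half the time; a second stream fills the stall cycles.

```c
#define BRANCH_EVERY 4   /* instructions issued between branch stalls  */
#define STALL        4   /* stall cycles per branch (assumed, not 195) */
#define MAX_STREAMS  2

typedef struct {
    long issued;      /* instructions issued by this stream */
    int  stall_left;  /* remaining stall cycles             */
} stream_t;

/* Simulate `cycles` cycles with `n` interleaved instruction streams
 * sharing a single issue slot; return total instructions issued.
 * A stalled stream yields the slot to the next ready stream. */
long run(int n, long cycles) {
    stream_t s[MAX_STREAMS] = {{0, 0}, {0, 0}};
    long total = 0;
    for (long c = 0; c < cycles; c++) {
        int slot_used = 0;
        for (int i = 0; i < n; i++) {
            if (!slot_used && s[i].stall_left == 0) {
                s[i].issued++;          /* this stream gets the slot */
                total++;
                slot_used = 1;
                if (s[i].issued % BRANCH_EVERY == 0)
                    s[i].stall_left = STALL;   /* hit a branch: stall */
            } else if (s[i].stall_left > 0) {
                s[i].stall_left--;      /* stall cycles tick regardless */
            }
        }
    }
    return total;
}
```

With these constants, one stream achieves 0.5 instructions per cycle (the "5 mips" case) and two streams interleave to 1.0 (the "10 mip aggregate" case), because each stream's stall window is exactly covered by the other stream's issue window.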
From: Anne & Lynn Wheeler on 23 Apr 2010 09:32

Robert Myers <rbmyersusa(a)gmail.com> writes:
> This logic always made sense to me, but Nick claims it doesn't work.
> If it doesn't work, it has to be because of pressure on the cache or
> because the thread that stalls is holding a lock that the other thread
> needs.

re:
http://www.garlic.com/~lynn/2010h.html#44 Processors stall on OLTP workloads about half the time--almost no matter what you do
http://www.garlic.com/~lynn/2010h.html#45 Processors stall on OLTP workloads about half the time--almost no matter what you do

multiple processor operation introduces serialization operations that don't exist in purely single-processor operation. this can be as bad as a 20-30 percent overhead increase. in the single-processor case, this can wipe out any expected benefit from running it as an emulated two-processor using processor threads. it isn't as much of a factor if already running multi-processor operation (two or more real processors) and adding emulated additional processors with hardware threads.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
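The break-even arithmetic behind that point can be made explicit. A hypothetical back-of-envelope model (not from the post): if switching to multiprocessor mode costs a fixed fraction of all cycles in serialization overhead, the threading speedup must exceed 1/(1-overhead) before the emulated two-processor beats the plain single processor.

```c
/* Minimum threading speedup needed to overcome a fixed SMP
 * serialization overhead (fraction of cycles lost):
 *   base * speedup * (1 - overhead) > base
 *   =>  speedup > 1 / (1 - overhead)                       */
double breakeven_speedup(double overhead) {
    return 1.0 / (1.0 - overhead);
}

/* Aggregate throughput (in mips) under the same naive model. */
double mp_mips(double base_mips, double thread_speedup, double overhead) {
    return base_mips * thread_speedup * (1.0 - overhead);
}
```

At 25% overhead the break-even speedup is about 1.33x. If the hardware threads deliver the full 2x of the 195 scenario, 5 mips becomes 5 * 2.0 * 0.75 = 7.5 mips, still a win; but a more modest 1.2x threading gain yields 5 * 1.2 * 0.75 = 4.5 mips, worse than leaving the machine in single-processor mode, which is the "wiped out" case.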
From: George Neuner on 28 Apr 2010 14:33

On Tue, 27 Apr 2010 18:08:41 -0700 (PDT), Robert Myers <rbmyersusa(a)gmail.com> wrote:

>On Apr 27, 5:08 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Tue, 27 Apr 2010 12:56:14 -0400, Robert Myers
>
>> I'm not really seeking a discussion on all of this because it will
>> quickly become very technical (and redundant as some of the things
>> have been discussed in comp.compilers). I just wanted more
>> information on what Andy was doing because his description sounded
>> interesting.
>
>I'll probably have a look at what might have been said on
>comp.compilers, but, as to your tone, this list is *not*
>comp.compilers. Comp.arch has had long dry spells. At least people
>are talking. If you need a place to be pompous, I suggest you choose
>a moderated list where you are a part of the moderator's club.
>
>Robert.

I apologize for my choice of words. I didn't mean to be pompous or for there to be any tone wrt the forum ... I really only meant to convey that a discussion would be off-topic here.

George
From: George Neuner on 28 Apr 2010 15:36

On Tue, 27 Apr 2010 18:08:41 -0700 (PDT), Robert Myers <rbmyersusa(a)gmail.com> wrote:

>Most of the work I'm aware of is aimed at identifying those execution
>paths that can be speculatively executed to speed up garden variety
>computation with what were at the time standard test cases (gcc, bzip,
>etc.). The speculative paths are set up by the compiler without
>programmer intervention, other than making required profiling runs.

Yes, I've seen some of that work. The original lines of research pretty much dried up with the general adoption of hardware branch speculation (the ability to conditionally execute instructions while waiting for a branch condition to resolve, and to abort the path if the branch goes against it).

What remains mostly is research into ways of recognizing repetitious patterns of data access in linked data structures (lists, trees, graphs, tries, etc.) and automatically prefetching data in advance of its use. I haven't followed this research too closely, but my impression is that it remains a hard problem.

George
From: George Neuner on 29 Apr 2010 15:37
On Wed, 28 Apr 2010 17:35:32 -0700 (PDT), Robert Myers <rbmyersusa(a)gmail.com> wrote:

>On Apr 28, 3:36 pm, George Neuner <gneun...(a)comcast.net> wrote:
>
>> What remains mostly is research into ways of recognizing repetitious
>> patterns of data access in linked data structures (lists, trees,
>> graphs, tries, etc.) and automatically prefetching data in advance of
>> its use. I haven't followed this research too closely, but my
>> impression is that it remains a hard problem.
>
>I suspect that explains a mysterious private email I got while
>publicly discussing Itanium and profile-directed optimization. The
>email claimed that a well-known compiler developer that he worked for
>had found means to predict irregular data access from static analysis
>so that the compiler could supply prefetch hints even for an irregular
>memory stride.

Interesting ... if it's true, it's the first I heard about it. I've read about success prefetching in lists and in search trees (although prefetching N-way trees with large N creates a cache pollution problem), but AFAIK prefetching in more general graph structures has eluded a practical solution.

George
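The list case mentioned above can be sketched concretely. A minimal example of greedy one-node-lookahead software prefetching during a list traversal, assuming GCC/Clang (`__builtin_prefetch` is their builtin, and it never faults, so handing it a possibly NULL address is safe). It also illustrates why pointer chasing is hard to prefetch far ahead: the address of node k+2 is only known once node k+1 is already in cache.

```c
#include <stdlib.h>

typedef struct node {
    long value;
    struct node *next;
} node;

/* Traverse and sum the list, prefetching one node ahead of use:
 * while working on node p, request p->next->next so it is (with
 * luck) in cache by the time the traversal reaches it. */
long sum_with_prefetch(const node *p) {
    long total = 0;
    while (p) {
        if (p->next)  /* args: address, rw=0 (read), locality=1 (low) */
            __builtin_prefetch(p->next->next, 0, 1);
        total += p->value;
        p = p->next;
    }
    return total;
}

/* Demo helper (hypothetical, for testing only): build the list
 * 1 -> 2 -> ... -> n by pushing nodes onto the front. */
node *make_list(int n) {
    node *head = NULL;
    for (int i = n; i >= 1; i--) {
        node *q = malloc(sizeof *q);
        q->value = i;
        q->next = head;
        head = q;
    }
    return head;
}
```

For a regular array stride the hardware prefetcher would do this automatically; the whole research problem Neuner describes is that for trees and general graphs there is no single "next" pointer to chase, and fanning the prefetch out across N children trades latency for the cache pollution noted above.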