From: Stefan Monnier on 12 Dec 2006 10:36

> I remember when exactly the same argument was used to claim that all
> serious HPC programs should be coded in assembler, because relying on
> the compiler's optimisation was the wrong way to proceed :-)

I'm all for (very) high level languages and idioms.  And they should come
with (optional) annotations about what kind of optimization/performance is
expected from the code.  It is important that those annotations do not
change the semantics of the code itself.

> In the case of simple array operations, a run-time system is likely to
> code-generate better than a programmer.  Where it falls down is in
> deciding how to distribute the array - and that is an unsolved problem,
> whether it be done manually or automatically, despite many attempts
> at systematising the issue.

Indeed.  Which is why I think the only good way to go about it is to help
the programmer understand the resulting performance, so she can tweak the
distribution in the right direction (if there is such a thing).  That
requires control of the distribution, but also requires relating the code's
performance to its source, so that she can decide which part to change.

Maybe instead of source-code annotations, the thing I'm looking for is a
source-level debugger, where the "bugs" I'm after are performance bugs and
the tool helps me relate them to the source code (I just described a
profiler, didn't I, hmm).  Together with the equivalent of `assert' but
where the assertion is about the program's performance.


        Stefan


PS: I know I'm not making much sense, sorry.  These ideas aren't even
half-baked.
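A minimal sketch (not from the thread) of what Stefan's performance `assert'
might look like in plain C.  The PERF_ASSERT macro is hypothetical; it assumes
POSIX clock_gettime is available, and it only reports a "performance bug" -
it never changes what the wrapped code computes, which is the property Stefan
asks for in the annotations above.

/* Hypothetical sketch: a timing "assert" that reports a performance bug
   but never changes the semantics of the wrapped code. */
#include <stdio.h>
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* PERF_ASSERT(budget_seconds, statement): run the statement, then complain
   (to stderr, pointing back at the source line) if it blew its budget. */
#define PERF_ASSERT(budget, stmt)                                        \
    do {                                                                 \
        double t0_ = now_seconds();                                      \
        stmt;                                                            \
        double dt_ = now_seconds() - t0_;                                \
        if (dt_ > (budget))                                              \
            fprintf(stderr, "%s:%d: performance bug: %.3fs > %.3fs\n",   \
                    __FILE__, __LINE__, dt_, (double)(budget));          \
    } while (0)

int main(void)
{
    double sum = 0.0;
    /* The annotation states an expectation; the loop's result is the same
       whether or not the expectation holds. */
    PERF_ASSERT(0.5, {
        for (long i = 0; i < 100000000L; i++)
            sum += (double)i;
    });
    printf("sum = %g\n", sum);
    return 0;
}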
From: ChrisQuayle on 12 Dec 2006 11:18

Eugene Miya wrote:
> In article <el1suk$4co$1(a)gemini.csx.cam.ac.uk>,
> Nick Maclaren <nmm1(a)cus.cam.ac.uk> wrote:
>
>>It may have been implemented since, but constraints have changed.
>>X remains a system killer, even under Unix, and Microsoft's clone
>>of Presentation Manager is no better (well, PM itself wasn't much
>>better).
>
> Huh.  What's this got to do with parallelism?
> Athena was a loser system and IBM's "help" didn't.
> That was fairly evident at MIT even at the time.

I would have thought graphics work was an ideal match for parallel
processing - modifying or otherwise manipulating on-screen images, etc.
X is a system killer, though, not because of bad design (it is quite
elegant) but because windowing systems of any kind are very compute
intensive and need lots of memory for the higher resolutions.

Having programmed X and written a simple windowing gui for embedded work,
I speak from experience.  The amount of code that needs to be executed
just to get a window onscreen is quite substantial.  Ok, only a few
calls at the api level, but the devil is all in the internals...

Chris
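For illustration only (not part of Chris's post), the "few calls at the api
level" might look roughly like the minimal Xlib fragment below; everything
behind XCreateSimpleWindow and XMapWindow is where the internal work he
describes actually happens.

/* Minimal Xlib sketch: roughly the API-level calls needed to get an empty
   window onscreen.  Build with: cc demo.c -lX11 */
#include <X11/Xlib.h>
#include <stdio.h>

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);      /* connect to the X server   */
    if (!dpy) {
        fprintf(stderr, "cannot open display\n");
        return 1;
    }
    int screen = DefaultScreen(dpy);
    Window win = XCreateSimpleWindow(dpy, RootWindow(dpy, screen),
                                     10, 10, 320, 240, 1,
                                     BlackPixel(dpy, screen),
                                     WhitePixel(dpy, screen));
    XSelectInput(dpy, win, ExposureMask | KeyPressMask);
    XMapWindow(dpy, win);                   /* ask the server to show it */

    for (;;) {                              /* minimal event loop        */
        XEvent ev;
        XNextEvent(dpy, &ev);
        if (ev.type == KeyPress)
            break;
    }
    XCloseDisplay(dpy);
    return 0;
}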
From: Eugene Miya on 12 Dec 2006 11:23

In article <yeadnXYnjLbPXuDYnZ2dnUVZ_tyinZ2d(a)metrocastcablevision.com>,
Bill Todd <billtodd(a)metrocast.net> wrote:
>Eugene Miya wrote:
>> In article <_oOdneG2v-ejCODYnZ2dnUVZ_rOqnZ2d(a)metrocastcablevision.com>,
>> Bill Todd <billtodd(a)metrocast.net> wrote:
>>> Del Cecchi wrote:
>>>> And Threads?  Aren't
>>>> they just parallel sugar on a serial mechanism?
>>> Not when each is closely associated with a separate hardware execution
>>> context.
>>
>> Threads are just lightweight processes.
>
>Irrelevant to what you were purportedly responding to.
>
>> Most people don't see the baggage which gets copied when an OS like Unix
>> forks(2).  And that fork(2) is light weight compared to the old style VMS
>> spawn and the IBM equivalents.
>
>Also irrelevant to what you were purportedly responding to.
>
>When (as I said, but you seem to have ignored) each thread is closely
>associated with a *separate* hardware execution context, it's simply the
>software vehicle for using that execution context in parallel with other
>execution contexts.

It's completely relevant.  What do you think hardware context is?

>>> And when multiple threads are used on a single hardware
>>> execution context to avoid explicitly asynchronous processing (e.g., to
>>> let the processor keep doing useful work on something else while one
>>> logical thread of execution is waiting for something to happen - without
>>> disturbing that logical serial process flow), that seems more like
>>> serial sugar on a parallel mechanism to me.
>>
>> Distributed memory or shared memory?
>
>Are you completely out to lunch today?  Try reading what I said again.

I did.  And you've never used parallel machines?
What do you think context is, chopped liver?

>>> Until individual processors stop being serial in the nature of the way
>>> they execute code, I'm not sure how feasible getting rid of ideas like
>>> 'threads' will be (at least at some level, though I've never
>>> particularly liked the often somewhat inefficient use of them to avoid
>>> explicit asynchrony).
>>
>> What's their nature?
>
>To execute a serial stream of instructions, modulo the explicit
>disruptions of branches and subroutines and hardware interrupt
>facilities (which themselves simply suspend one serial thread of
>execution for another).  At least if one is talking about 99.99+% of the
>processors in use today (and is willing to call SMT cores multiple
>processors in this regard, which given the context is hardly
>unreasonable).  The fact that they may take advantage of peephole
>optimization to reorder some of the execution is essentially under the
>covers: the paradigm which they present to the outside world is serial
>in nature, and constructs like software threads follow fairly directly
>from it.

Do you know anything at all about program counters, data flow, and
operating systems?

--
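As a present-day aside (not from either poster) on the "thread per separate
hardware execution context" point: on Linux with glibc, binding software
threads to distinct hardware contexts is explicit and looks roughly like the
sketch below.  pthread_attr_setaffinity_np is a non-portable GNU extension;
the work() body is a placeholder.

/* Sketch: one software thread pinned to each hardware execution context.
   Build with: cc -pthread demo.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *work(void *arg)
{
    long cpu = (long)arg;
    printf("thread bound to hardware context %ld\n", cpu);
    /* ... do this context's share of the work here ... */
    return NULL;
}

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);  /* available contexts */
    pthread_t tid[64];
    if (ncpu > 64) ncpu = 64;

    for (long c = 0; c < ncpu; c++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)c, &set);                  /* pin thread c to CPU c */
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[c], &attr, work, (void *)c);
        pthread_attr_destroy(&attr);
    }
    for (long c = 0; c < ncpu; c++)
        pthread_join(tid[c], NULL);
    return 0;
}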
From: Eugene Miya on 12 Dec 2006 12:14

In article <jwvk60xzuy7.fsf-monnier+comp.arch(a)gnu.org>,
Stefan Monnier <monnier(a)iro.umontreal.ca> wrote:
>>>> PARALLEL-FOR(20%) i = 1 TO 50 WITH DO
>>>>    dosomething with i
>>>> DONE
>
>> What's 20%?
>
>The expected efficiency.

Yeah Stefan, but how do you determine that?
Can I get more than 100%, for instance?  Can I get 200%?

>> As the Cray and subsequent guys have learned:
>> you are assuming, for instance, no interactions of i on the LHS with
>> i-1 on the RHS.
>
>Not at all.  All the annotation here is saying is "I expect this code to
>have at least 20% efficiency", so if inter-iteration dependencies prevent
>such efficiency, it's a bug that should be reported.
>
>This is just one random example thrown in.  Other things would be to make
>inter-iteration dependencies explicit, so the compiler would only have to
>check them rather than infer them.  After all, the programmer has to be
>aware of them to get good performance anyway, so let him write down what he
>knows so it can be sanity-checked.

You mean like a C$dir?

>> A couple of decades ago, Dave Kuck detailed a survey of all the problems
>> needed in parallel software as an opening session of an ICPP.
>> Unfortunately that paper is hard to find (it's like 1975 + or minus a year
>> or 2).
>
>Where could it be found (my library doesn't seem to carry such things)?

%A David J. Kuck
%T Parallel processor architecture -- a survey
%J Proceedings of the Sagamore Computer Conference (3rd ICPP'75)
%I IEEE and Springer-Verlag
%C NY
%D August 1975
%P 15-39

While not current, he's quite general enough for the basic problems
(language independent).  Other decent surveys exist, but I think all are
inadequate to the job because the problem is still a constrained one.
So ILL (inter-library loan) is likely needed.  The last copy I saw of this
was at Stanford, but clearly UIUC, where Kuck taught, as well as other
places should have it.

>> So you are about 1974 compiler non-UIUC technology.
>
>That wouldn't surprise me, although I feel like we haven't made much (if
>any) progress in this area.

The progress is constrained.
Like Terman said: No number of 6 foot jumpers equals a 7 foot jumper.

>> I'm not certain how compilers estimate efficiency.  It's barely
>> recognized in the community ("cycles for free").
>
>Indeed, and I believe this is the problem.

I think we have to get it running, however inefficiently, before we can
optimize.  We are barely capable of doing that.
Better problem surveys like the old reviews by Jones:

%A Anita K. Jones
%A Edward F. Gehringer, eds.
%T The Cm* Multiprocessor Project: A Research Review
%R CMU-CS-80-131
%I Department of Computer Science, Carnegie-Mellon University
%C Pittsburgh, PA
%D July 1980
%K bhibbard
%X Detailed discussion of the Cm* hardware, microcode for Kmaps,
performance evaluation, and an overview of Medusa and StarOS.
%X I used this entry for verification with commercial library database
services: NASA RECON, DIALOG using NTIS, INSPEC, COMPENDIX, Comp. DB.
It was in RECON and NTIS but lacking in all the others.
I selected this report because it had good content, significance, and was
not published in a journal (too long).  It represented a representative
technical report (things normally not found in libraries).
RECON had (additionally):
  Keywords: algorithms, computer techniques, multiprocessing (computers),
  computer programs, project management, system engineering
NTIS had (additionally):
  NTIS Price: PC A10 MF A01
  Announcement GRAI8315, Language is English
  Section heading: 9B electronics and electrical engineering - computers,
  62B computers, control, and information theory - computer software,
  NTISDODXA, multiprocessors, machine coding, computer programs,
  reliability (electronics), computer applications (twice, a typo),
  algorithms, system engineering, comparison
Points out in the abstract it was an operational 50-processor experimental
system, a disk controller existed, with two operating systems, StarOS and
Medusa, etc. (abstract only summarized to avoid copyright).

%A Anita K. Jones
%A Peter Schwarz
%i DCS, CMU, Pittsburgh, PA
%T Experience Using Multiprocessor Systems: A Status Report
%J Computing Surveys
%V 12
%N 2
%D June 1980
%P 121-165
%r CMU-CS-79-146
%d oct. '79
%K bsatya, miscellaneous topics in multiprocessing, multiprocessors,
parallel solutions, physical and logical distribution, resource scheduling,
reliability, synchronization
CR Categories: 1.3, 4.0, 4.3, 4.6, 6.20
maeder bib: concepts of parallelism, parallel algorithms, parallel
implementations, applications and implications for parallel processing
%X Overview of applications and their parallel solutions, discusses
C.mmp/HYDRA, Cm*/StarOS, PLURIBUS: performance, reliability,
synchronization.  Some discussion in conclusion on the ideal multiprocessor.

%A S. Fuller
%A A. Jones
%A I. Durham
%T CMU Cm* Review
%R AD-A050135
%I Department of Computer Science, Carnegie-Mellon University
%C Pittsburgh, PA
%D 1980
%K Ricase,

Before you dismiss these as old, you should realize how bright these people
were, and they got dismissed by supercomputing peers of their time.  These
people moved on to powerful positions after leaving CMU, so they knew what
they were doing.

--
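To make the "expected efficiency" exchange above concrete (a sketch of mine,
not Stefan's or Eugene's proposal): with OpenMP one can at least measure the
achieved parallel efficiency at run time and flag the bug Stefan wants
reported; whether it can exceed 100% is then an empirical question, since
superlinear speedups from cache effects do occur.  The 20% threshold and the
toy reduction loop below are placeholders.

/* Run-time check for the "PARALLEL-FOR(20%)" idea: measure serial vs.
   parallel time and complain if efficiency falls below the annotated
   threshold.  Requires OpenMP (cc -fopenmp demo.c). */
#include <omp.h>
#include <stdio.h>

#define N 50000000L

static double timed_sum(int nthreads, double *elapsed)
{
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += (double)i * 1e-9;
    *elapsed = omp_get_wtime() - t0;
    return sum;
}

int main(void)
{
    double t1, tp;
    int p = omp_get_max_threads();
    double expected_efficiency = 0.20;          /* the "20%" annotation */

    double s1 = timed_sum(1, &t1);              /* serial reference     */
    double sp = timed_sum(p, &tp);              /* parallel run         */

    double efficiency = t1 / (tp * p);          /* speedup / threads    */
    printf("threads=%d  speedup=%.2f  efficiency=%.0f%%  (check: %g %g)\n",
           p, t1 / tp, 100.0 * efficiency, s1, sp);
    if (efficiency < expected_efficiency)
        fprintf(stderr, "performance bug: efficiency below annotated %.0f%%\n",
                100.0 * expected_efficiency);
    return 0;
}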
From: Eugene Miya on 12 Dec 2006 12:19
In article <jwvwt4xyaot.fsf-monnier+comp.arch(a)gnu.org>,
Stefan Monnier <monnier(a)iro.umontreal.ca> wrote:
>>> My point is that this is the exactly wrong way to go about it.  Rather than
>>> hope the compiler will do the right thing, you should be able to write the

which is?

>>> code in such a way that the compiler understands that it is expected to
>>> parallelize the loop in a particular way (or better, if that can be defined
>>> "objectively") and that it's a bug in the source code if it can't.

OK how?

>> The point was, loops are not good things to parallelize.

OK, use what?

>Then read "code" where I wrote "loop".

OK, describe first which language, and what syntax and semantics, and
explain side-effects from order of evaluation at the operation level to
issues of shared memory (or not)?  Message passing or separation of data
structures involved?  Etc.

--
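For what it's worth (my illustration, not from the thread), the i-vs-i-1
interaction Eugene raised in the earlier post is easy to show in a few lines:
the first loop below can be split across iterations, the second cannot, and
that is the property a compiler or an explicit annotation has to establish
before any of the distribution and memory-model questions above even arise.

/* Illustration of a loop-carried dependency: i on the LHS interacting
   with i-1 on the RHS. */
#include <stddef.h>

void independent(double *a, const double *b, size_t n)
{
    /* Each iteration touches only its own a[i]: iterations can run in any
       order, on any mix of processors, and the result is the same. */
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

void carried(double *a, size_t n)
{
    /* a[i] depends on a[i-1] computed in the previous iteration, so the
       iterations cannot simply be dealt out to separate contexts; the
       order of evaluation is part of the meaning of the loop. */
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}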