From: Peter Duniho on
Roedy Green wrote:
> On Sun, 17 Jan 2010 20:56:29 -0800, Peter Duniho
> <NpOeStPeAdM(a)NnOwSlPiAnMk.com> wrote, quoted or indirectly quoted
> someone who said :
>
>> Profiling is definitely important for performance-critical code. It can
>> uncover lots of important architecture-independent problems. But it has
>> limited value in generalizing solutions for architecture-specific
>> issues. Only if you can restrict your installation to the same hardware
>> you used for profiling can you address those kinds of problems.
>
> I would have thought by now distributed code would be optimised at the
> customer's machine to suit the specific hardware, not by the
> application, but by the OS using code provided by the CPU maker.

I don't know what you're trying to suggest here.

Some application developers do in fact compile multiple versions of a
program, targeting different architectures, taking advantage of
architecture-specific compiler optimizations.

An even smaller number will hand-optimize multiple versions according
to architecture.

Then when the application is deployed, they check the architecture and
install the best match.
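
For illustration, a minimal sketch of that deployment/load-time selection
(the system properties below are standard Java; the per-architecture
library naming scheme is invented for this example):

    // Hypothetical: pick a per-architecture native library at load time.
    // "os.name" and "os.arch" are standard system properties; the
    // "mylib-<os>-<arch>" naming convention is made up for illustration.
    public class NativeLoader {
        public static void loadBest() {
            String os = System.getProperty("os.name").toLowerCase();
            String arch = System.getProperty("os.arch").toLowerCase();
            String lib = "mylib-" + (os.contains("win") ? "windows" : "linux")
                    + "-" + arch;    // e.g. "mylib-windows-amd64"
            System.loadLibrary(lib); // UnsatisfiedLinkError if no such variant
        }
    }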

But a) this doesn't happen all that often, especially now that we have
so many more desktop architectures to deal with, and b) the code is
still optimized by the developer using profiling and other techniques on
their own reference hardware, not on the customer's machine.

Inasmuch as applications often spend a lot of their computational cycles
in OS code, as long as the OS has been optimized for specific hardware
(and again, this is far from a broadly-applied technique; most OS
components will only have non-architecture-dependent optimizations),
then sure, an application might find itself executing some
architecture-specific-optimized code found in the OS. But the OS isn't
going to rewrite the application code itself or optimize it in any way.

> Presumably you could afford to spend more time in analysis than you
> can on the fly in hardware while the code is running.

Still not sure what you're talking about. The OS definitely can do some
application-specific optimizations, such as Prefetch in Vista and
Windows 7, run-time management of caching and virtual memory, that sort
of thing. But the OS isn't analyzing the application code itself
and changing it so that it's specifically optimized for the current
architecture.

Note: all of the above is with respect to native code. As was my
original point, with a platform like Java or .NET, the platform (which
may or may not be part of the OS) _can_ in fact optimize per specific
architecture, specifically because the application code isn't actually
compiled for the platform, but rather a virtual machine. The VM then
has the opportunity to do architecture-specific optimizations in the
application itself. But this only works because the application code
hasn't yet even been compiled for the specific architecture, never mind
optimized for any specific architecture (except of course for the
virtual machine's "architecture").

And that's only theoretically possible. I've never heard any
suggestions that Java actually does include architecture-specific
optimizations, either in the JVM itself, or as part of the optimizer in
the JIT compiler.
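
Either way, the "compiled only for the virtual machine's architecture"
part is easy to see concretely. A small sketch (the javap output shown is
the typical result for a method like this, though exact output can vary
with compiler version):

    // Architecture-neutral by construction: javac targets the JVM's stack
    // machine, not x86/ARM/whatever the host happens to be.
    class Adder {
        int add(int a, int b) {
            return a + b;
        }
    }
    // "javap -c Adder" shows roughly (slot 0 is 'this' for an instance method):
    //   iload_1    // push a
    //   iload_2    // push b
    //   iadd       // add the two ints on top of the stack
    //   ireturn    // return the int result
    // Only a JIT compiler, running on the target machine, turns this into
    // native instructions.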

No doubt there are some esoteric systems out there that do optimize
already-compiled native code on the fly. But they aren't part of the
mainstream systems we use, and I don't think it's realistic to think
they will become part of them any time in the near future.

Pete
From: Patricia Shanahan on
Lew wrote:
> Patricia Shanahan wrote:
>> Roedy Green wrote:
>> ...
>>> This suggests that interpretive code with a tight core might run
>>> faster than "highly optimised" machine code since you could arrange
>>> that the core of it was entirely in cache.
>> ...
>>
>> How would you implement an interpreter to avoid executing a totally
>> unpredictable branch for each instruction?
>
> This apparently rhetorical question leads to some interesting
> possibilities, e.g., the exploitation of latency. There is likely a
> tension between these possibilities and cache-locality, however since
> cache is a hack we can expect its limits to be less restrictive over
> time. Latency, OTOH, is likely to become a greater and greater issue.
> Hyperthreading is one technique that exploits latency.
>
> An answer to the question is to load all possible branches into the
> pipeline during the latency (-ies) involved in evaluating the "if" or
> other actions. (There is no such thing as a "totally unpredictable
> branch" as all branches can be predicted.) If the conclusion of the
> branch evaluation finds all, or at least all the most likely options
> already loaded up, the system can simply discard the unused branches.
> This term goes by various names; I believe one is "speculative execution".
....

True, all branches can be predicted. The question is whether they can be
predicted correctly often enough to avoid spending most of the
processor's resources on pipeline stalls.

Speculative execution has limited benefits in the interpreter situation.
Suppose we limit the issue to the code for the 5 most frequently used
bytecode operations. We fetch a branch based on the opcode. After seeing
the branch enter the pipeline, we start speculatively fetching the 5
most frequent successors into 5 parallel pipelines, and feeding their
instructions through the pipeline stages.

A few cycles later we fetch the end of the code for one of them, but the
original branch has still not executed. Now we have a total of 9 places
to fetch code from: the continuations of the 4 original alternatives
that have not yet been completely fetched, plus the 5 most frequent
successors of the one that has been completely fetched.

The fan-out is exponential in the number of bytecode operations that are
to be executed in a time equal to the number of pipeline stages: with 5
alternatives per dispatch, just three dispatches in flight already mean
5^3 = 125 places to fetch from. If that number of operations is small,
we are running very slowly, and compiled code would probably have won,
even with a bigger instruction cache footprint.

Speculative execution depends on there being only a small number of
alternatives for each hard-to-predict branch, and on those branches
usually being separated by runs of code, containing only predictable
branches, that are at least as long as the pipeline is deep. A tight
loop containing an interpreted opcode-based switch is the exact opposite.
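
For reference, a minimal sketch of the kind of dispatch loop under
discussion (the opcode set is invented; no real JVM looks like this). The
switch at the top of the loop is the per-instruction, data-dependent
branch whose target the predictor has to guess:

    // Tiny switch-threaded interpreter sketch with a hypothetical opcode set.
    class TinyInterp {
        static int run(byte[] code, int[] stack) {
            int pc = 0, sp = 0;
            while (true) {
                switch (code[pc++]) {    // the hard-to-predict multi-way branch
                    case 0: stack[sp++] = code[pc++]; break;          // PUSH_CONST
                    case 1: sp--; stack[sp - 1] += stack[sp]; break;  // ADD
                    case 2: sp--; stack[sp - 1] *= stack[sp]; break;  // MUL
                    case 3: pc = code[pc]; break;                     // JUMP (absolute)
                    case 4: return stack[--sp];                       // RETURN
                    default: throw new IllegalStateException(
                            "bad opcode " + code[pc - 1]);
                }
            }
        }
    }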

Patricia
From: Patricia Shanahan on
Peter Duniho wrote:
....
> Note: all of the above is with respect to native code. As was my
> original point, with a platform like Java or .NET, the platform (which
> may or may not be part of the OS) _can_ in fact optimize per specific
> architecture, specifically because the application code isn't actually
> compiled for the platform, but rather a virtual machine. The VM then
> has the opportunity to do architecture-specific optimizations in the
> application itself. But this only works because the application code
> hasn't yet even been compiled for the specific architecture, never mind
> optimized for any specific architecture (except of course for the
> virtual machine's "architecture").
....

A JVM can go beyond that. Some of its optimization decisions can be the
result of what is happening on this particular run, and may differ from
the decisions it would have made with different inputs, or in a
different environment.
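
One easy way to watch that happening (this assumes a HotSpot JVM;
-XX:+PrintCompilation is a real HotSpot flag, though its output format is
informal) is to see which methods get JIT-compiled, and when, during an
actual run:

    // Run with:  java -XX:+PrintCompilation Hot [iterations]
    // Whether and when work() gets compiled depends on what this particular
    // run does -- the decisions come from live profiling, not static analysis.
    public class Hot {
        static long work(long n) {
            long sum = 0;
            for (long i = 0; i < n; i++) {
                sum += i * 31;
            }
            return sum;
        }

        public static void main(String[] args) {
            long n = args.length > 0 ? Long.parseLong(args[0]) : 10_000_000L;
            System.out.println(work(n));
        }
    }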

Patricia
From: Martin Gregorie on
On Mon, 18 Jan 2010 21:01:44 -0800, Roedy Green wrote:

> This suggests the CPU makers should make simpler CPUs, and turn the real
> estate over to a bigger cache, or focus all the smarts in the CPU on
> carrying on while a load is stalled.
>
Sounds like a return to RISC to me. Time to revisit the Motorola 88000
chipset, or at least its cache handling?


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
From: Thomas Pornin on
According to Lew <noone(a)lewscanon.com>:
> or even the 8MB Level 1 cache of the not-distant future

This is a bold prediction.

My current PC, bought in January 2009, has 32 KB of "fast RAM", where
fast RAM is the piece of RAM for which accesses are mostly on par with
computation speed (that's L1 cache). My 1984 home computer, 25 years
before that, also had 32 KB of fast RAM. From my point of view, the
amount of fast RAM is about constant, and in the not-distant future
computers will still have 32 KB of fast RAM, not much more.

What may change in the future is the number of concurrent execution
units, i.e. "cores". My PC has four cores, while my 1984 home computer
only had one. The increase in parallelism has been much slower than was
originally expected; in the late 80's it was often predicted that by the
year 2000 a typical computer would have at least a dozen CPUs, possibly
many more (remember the Transputer!). To some extent, the "all memory is
shared" multi-threading model proved to be a powerful moderator,
prompting CPU makers instead to invest billions of dollars into making
single-threaded CPUs faster. The multi-core model is becoming more
common now only because CPU makers have run out of ideas about how to
make single-core CPUs faster.

So my own "prediction" goes thus: right now, the most important factor
in application performance is cache locality. In the future, the most
important factor will be inter-thread communication: applications which
can limit the amount of data to be exchanged between threads will be
able to spread across massively multi-core systems, possibly even clusters,
which is where performance will lie.
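
One common shape for that, as a small sketch using only the standard
java.util.concurrent classes: each task works on its own slice of the
data, so the only inter-thread traffic is one partial result per task.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: partition the array so each task reads only its own slice; the
    // threads exchange nothing but one long apiece at the end.
    public class PartialSums {
        public static long sum(long[] data, int tasks) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(tasks);
            try {
                int chunk = (data.length + tasks - 1) / tasks;
                List<Future<Long>> parts = new ArrayList<>();
                for (int t = 0; t < tasks; t++) {
                    final int from = Math.min(data.length, t * chunk);
                    final int to = Math.min(data.length, from + chunk);
                    parts.add(pool.submit(() -> {
                        long s = 0;
                        for (int i = from; i < to; i++) s += data[i];
                        return s;
                    }));
                }
                long total = 0;
                for (Future<Long> f : parts) total += f.get();
                return total;
            } finally {
                pool.shutdown();
            }
        }
    }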


--Thomas Pornin