Processors stall on OLTP workloads about half the time--almost no matter what you do
From: Rob Warnock on 22 Apr 2010 21:27

Morten Reistad <first(a)last.name> wrote:
+---------------
| Interrupt coalescing; from Linux 1.6.24; gives an order of
| magnitude more i/o performance on networks and similar devices.
+---------------

While your first and last phrases are certainly quite correct,
the middle one gives the impression that Linux invented interrupt
coalescing. Actually, it came *quite* late to that game!! E.g., I
was doing interrupt coalescing in terminal device drivers[1] in
TOPS-10 circa 1972 [and, yes, it *seriously* improved terminal I/O
performance!!]; in Fortune Systems's FOR:OS in 1985; in SGI's Irix
network code circa 1990; etc. And I certainly wasn't the only one.
Linux is reinventing a very old wheel here.

-Rob

[1] DCA "SmartMux" frontends for DEC PDP-10s.

-----
Rob Warnock                     <rpw3(a)rpw3.org>
627 26th Avenue                 <URL:http://rpw3.org/>
San Mateo, CA 94403             (650)572-2607
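For readers unfamiliar with the technique being discussed, a minimal C sketch of interrupt coalescing follows. The device registers, status bit, budget, and function names are invented for illustration; real drivers (Linux's NAPI, the TOPS-10 and Irix code mentioned above, etc.) differ considerably in detail, but the core idea is the same: service many queued items per interrupt instead of one.

```c
/* Minimal sketch of interrupt coalescing for a hypothetical receive device.
 * All register names, bits, and thresholds below are assumptions made for
 * illustration, not any real driver's API. */

#include <stdint.h>

#define RX_READY   0x01          /* hypothetical "data pending" status bit  */
#define RX_BUDGET  64            /* max items drained per interrupt         */

extern volatile uint8_t DEV_STATUS;   /* hypothetical device status register */
extern volatile uint8_t DEV_DATA;     /* hypothetical device data register   */
extern void deliver(uint8_t byte);    /* hand data to the upper layer        */

/* Naive driver: one interrupt per byte/packet.  At high rates the fixed
 * per-interrupt overhead (context save, dispatch, EOI) dominates. */
void isr_naive(void)
{
    deliver(DEV_DATA);
}

/* Coalescing driver: take ONE interrupt, then keep draining everything the
 * device has queued (up to a budget) before returning.  Many items are
 * serviced per interrupt, so the fixed overhead is amortized -- the
 * order-of-magnitude effect described in the post above. */
void isr_coalescing(void)
{
    int drained = 0;
    while ((DEV_STATUS & RX_READY) && drained < RX_BUDGET) {
        deliver(DEV_DATA);
        drained++;
    }
    /* A production driver would typically switch to polled mode if the
     * budget is exhausted, re-enabling the interrupt only once the device
     * queue is empty (this is roughly what NAPI does). */
}
```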
From: Terje Mathisen on 23 Apr 2010 02:03

Rob Warnock wrote:
> Morten Reistad <first(a)last.name> wrote:
> +---------------
> | Interrupt coalescing; from Linux 1.6.24; gives an order of
> | magnitude more i/o performance on networks and similar devices.
> +---------------
>
> While your first and last phrases are certainly quite correct,
> the middle one gives the impression that Linux invented interrupt
> coalescing. Actually, it came *quite* late to that game!! E.g., I
> was doing interrupt coalescing in terminal device drivers[1] in
> TOPS-10 circa 1972 [and, yes, it *seriously* improved terminal I/O

Indeed it did. I did the same for my PC-DOS x86 terminal emulator/file
transfer program around 1983; it allowed a 4.77 MHz 8088 CPU to never
drop bytes and still have enough CPU to do all the escape processing.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
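Terje's setup adds a second ingredient to the coalescing idea: decoupling the fast interrupt path from the slow per-character work via a ring buffer. Here is a rough sketch of that pattern; the buffer size, UART helper functions, and the escape-processing routine are all hypothetical stand-ins (a real PC serial driver would program the 8250 UART directly through its I/O ports).

```c
/* Rough sketch, assuming hypothetical helpers: the receive interrupt does
 * nothing but drain the UART into a ring buffer, while the much slower
 * escape-sequence processing runs in the main loop.  This is how a slow CPU
 * can avoid dropping bytes while still doing full terminal emulation. */

#include <stdint.h>

#define RING_SIZE 1024                    /* power of two, assumed large enough */

static volatile uint8_t  ring[RING_SIZE];
static volatile unsigned head, tail;      /* head advanced by ISR, tail by main */

extern int     uart_data_ready(void);     /* hypothetical: "data ready" status  */
extern uint8_t uart_read_byte(void);      /* hypothetical: read receive register */
extern void    process_escape_or_char(uint8_t c);  /* slow emulation work       */

/* Interrupt handler: stay only long enough to empty the UART, so even a slow
 * CPU never misses a byte at serial line rates. */
void serial_isr(void)
{
    while (uart_data_ready()) {
        ring[head & (RING_SIZE - 1)] = uart_read_byte();
        head++;
    }
    /* On a real PC, acknowledging the interrupt controller would go here. */
}

/* Main loop: consume at leisure.  Escape processing can take many cycles per
 * byte without risking overrun, because the ISR keeps filling the ring. */
void main_loop_poll(void)
{
    while (tail != head) {
        process_escape_or_char(ring[tail & (RING_SIZE - 1)]);
        tail++;
    }
}
```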
From: nmm1 on 23 Apr 2010 03:12

In article <hqqapj$6hq$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>>
>> When I first looked at the "SMT" papers, I thought that it was a
>> neat idea but, as I have just posted, I couldn't follow the explanation
>> and so worked out the mathematics. I then discovered the reason that
>> the throughput comparisons did NOT include an SMT design versus a
>> multi-core one with the same transistor count.
>
> I don't see how you get a multi-core design with the same transistor
> count as a multi-threaded one. I have seen numbers of 5% additional
> logic for a second thread. Mostly you duplicate the registers and add a
> little logic. But with two cores, clearly you get 100% overhead,
> duplicating the registers, the execution units, the L1 caches and all
> the other logic.

See Mitch Alsup's posting. He has explained it far better than I can.

Regards,
Nick Maclaren.
From: "Andy "Krazy" Glew" on 23 Apr 2010 03:51 On 4/21/2010 9:02 AM, Robert Myers wrote: > I have several times referred to a paper by Patterson, et al, circa 1996 > that concludes that most architectural cleverness is useless for OLTP > workloads, but I have been unable to deliver a specific citation. > > It turns out that Patterson is the second author and the technical > report from Berkeley is from 1998: > > www.eecs.berkeley.edu/Pubs/TechRpts/1998/CSD-98-1001.pdf > > Performance Characterization of a Quad Pentium Pro SMP Using OLTP > Workloads by Kimberly Keeton*, David A. Patterson*, Yong Qiang He+, > Roger C. Raphael+, and Walter E. Baker#. > > If Intel management read this report, and I assume it did, it would have > headed in the direction that Andy has lamented: lots of simple cores > without energy-consuming cleverness that doesn't help much, anyway--at > least for certain kinds of workloads. The only thing that really helps > is cache. > > Even though the paper distinguishes between technical and commercial > workloads, and draws its negative conclusion only for commercial > workloads, it was interesting to me that, for instance, Blue Gene went > the same direction--many simple processors--for a technical workload so > as to achieve low power operation. > > Robert. There are two sorts of database workloads: Small simple queries: "Give me the account balance for Joe". Basically walking down the index Btrees. Each transaction spends most of its time waiting for cache misses, if not disk/buffer cache misses. The second class tends to do things like joins. Which often have parts that look like LOOP small partition of left that stays in cache traverse left Btrees LOOP over all right that misses cache traverse Btree for right END LOOP END LOOP for various permutations of the inner and outer loops. For such JOINs, the inner loop may get clogged up wuth the cache misses, but the next iteration of the outer loop is essentially independent. E.g. if you are joining records, and not computing a sum. Even if you are computing a sum, you can parallelize all except the sum dependence. This is the classic example of code that does well with loop based speculative multithreading. Yes, software can do the parallelization too. And it often does. But it still tends to leave large chunks that look like the above. Which SpMT can handle. Amd, by the way, a lot of database people will tell you that simple queries already run fast enough. It's complex queries that they want to speed up. My lament is that folks like Keeton and Patterson, as well as folks like Itanicides, etc., do these comparisons, BUT ONLY TO THE ARCHITECTURES THAT THEY CHOOSE. In some ways it almost looks as if they choose deliberate strawmen that will make their preferences look good. More likely, they are just not aware of work in more advanced microarchitecture. Now, it is fair to say that simple cores beat first generation OOO cores for simple OLTP workloads. But what do the numbers look like for advanced SpMT (Gen2 OOO) microarchitectures, on complex transaction workloads? You'll never know unless you try. So, it's happened again. Just as OOO CPU research was the poor cousin during the early days of the RISC revolution, so SpMT is the poor cousin in the early days of MultiCore and ManyCore. If M&M run out of steam, I hope that Haitham Akkary's SpMT research will be there, still ongoing, to pick up the pieces.
From: Andrew Reilly on 21 Apr 2010 21:56
Hi all,

On Wed, 21 Apr 2010 15:36:42 -0700, MitchAlsup wrote:

> Since this paper was written slightly before the x86 crushed out RISCs
> in their entirety, the modern reality is that technical, commercial, and
> database applications are being held hostage to PC-based thinking. It
> has become just too expensive to target (with more than lip service)
> application domains other than PCs (for non-mobile applications). Thus
> the high end PC chips do not have the memory systems nor interconnects
> that would better serve other workloads and larger footprint server
> systems.

I used to look down on the "PC" computers from the MIPS and SPARC
machines that I was using, back in the early 90s, but it doesn't seem to
me that the memory systems of well-specced PC systems of today leave
anything behind that the good technical workstations of that era had.
The current set of chips pretty much seem to be pin limited, which is
the same situation that anyone trying to do a purpose-designed technical
workstation would have to deal with anyway.

So what is "PC-based thinking", and how is it holding us back? What
could we do differently, in an alternate universe, or with an unlimited
bank balance?

Cheers,

--
Andrew