hardware prefeth, branch predictor in the context of multitasking [Computer Architecture]

Prev: Online Exams for Certification, Free Practice Exams, Study Material, Dumps
Next: Motherboard unusuable because of 1 millimeter of missing plastic ?!?!?!?!

From: Quadibloc on 9 Nov 2009 14:30

On Nov 9, 12:24 pm, n...(a)cam.ac.uk wrote:

> In particular, when you are dealing with programs limited by memory
> latency, as so many area, there has been very little increase over
> the years. 15%, if one is feeling generous, less if not.

What with the huge on-chip caches on today's chips, I suppose it
depends on what one means by "limited by memory latency". To actually
be limited by the latency of external DRAM, a program would have to
try rather harder these days, so that has to count as some sort of an
improvement as well.

John Savard

From: "Andy "Krazy" Glew" on 9 Nov 2009 21:57

Terje Mathisen wrote:
> I'm somewhat in love with the idea of a multi-level table:
>
> A large shared but simple (2-bit counter or similar) augmented with a
> small exception table per core which only stores branch info for
> branches that have missed in the large table.
>
> Is this even feasible? :-)

Yes.

But when I submitted papers on such designs, 1996-2000, they got rejected.

Daniel Jimenez eventually broke through, with the second level table
being a great big neural net predictor.

From: Daniel A. Jimenez on 9 Nov 2009 23:00

In article <4AF8D67F.6060604(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>Terje Mathisen wrote:
>> I'm somewhat in love with the idea of a multi-level table:
>>
>> A large shared but simple (2-bit counter or similar) augmented with a
>> small exception table per core which only stores branch info for
>> branches that have missed in the large table.
>>
>> Is this even feasible? :-)
>
>Yes.
>
>But when I submitted papers on such designs, 1996-2000, they got rejected.
>
>Daniel Jimenez eventually broke through, with the second level table
>being a great big neural net predictor.

I'd love to take the credit for a multi-level branch predictor, but I
think my contribution was to write up the multi-level idea that had been
around for a while and use it for crazy neural network schemes. (My first ,
idea was to use caching for branch predictor tables. Turns out, nope,
that's a terrible idea, there's no locality in those tables of counters.)
EV6 had what I think of as a multi-level branch predictor: a first level
cache line predictor that's quick and dumb, and a second-level hybrid
predictor that's slow and smart, and can override the line predictor.
I'm sure people are working at doing this for sharing prediction resources
between cores or thread contexts. Dean Tullsen had an interesting paper in
last year's ASPLOS about doing more accurate branch prediction for short
threads, and the TRIPS folks have been working on a hierarchical predictor
for their TFlex microarchitecture.
--
Daniel Jimenez djimenez(a)cs.utexas.edu
"I've so much music in my head" -- Maurice Ravel, shortly before his death.
" " -- John Cage

From: "Andy "Krazy" Glew" on 11 Nov 2009 00:08

Daniel A. Jimenez wrote:
> In article <4AF8D67F.6060604(a)patten-glew.net>,
> Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>> Terje Mathisen wrote:
>>> I'm somewhat in love with the idea of a multi-level table:
>>>
>>> A large shared but simple (2-bit counter or similar) augmented with a
>>> small exception table per core which only stores branch info for
>>> branches that have missed in the large table.
>>>
>>> Is this even feasible? :-)
>> Yes.
>>
>> But when I submitted papers on such designs, 1996-2000, they got rejected.
>>
>> Daniel Jimenez eventually broke through, with the second level table
>> being a great big neural net predictor.
>
> I'd love to take the credit for a multi-level branch predictor, but I
> think my contribution was to write up the multi-level idea that had been
> around for a while and use it for crazy neural network schemes. (My first ,
> idea was to use caching for branch predictor tables. Turns out, nope,
> that's a terrible idea, there's no locality in those tables of counters.)

1) You can make the BTB into a multilevel table. There *is* locality there.

2) You can make the pattern tables multilevel. But I found that it was
not efficient to make the L1 into a cache - because cache implies cache
tags. My version just had the L1 being a small fast dumb predictor, and
the L2 being a big, slow, smarter and more complex predictor. I
simulated all sorts of combinations, such as an L1 gshare with an L2 a
great big McFarling hybrid, etc., - and overall, I found that it gave
the effective prediction accuracy of the larger predictor, with the
predictor latency (on predicted taken branches) of the small predictor.

I also tried a version where the L1 was a plain old predictor, and the
L2 was a "cache", containing tagged patterns that had produced
mispredictions in the past, with history lengths longer than the L1
uses. It worked, but the area efficiency was questionable. Tags are
expensive, and untagged tables of two bit counters are very cheap.

--

I should also mention that Adam Butts did similar work, after me and
before Daniel.

From: Paul A. Clayton on 14 Nov 2009 16:03

On Nov 11, 12:08 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
[snip]
> 2) You can make the pattern tables multilevel. But I found that it was
> not efficient to make the L1 into a cache - because cache implies cache
> tags. My version just had the L1 being a small fast dumb predictor, and
> the L2 being a big, slow, smarter and more complex predictor. I
> simulated all sorts of combinations, such as an L1 gshare with an L2 a
> great big McFarling hybrid, etc., - and overall, I found that it gave
> the effective prediction accuracy of the larger predictor, with the
> predictor latency (on predicted taken branches) of the small predictor.
>
> I also tried a version where the L1 was a plain old predictor, and the
> L2 was a "cache", containing tagged patterns that had produced
> mispredictions in the past, with history lengths longer than the L1
> uses. It worked, but the area efficiency was questionable. Tags are
> expensive, and untagged tables of two bit counters are very cheap.

Would it be useful to have a one-bit quasi-tag for the confidence
bit?
A match means a high-confidence possible hit OR a low-confidence
miss; a mis-match means a low-confidence possible hit OR a
high-confidence miss.

Paul A. Clayton
just a technophile

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8
Prev: Online Exams for Certification, Free Practice Exams, Study Material, Dumps
Next: Motherboard unusuable because of 1 millimeter of missing plastic ?!?!?!?!