From: Andy Glew "newsgroup at on
On 8/5/2010 8:18 AM, Paul A. Clayton wrote:
> On Aug 5, 9:20 am, Andy Glew<"newsgroup at comp-arch.net"> wrote:
> [snip]

> A tiny future file (e.g., 16 bits of three registers) integrated
> with an adder (or two) could be very low latency.
>
> Interestingly, most recent x86 processors do have a limited
> set of front-end registers--the segment registers--(with
> dedicated adders to create a single immediate), though a
> segment register update stalls the pipeline rather than
> allowing forward progress on independent operations.
>
>
> Paul A. Clayton
> just a technophile
>



Plus...

loop counter prediction

stack pointer tracking





This latter, http://www.intel.com/assets/pdf/manual/248966.pdf:

2.1.2.5 Stack Pointer Tracker
The Intel 64 and IA-32 architectures have several commonly used
instructions for parameter passing and procedure entry and exit: PUSH,
POP, CALL, LEAVE and RET. These instructions implicitly update the stack
pointer register (RSP), maintaining a combined control and parameter
stack without software intervention. These instructions are typically
implemented by several μops in previous microarchitectures.

The Stack Pointer Tracker moves all these implicit RSP updates to logic
contained in the decoders themselves. The feature provides the following
benefits:
• Improves decode bandwidth, as PUSH, POP and RET are single μop
  instructions in Intel Core microarchitecture.
• Conserves execution bandwidth as the RSP updates do not compete for
  execution resources.
• Improves parallelism in the out of order execution engine as the
  implicit serial dependencies between μops are removed.
• Improves power efficiency as the RSP updates are carried out on small,
  dedicated hardware.
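
To make the idea in the excerpt concrete, here is a rough Python sketch of a decoder that keeps a running RSP offset, so each PUSH/POP/CALL/RET becomes a memory access at a known offset instead of a separate RSP-update uop. It is only a toy model under stated assumptions: the function name, the "USE_RSP" pseudo-instruction, and the "SYNC_RSP" uop that folds the accumulated offset back into the real register are hypothetical illustrations, not something the manual excerpt describes. The 8-byte steps assume 64-bit mode. LEAVE is left out because it reloads RSP from RBP rather than adjusting it by a constant.

```python
# Toy model of decode-time stack pointer tracking.
# All names here are hypothetical; this is not Intel's actual design.

# Bytes by which each instruction implicitly moves RSP (64-bit mode assumed).
IMPLICIT_RSP_DELTA = {"PUSH": -8, "POP": 8, "CALL": -8, "RET": 8}


def track_stack_pointer(instructions):
    """Return (uops, leftover_delta) for a list of mnemonics.

    Each uop is (mnemonic, offset), where offset is the decode-time
    displacement from the last architecturally synced RSP value, or
    None for instructions that do not touch the stack pointer.
    """
    delta = 0  # offset accumulated in the decoder since the last sync
    uops = []
    for op in instructions:
        if op in IMPLICIT_RSP_DELTA:
            step = IMPLICIT_RSP_DELTA[op]
            if step < 0:
                # PUSH/CALL: decrement first, then store at the new top.
                delta += step
                uops.append((op, delta))
            else:
                # POP/RET: load from the current top, then increment.
                uops.append((op, delta))
                delta += step
        elif op == "USE_RSP":
            # Assumption: an explicit read of RSP needs the real value,
            # so the pending offset is folded in by an extra sync uop.
            if delta != 0:
                uops.append(("SYNC_RSP", delta))
                delta = 0
            uops.append((op, 0))
        else:
            uops.append((op, None))
    return uops, delta
```

In this model a balanced PUSH/CALL/RET/POP sequence never produces an RSP-updating uop at all, which is the serial dependency the excerpt says is removed.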



Only thing is, these are somewhat ad-hoc solutions. Not a generic
future file.



By the way, the segment registers are not really implemented as a future
file. They are implemented as a non-renamed or differently renamed
register file, inside the AGU. Not a future file at all. Entirely read
after schedule.
From: Paul A. Clayton on
On Aug 6, 1:31 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> On 8/5/2010 8:18 AM, Paul A. Clayton wrote:
[snip]
> > Interestingly, most recent x86 processors do have a limited
> > set of front-end registers--the segment registers--(with
> > dedicated adders to create a single immediate), though a
[snip]
> Plus...
>
> loop counter prediction

Yet I do not think such predictors actually take advantage of
counter knowledge to resolve rather than just predict
branches.
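
For reference, the kind of predictor I have in mind looks roughly like the Python sketch below: it learns the trip count from the previous run of the loop and predicts the exit when the count is reached. It never looks at the actual counter register, so it still only predicts. The class and helper names are made up for illustration, and this is a generic textbook-style sketch, not any particular processor's design.

```python
# Generic trip-count loop predictor sketch (hypothetical names).

class LoopPredictor:
    def __init__(self):
        self.trip_count = None  # taken count seen in the last complete run
        self.current = 0        # taken branches seen so far in this run

    def predict(self):
        """Predict whether the loop-closing branch is taken."""
        if self.trip_count is None:
            return True  # no history yet: assume the loop continues
        return self.current < self.trip_count

    def update(self, taken):
        """Train on the resolved outcome of the branch."""
        if taken:
            self.current += 1
        else:
            # Loop exit: remember how many iterations it took.
            self.trip_count = self.current
            self.current = 0


def count_mispredictions(predictor, outcomes):
    """Run a sequence of resolved outcomes and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong
```

With a fixed trip count, the first run of the loop mispredicts the exit and later runs do not, but a changed trip count is only noticed after the fact.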

> stack pointer tracking

In a way this is like a non-speculative stride-based value
predictor.

> By the way, the segment registers are not really implemented as a future
> file. They are implemented as a non-renamed or differently renamed
> register file, inside the AGU.  Not a future file at all. Entirely read
> after schedule.

Hmm. I thought the immediate was added in the front end (to reduce
the number of sources in the AGU). So can the AGUs actually use
four inputs--segment base, immediate, base register, index register?


Paul A. Clayton
just a technophile
From: Norbert Juffa on
"Paul A. Clayton" <paaronclayton(a)embarqmail.com> wrote in message
news:be8676ab-555f-49fd-9550-7d3f8c983c22(a)o19g2000yqb.googlegroups.com...
> On Aug 6, 1:31 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> > On 8/5/2010 8:18 AM, Paul A. Clayton wrote:
> [snip]
> > > Interestingly, most recent x86 processors do have a limited
> > > set of front-end registers--the segment registers--(with
> > > dedicated adders to create a single immediate), though a

[...]

> > By the way, the segment registers are not really implemented as a future
> > file. They are implemented as a non-renamed or differently renamed
> > register file, inside the AGU. Not a future file at all. Entirely read
> > after schedule.
>
> Hmm. I thought the immediate was added in the front end (to reduce
> the number of sources in the AGU). So can the AGUs actually use
> four inputs--segment base, immediate, base register, index register?


As I recall it, AMD's K6-family used a 4-input adder, since 16-bit operating
systems using non-zero segment bases were still in common use during most of
the lifetime of that processor family.

The Athlon could only handle a segment base of zero at full speed; an extra
cycle of delay was incurred in case of a non-zero segment base. By the time
Athlon shipped, 32-bit operating systems had become common, and from what I
remember they all used a flat address space (i.e. segment base of zero).

-- Norbert


From: Andy Glew "newsgroup at on
On 8/6/2010 11:49 AM, Paul A. Clayton wrote:
> On Aug 6, 1:31 am, Andy Glew<"newsgroup at comp-arch.net"> wrote:
>> On 8/5/2010 8:18 AM, Paul A. Clayton wrote:
> [snip]
>>> Interestingly, most recent x86 processors do have a limited
>>> set of front-end registers--the segment registers--(with
>>> dedicated adders to create a single immediate), though a
> [snip]
>> Plus...
>>
>> loop counter prediction
>
> Yet I do not think such predictors actually take advantage of
> counter knowledge to resolve rather than just predict
> branches.
>
>> stack pointer tracking
>
> In a way this is like a non-speculative stride-based value
> predictor.
>
>> By the way, the segment registers are not really implemented as a future
>> file. They are implemented as a non-renamed or differently renamed
>> register file, inside the AGU. Not a future file at all. Entirely read
>> after schedule.
>
> Hmm. I thought the immediate was added in the front end (to reduce
> the number of sources in the AGU). So can the AGUs actually use
> four inputs--segment base, immediate, base register, index register?

The original P6, and some subsequent P6es, did the full 4 input add. In
a single uop.

Certain subsequent Intel machines split them up into two uops. By the
way: adding in the segment base was not a problem, since that was a
semi-renamed resource in the AGU. Multi-input adders are easy. The
problem was getting the three non-segment components out of the OOO
machine: basereg, indexreg, and immediate constant.

I believe that AMD took an extra cycle if the segbase was nonzero.
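
As a small illustration of the arithmetic being discussed, the Python sketch below computes a 32-bit linear address from the four inputs in one step, and then again as two dependent steps. The function names are hypothetical, the index scale factor is standard x86 addressing rather than something mentioned in this thread, and the particular way the work is divided between the two steps is an assumption made only for the example; the thread does not say how the real machines divided it.

```python
# Sketch of 32-bit x86 linear address formation (hypothetical names).

MASK32 = 0xFFFFFFFF  # addresses wrap at 32 bits in this model


def linear_address(seg_base, base, index, disp, scale=1):
    """Four-input add done in one step: segment base + base + index*scale + disp."""
    return (seg_base + base + index * scale + disp) & MASK32


def linear_address_two_steps(seg_base, base, index, disp, scale=1):
    """Same result as two dependent adds.

    Assumed split, for illustration only: the first step combines the
    two register operands, the second adds the displacement and the
    segment base.
    """
    partial = (base + index * scale) & MASK32
    return (seg_base + partial + disp) & MASK32
```

Because everything is modulo 2^32, the two versions always agree. The cost of the split is in uop count and latency, not in the result.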