From: rickman on
On Jul 7, 12:33 am, Al Clark <acl...(a)danvillesignal.com> wrote:
> Raymond Toy <toy.raym...(a)gmail.com> wrote in news:i10bse$h90$1(a)news.eternal-september.org:
>
> > On 7/6/10 6:33 PM, HardySpicer wrote:
> >> For floating point arithmetic how much faster is an add/subtract than
> >> a multiply/accumulate? (percentage wise).
>
> > Probably depends on the chip.  The last time I used a floating point dsp
> > (C30!) all floating point ops (add, sub, mul, mac) finished in a single
> > cycle.  (I think.)
>
> > Ray
>
> I entered into the middle of this thread so unless I have the context
> wrong....
>
> On a SHARC, floating point multiply and floating add have the same cost - one
> instruction, actually you can do two each in SIMD with some constraints.
> Fixed point math also operates in one cycle.
>
> Instructions on a SHARC operate at the core clock, which can be as high as
> 450M. They all execute in 1 cycle.
>
> I assume that the TI floating point DSPs would be similar.
>
> Single cycle (1 instruction) processing is quite normal for DSPs. Algorithms
> that trade off multiplies for adds are not generally helpful with DSPs. OTOH,
> these techniques can be very useful for other type of devices such as FPGAs
> or GP microcontrollers.
>
> Al Clark
> www.danvillesignal.com

I haven't checked the specs on the SHARC, but aren't the TI floating
point chips pipelined? I remember the fixed point chips are (or were;
it's been a while since I've worked closely with them). A MAC
operation takes multiple cycles to complete, but you can start a new
one on each CPU clock. Certainly it is possible to do floating point
operations in purely combinational logic, but pipelining lets the
chip run much faster with little added logic.
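
To put rough numbers on the latency versus throughput point above,
here is a minimal sketch in C. It is a back-of-the-envelope model, not
a description of any particular TI or ADI part: the function names and
the latency figure are made up for illustration, and it assumes the
operations are independent and one can be issued every cycle.

```c
#include <assert.h>

/* Cycles to finish n independent operations on a pipelined unit that
   accepts one new operation per cycle. The first result appears after
   `latency` cycles; each later result follows one cycle behind the
   previous one. (Hypothetical model, not a specific DSP.) */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency)
{
    assert(latency >= 1);
    if (n == 0)
        return 0;
    return latency + (n - 1);
}

/* Cycles for the same work on a unit that cannot overlap operations:
   every operation pays the full latency. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long latency)
{
    assert(latency >= 1);
    return n * latency;
}
```

With an assumed latency of 4 cycles, pipelined_cycles(1000, 4) gives
1003 while unpipelined_cycles(1000, 4) gives 4000, which is why a long
run of MACs looks like "one cycle each" even though no single MAC
finishes in one cycle.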

BTW, there are at least two types of floating point chips these days.
TI has their C67x family, which is a barn burner with multiple compute
engines. TI also has smaller, low cost chips which are built for
control apps. But to be honest, I don't recall reading whether those
are pipelined or not.

Rick
From: Al Clark on
rickman <gnuarm(a)gmail.com> wrote in news:74e09402-9a68-4256-80cb-
d087368c340c(a)b35g2000yqi.googlegroups.com:

> I haven't checked the specs on the SHARC, but aren't the TI floating
> point chips pipelined? I remember the fixed point chips are (or were,
> it's been a while since I've worked closely with them). To do a MAC
> operation takes multiple cycles, but you can start an new one on each
> CPU clock. Certainly it is possible to do floating point operations
> in purely combinatorial logic, but pipelining lets it run much faster
> with little added logic.
>

The TI DSPs are heavily pipelined. I think this is the main reason that
assembly language programming is so difficult with them.

SHARC instructions execute promptly. You can easily write in either
assembly (which looks a bit like C) or C.

Al Clark
www.danvillesignal.com

From: steveu on
>rickman <gnuarm(a)gmail.com> wrote in news:74e09402-9a68-4256-80cb-
>d087368c340c(a)b35g2000yqi.googlegroups.com:
>
>> I haven't checked the specs on the SHARC, but aren't the TI floating
>> point chips pipelined? I remember the fixed point chips are (or were,
>> it's been a while since I've worked closely with them). To do a MAC
>> operation takes multiple cycles, but you can start an new one on each
>> CPU clock. Certainly it is possible to do floating point operations
>> in purely combinatorial logic, but pipelining lets it run much faster
>> with little added logic.
>>
>
>The TI DSPs are heavily pipelined. I think this is the main reason that
>assembly language programming is so difficult with them.
>
>SHARC instructions execute promptly. You can easily write either assembly
>(looks a bit like C) and C.

All high performance processors are heavily pipelined. The only alternative
to taking a number of cycles to complete a floating point operation is to
run at a very low clock rate. Both the TI and ADI cores are deeply pipelined.
The difference is in how much of the pipeline is exposed to, or hidden from,
the programmer. If you need a result from one of these processors to feed
into the next step of the calculation, you have to wait quite a few cycles,
whether by explicit programmer action or by a hardware-controlled processor
stall. In either case, if you don't want to waste cycles you have to do some
serious work hand-scheduling the flow.
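
One common way to do that kind of scheduling by hand is to break a
single dependency chain into several independent ones. The sketch
below shows the idea on a dot product in plain C. The function names
and the choice of four accumulators are assumptions for illustration
only; the right number depends on the actual pipeline depth of the
part, and a good compiler may already do this for you.

```c
#include <assert.h>
#include <stddef.h>

/* Every add needs the previous sum, so each MAC has to wait for the
   one before it to leave the pipeline (one serial dependency chain). */
float dot_serial(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Four independent partial sums: consecutive MACs do not depend on
   each other, so they can overlap in a pipelined unit. */
float dot_split4(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i]     * y[i];
        a1 += x[i + 1] * y[i + 1];
        a2 += x[i + 2] * y[i + 2];
        a3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        a0 += x[i] * y[i];
    return (a0 + a1) + (a2 + a3);
}
```

Note that the split version adds the terms in a different order, so
in floating point the result can differ from the serial one in the
last bits.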

Steve

From: Vladimir Vassilevsky on


Al Clark wrote:


> The TI DSPs are heavily pipelined. I think this is the main reason that
> assembly language programming is so difficult with them.

TI assembler is evil. Perhaps, it was purposely done that way.

> SHARC instructions execute promptly. You can easily write either assembly
> (looks a bit like C) and C.

SHARC is also pipelined, and the pipeline is not even fully interlocked.
Delayed branches and multiple bugs illustrate that.
AD processors give the impression of a nice concept with a lacking
implementation.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com


From: rickman on
On Jul 9, 2:45 am, "steveu" <steveu(a)n_o_s_p_a_m.coppice.org> wrote:
> All high performance processors are heavily pipelined. The only alternative
> to taking a number of cycles to complete a floating point operation is to
> have a very low clock rate. Both the TI and ADI cores are deeply pipelined.
> The difference is in how much the pipeline is exposed to or hidden from the
> programmer. If you need an answer from one of these processors to feed into
> the next step of the calculation you need to wait quite a few cycles,
> whether it is by explicit programmer action, or by a hardware controlled
> processor stall. In either case, if you don't want to waste cycles you have
> to do some serious work hand scheduling the flow.
>
> Steve

That is not my experience. I recall now that the TI processors are
pipelined just as most modern CPUs are. But the only stalls come when
a branch instruction is executed, just as any pipelined processor
stalls when it needs an out-of-line instruction fetch. There can also
be stalls for data, but that should only happen when external memory
is accessed or when simultaneous accesses are made to the same memory
block, although some memory is dual ported.

DSP functions typically don't have a problem executing at full speed
on these processors; they are designed to do that. I don't recall
having any particular trouble with it. The TI C6x families are a
bit trickier, just because it can be hard to keep all the execution
units working at full speed. It is more that you are given more
flexibility, and that flexibility can be hard to use, while the SHARC
devices have less flexibility but are a bit easier to use because of
it. But then I have not worked much with the SHARC devices, so I am no
expert on them.

Rick