From: Raymond Toy on
On 7/10/10 10:48 AM, Vladimir Vassilevsky wrote:
>
>
> Al Clark wrote:
>
>
>> The TI DSPs are heavily pipelined. I think this is the main reason
>> that assembly language programming is so difficult with them.
>
> TI assembler is evil. Perhaps, it was purposely done that way.

What makes TI assembler evil? Or at least more evil than other assembly
language?

>
>> SHARC instructions execute promptly. You can easily write either
>> assembly (looks a bit like C) or C.
>
> Sharc is also pipelined, and the pipeline is not even fully interlocked.
> Delayed branches and multiple bugs illustrate that.
> AD processors give the impression of a nice concept with a lacking realization.

I used a CEVA X DSP. It's also pipelined, and the pipeline is not interlocked
at all: you get to manage it yourself. The assembler does check for most
pipeline conflicts and warns you about them, though it can't detect every
issue, so you still get to check some of them yourself.

Ray
From: steveu on
>On Jul 9, 2:45 am, "steveu" <steveu(a)n_o_s_p_a_m.coppice.org> wrote:
>> >rickman <gnu...(a)gmail.com> wrote in news:74e09402-9a68-4256-80cb-
>> >d087368c3...(a)b35g2000yqi.googlegroups.com:
>>
>> >> On Jul 7, 12:33 am, Al Clark <acl...(a)danvillesignal.com> wrote:
>> >>> Raymond Toy <toy.raym...(a)gmail.com> wrote in
>> >>> news:i10bse$h90$1(a)news.eternal-september.org:
>>
>> >>> > On 7/6/10 6:33 PM, HardySpicer wrote:
>> >>> >> For floating point arithmetic how much faster is an add/subtract
>> >>> >> than a multiply/accumulate? (percentage wise).
>>
>> >>> > Probably depends on the chip. The last time I used a floating point
>> >>> > dsp (C30!) all floating point ops (add, sub, mul, mac) finished in a
>> >>> > single cycle. (I think.)
>>
>> >>> > Ray
>>
>> >>> I entered into the middle of this thread so unless I have the context
>> >>> wrong....
>>
>> >>> On a SHARC, floating point multiply and floating add have the same
>> >>> cost - one instruction, actually you can do two each in SIMD with
>> >>> some constraints. Fixed point math also operates in one cycle.
>>
>> >>> Instructions on a SHARC operate at the core clock, which can be as
>> >>> high as 450M. They all execute in 1 cycle.
>>
>> >>> I assume that the TI floating point DSPs would be similar.
>>
>> >>> Single cycle (1 instruction) processing is quite normal for DSPs.
>> >>> Algorithms that trade off multiplies for adds are not generally
>> >>> helpful with DSPs. OTOH, these techniques can be very useful for
>> >>> other types of devices such as FPGAs or GP microcontrollers.
>>
>> >>> Al Clark
>> >>> www.danvillesignal.com
>>
>> >> I haven't checked the specs on the SHARC, but aren't the TI floating
>> >> point chips pipelined? I remember the fixed point chips are (or were,
>> >> it's been a while since I've worked closely with them). To do a MAC
>> >> operation takes multiple cycles, but you can start a new one on each
>> >> CPU clock. Certainly it is possible to do floating point operations
>> >> in purely combinatorial logic, but pipelining lets it run much faster
>> >> with little added logic.
>>
>> >The TI DSPs are heavily pipelined. I think this is the main reason that
>> >assembly language programming is so difficult with them.
>>
>> >SHARC instructions execute promptly. You can easily write either
>> >assembly (looks a bit like C) or C.
>>
>> All high performance processors are heavily pipelined. The only
>> alternative to taking a number of cycles to complete a floating point
>> operation is to have a very low clock rate. Both the TI and ADI cores
>> are deeply pipelined. The difference is in how much the pipeline is
>> exposed to or hidden from the programmer. If you need an answer from one
>> of these processors to feed into the next step of the calculation you
>> need to wait quite a few cycles, whether it is by explicit programmer
>> action, or by a hardware controlled processor stall. In either case, if
>> you don't want to waste cycles you have to do some serious work hand
>> scheduling the flow.
>>
>> Steve
>
>That is not my experience. I recall now that the TI processors are
>pipelined just as most modern CPUs are pipelined. But the only stalls
>are when a branch instruction is executed, just as any pipelined
>processor stalls when you require an out of line instruction fetch.
>There can also be stalls for data, but that should only be when
>external memory is accessed or simultaneous accesses are made to the
>same memory block, although some memory is dual ported.

If the calculations have no data dependencies (the typical FIR pattern),
getting full speed operation is a no-brainer. If there are dependencies (the
typical IIR pattern) and you are careful about instruction order, you can
probably keep the processor pumping away on almost every cycle. If you can't
fill every slot with useful work, you get either a) automatic stalls, if the
core supports them, b) hand-crafted stalls, or c) a liberal use of pixie dust.
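As a plain C sketch of that distinction (generic code, not tuned for any
particular core): in the FIR loop no multiply depends on an earlier output, so
the MACs can be issued back to back, while in the IIR biquad each output feeds
the next two, and that recurrence is exactly what forces stalls or careful
hand scheduling.

    /* FIR: no feedback.  x[] holds the most recent ntaps samples, newest
       first; no multiply depends on a previous result other than the
       running accumulator, so a MAC can be started every cycle. */
    float fir(const float *x, const float *h, int ntaps)
    {
        float acc = 0.0f;
        for (int k = 0; k < ntaps; k++)
            acc += h[k] * x[k];
        return acc;
    }

    /* IIR biquad (direct form I): y[n] depends on y[n-1] and y[n-2], so
       the next output cannot start until the previous one has come out of
       the pipeline.  s[] = { x[n-1], x[n-2], y[n-1], y[n-2] } and
       c[] = { b0, b1, b2, a1, a2 } - hypothetical naming, for illustration. */
    float biquad(float x, float s[4], const float c[5])
    {
        float y = c[0]*x + c[1]*s[0] + c[2]*s[1] - c[3]*s[2] - c[4]*s[3];
        s[1] = s[0];  s[0] = x;      /* shift the input history */
        s[3] = s[2];  s[2] = y;      /* shift the output history */
        return y;
    }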

>
>DSP functions typically don't have a problem executing at full speed
>on these processors; they are designed to do that. I don't recall
>having any particular trouble with it. The TI C6x families are a bit
>trickier just because it can be hard to keep all the execution units
>working at full speed, but that is because you are given more
>flexibility, and that flexibility can be hard to use. The SHARC devices
>don't have as much flexibility, but are a bit easier to use because of
>it. But then I have not worked much with the SHARC devices, so I am no
>expert with them.

Steve

From: glen herrmannsfeldt on
HardySpicer <gyansorova(a)gmail.com> wrote:

> For floating point arithmetic how much faster is an add/subtract than
> a multiply/accumulate? (percentage wise).

I believe for the 360/91 it is two cycles for a double precision
add, six for a multiply, and 18 for a divide. Things have changed
over the years, though some of that change has been adopting algorithms
like those used in the 360/91.

In some cases it isn't the time but the required hardware
that increases. For more detail, specify how you will use the
result.

-- glen
From: glen herrmannsfeldt on
steveu <steveu(a)n_o_s_p_a_m.coppice.org> wrote:
>>On 7/6/10 6:33 PM, HardySpicer wrote:
>>> For floating point arithmetic how much faster is an
>>> add/subtract than a multiply/accumulate? (percentage wise).

>>Probably depends on the chip. The last time I used a floating point dsp
>>(C30!) all floating point ops (add, sub, mul, mac) finished in a single
>>cycle. (I think.)

> It could issue a new floating point instruction each cycle, but each
> instruction took a number of cycles to move through the pipeline, and pop
> out the end. I've never seen a floating point unit that attempted to
> complete the instructions in a single cycle. It would be extremely
> inefficient.

For the early SPARC, they wanted all instructions to be one cycle.
As they couldn't do that for multiply, they instead implemented
a multiply-step instruction that you execute the appropriate number
of times. Sun then used that in the trap routine called when
one attempted to do a multiply. Much better to use a multiple-cycle
instruction in the first place.
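
In C terms, what the step instruction (or the trap/library routine built on
it) amounts to is a shift-and-add multiply, one partial product per step. A
rough sketch of the idea (the real SPARC multiply step, MULScc, works through
the Y register and condition codes, so this is only the flavour, not the
actual semantics):

    #include <stdint.h>

    /* Shift-and-add multiply: one conditional add and one shift per bit of
       the multiplier, roughly the work each multiply step performs.  That
       is 32 iterations per multiply, versus one pipelined hardware
       instruction on a core with a real multiplier. */
    uint32_t soft_mul(uint32_t a, uint32_t b)
    {
        uint32_t product = 0;
        for (int i = 0; i < 32; i++) {
            if (b & 1)
                product += a;
            a <<= 1;     /* next partial product is weighted one bit higher */
            b >>= 1;
        }
        return product;
    }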

-- glen