Where do those cycles go? [DSP]

From: clock_mountain on 12 May 2010 21:47

It is a simple function with a three-level nest of for-loops. Loop 1
has 7 iterations, loop 2 has 2, and loop 3, the innermost loop, has 12
iterations. The innermost loop does only one CMPYR (TI intrinsic for
one complex multiply with rounding). Theoretically it should take
7*2*12=168 cycles, assuming one CMPYR per cycle with software
pipelining. I profiled my code with the TI CCS v3.3 standard simulator,
and the total cycle count for the three loops is 196, which is
acceptable. But the cycle count for the whole function is 594. Besides
the loops, the function body contains only a few lines of definitions,
simple offset computations, and pointer initialization. Where do these
594-196=398 cycles go?
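
The arithmetic in the question can be checked with a minimal sketch. The loop bounds come from the post; the function names are hypothetical, and the loop body is only a counter standing in for the single CMPYR.

```c
#include <assert.h>

/* Loop bounds as stated in the post. */
enum { N_OUTER = 7, N_MID = 2, N_INNER = 12 };

/* Walk the 7 x 2 x 12 nest and count how often the innermost body runs. */
int inner_iterations(void)
{
    int count = 0;
    for (int i = 0; i < N_OUTER; i++)
        for (int j = 0; j < N_MID; j++)
            for (int k = 0; k < N_INNER; k++)
                count++;   /* one complex multiply would go here */
    return count;
}

/* Cycles in a measured total that a measured part does not account for. */
int cycles_unaccounted(int total_cycles, int part_cycles)
{
    assert(total_cycles >= part_cycles);
    return total_cycles - part_cycles;
}
```

With the profiler numbers from the post, cycles_unaccounted(196, inner_iterations()) gives 28 cycles of loop overhead above the ideal 168, and cycles_unaccounted(594, 196) gives the 398 cycles spent outside the loops.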

From: Vladimir Vassilevsky on 12 May 2010 21:57

Unroll your inner loop and do pointer arithmetic by hand instead of
using arrays and indexes.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
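
A minimal sketch of the advice above, in portable C. Everything here is an assumption made for illustration: the function names, the buffer layout (a and out hold 7*2*12 packed values, b holds 12), and cmul_model, which is a plain-C stand-in and not the TI CMPYR intrinsic. Whether the hand-written form actually changes the cycle count depends on the compiler and is not shown by this code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N_OUTER = 7, N_MID = 2, N_INNER = 12,
       N_TOTAL = N_OUTER * N_MID * N_INNER };

/* Hypothetical stand-in for a packed Q15 complex multiply with rounding
   (real part in the upper 16 bits, imaginary part in the lower 16 bits). */
static inline uint32_t cmul_model(uint32_t a, uint32_t b)
{
    int32_t ar = (int16_t)(a >> 16), ai = (int16_t)(a & 0xFFFFu);
    int32_t br = (int16_t)(b >> 16), bi = (int16_t)(b & 0xFFFFu);
    int64_t re = ((int64_t)ar * br - (int64_t)ai * bi + (1 << 14)) >> 15;
    int64_t im = ((int64_t)ar * bi + (int64_t)ai * br + (1 << 14)) >> 15;
    return ((uint32_t)(uint16_t)re << 16) | (uint16_t)im;
}

/* Array-and-index version: three nested loops, index computed each time. */
void mul_indexed(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    for (int i = 0; i < N_OUTER; i++)
        for (int j = 0; j < N_MID; j++)
            for (int k = 0; k < N_INNER; k++) {
                size_t idx = ((size_t)i * N_MID + (size_t)j) * N_INNER
                             + (size_t)k;
                out[idx] = cmul_model(a[idx], b[k]);
            }
}

/* Pointer version: the two outer loops are merged, the inner loop is
   unrolled by 4 by hand, and all addressing is done with pointer bumps. */
void mul_pointers(const uint32_t *restrict a, const uint32_t *restrict b,
                  uint32_t *restrict out)
{
    assert(N_INNER % 4 == 0);   /* unroll factor must divide the trip count */
    for (int row = 0; row < N_OUTER * N_MID; row++) {
        const uint32_t *pb = b;
        for (int k = 0; k < N_INNER; k += 4) {
            *out++ = cmul_model(*a++, *pb++);
            *out++ = cmul_model(*a++, *pb++);
            *out++ = cmul_model(*a++, *pb++);
            *out++ = cmul_model(*a++, *pb++);
        }
    }
}

/* Self-check: both versions must produce identical output. */
int versions_agree(void)
{
    uint32_t a[N_TOTAL], b[N_INNER], x[N_TOTAL], y[N_TOTAL];
    for (int n = 0; n < N_TOTAL; n++)
        a[n] = 0x01230045u * (uint32_t)(n + 1);   /* arbitrary pattern */
    for (int k = 0; k < N_INNER; k++)
        b[k] = 0x00070003u + (uint32_t)k;
    mul_indexed(a, b, x);
    mul_pointers(a, b, y);
    for (int n = 0; n < N_TOTAL; n++)
        if (x[n] != y[n])
            return 0;
    return 1;
}
```

The restrict qualifiers tell the compiler that the three buffers do not overlap, which is an assumption about the caller that the code itself does not verify.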

From: Manny on 12 May 2010 22:40

Chip vendors' tools are by definition buggy, or at least that's what
we lay users think. It does make sense to get a proper third-party
toolchain. The logic here is that those folks make a living out of
this and will try harder at making the thing more usable. If you
can't, play it safe and follow templates whenever you can. And of
course, there's always the possibility that you're missing some
important quirk.

-Momo

From: clock_mountain on 13 May 2010 03:36

Thanks for all your replies. With the "-o3" compiler option, the access
counts for loop 2 and loop 3 are 0 and 7 respectively. This implies
that both loop 2 and loop 3 are unrolled by the optimizing TI compiler.
I think that's how I can get 168 complex multiplies (16 MSBs for the
real part, 16 LSBs for the imaginary part) done in 196 cycles. This
function takes 3 pointers and 2 integers as input arguments. The 3
pointers point to global buffers. Does that matter?
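
The packed format mentioned above (real part in the 16 MSBs, imaginary part in the 16 LSBs) can be sketched in portable C as follows. The helper names are hypothetical, and cmpy_round_model is only a plain-C model of a Q15 complex multiply with rounding. It omits saturation, and the exact rounding and saturation behaviour of the real CMPYR instruction should be checked against TI's documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a complex value: real part in the 16 MSBs, imaginary in the 16 LSBs. */
uint32_t pack_cplx(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)re << 16) | (uint16_t)im;
}

/* Extract the signed real part (upper half-word). */
int16_t cplx_re(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Extract the signed imaginary part (lower half-word). */
int16_t cplx_im(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* Hypothetical model: Q15 complex product, each part rounded to 16 bits.
   No saturation is applied; out-of-range results trip the asserts. */
uint32_t cmpy_round_model(uint32_t a, uint32_t b)
{
    int64_t ar = cplx_re(a), ai = cplx_im(a);
    int64_t br = cplx_re(b), bi = cplx_im(b);
    int64_t re = (ar * br - ai * bi + (1 << 14)) >> 15;
    int64_t im = (ar * bi + ai * br + (1 << 14)) >> 15;
    assert(re >= INT16_MIN && re <= INT16_MAX);
    assert(im >= INT16_MIN && im <= INT16_MAX);
    return pack_cplx((int16_t)re, (int16_t)im);
}
```

For example, (0.5 + 0.5j) * (0.5 - 0.5j) = 0.5 in Q15 is cmpy_round_model(pack_cplx(16384, 16384), pack_cplx(16384, -16384)), which yields a real part of 16384 and an imaginary part of 0.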

From: Vladimir Vassilevsky on 13 May 2010 09:39

Now stop babbling and go do what you've been told.

VLV
