From: clock_mountain on 12 May 2010 21:47
It is a simple function with three nested for-loops. Loop 1 has 7 iterations, loop 2 has 2, and loop 3, the innermost loop, has 12 iterations. The innermost loop does only one CMPYR (the TI intrinsic for one complex multiply with rounding). Theoretically it should take 7*2*12=168 cycles, assuming one CMPYR per cycle with software pipelining. I profiled my code with the TI CCS v3.3 standard simulator, and the total for the three loops is 196 cycles, which is acceptable. But the cycle count for the whole function is 594. Besides the loops, the function body contains only a few lines of definitions, simple offset computation and pointer initialization. Where do these 594-196=398 cycles go?
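For reference, the loop nest described above can be sketched roughly as follows. This is only an illustration of the 7*2*12 arithmetic, not the poster's code: `cmpyr_stub`, `count_cmpyr_calls` and the loop-bound names are hypothetical, and the stub merely counts calls instead of performing TI's CMPYR.

```c
#include <stdint.h>

/* Loop bounds taken from the post; the names are hypothetical. */
enum { N_OUTER = 7, N_MID = 2, N_INNER = 12 };

static unsigned g_cmpyr_calls = 0;

/* Hypothetical stand-in for the TI CMPYR intrinsic: it only counts calls
 * and returns a dummy value, so the sketch builds on any C compiler. */
uint32_t cmpyr_stub(uint32_t a, uint32_t b)
{
    g_cmpyr_calls++;
    return a ^ b;
}

/* Runs the three-level loop nest and returns how many multiplies it issued. */
unsigned count_cmpyr_calls(void)
{
    uint32_t acc = 0;
    int i, j, k;

    g_cmpyr_calls = 0;
    for (i = 0; i < N_OUTER; i++)
        for (j = 0; j < N_MID; j++)
            for (k = 0; k < N_INNER; k++)
                acc ^= cmpyr_stub((uint32_t)i + (uint32_t)j, (uint32_t)k);

    (void)acc; /* result is irrelevant here; only the call count matters */
    return g_cmpyr_calls;
}
```

With one multiply per cycle the ideal budget is 168 cycles, so the measured 196 cycles for the loops leaves 28 cycles of loop overhead, and the remaining 398 cycles are spent outside the loops.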
From: Vladimir Vassilevsky on 12 May 2010 21:57
clock_mountain wrote:
> It is a simple function with three-layer for-loops. Loop 1 has 7
> iterations, loop 2 has 2 and loop 3, the inner most loop has 12
> iterations. The inner most loop only does one CMPYR (TI intrinsic for
> one complex multiply and rounding). Theoretically it should take
> 7*2*12=168 cycles assuming one CMPYR per cycle with software
> pipelining. I profiled my code with TI CCS v3.3 standard simulator and
> the total cycles for the three loops is 196 cycles, which is
> acceptable. But the cycle count for the function is 594. In the
> function body, there are a few lines of definition, simple offset
> computation and pointer instantiation, in addition to the loops. Where
> does these 594-196=398 cycles go?
Unroll your inner loop and do the pointer arithmetic by hand instead of using arrays and indexes.
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
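A minimal sketch of what this advice could look like in C, assuming a 12-element inner loop over packed 32-bit values. The function names and the `op` placeholder are hypothetical; `op` stands in for the CMPYR intrinsic so that the example compiles anywhere, and the unroll factor of 4 is an arbitrary choice.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical placeholder for the per-element operation (CMPYR in the post). */
uint32_t op(uint32_t a, uint32_t b)
{
    return a * 3u + b;
}

/* Baseline: array indexing, one element per iteration. */
void kernel_indexed(const uint32_t *x, const uint32_t *w, uint32_t *y)
{
    size_t k;
    for (k = 0; k < 12; k++)
        y[k] = op(x[k], w[k]);
}

/* Same work, unrolled by 4 with the pointers advanced by hand. */
void kernel_unrolled(const uint32_t *x, const uint32_t *w, uint32_t *y)
{
    int k;
    for (k = 0; k < 12; k += 4) {
        y[0] = op(x[0], w[0]);
        y[1] = op(x[1], w[1]);
        y[2] = op(x[2], w[2]);
        y[3] = op(x[3], w[3]);
        x += 4;
        w += 4;
        y += 4;
    }
}
```

Both versions produce identical output; whether the hand-unrolled form actually runs faster depends on the compiler and its optimization settings, so it is worth profiling both.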
From: Manny on 12 May 2010 22:40
On May 13, 2:47 am, clock_mountain <wuzhongs...(a)gmail.com> wrote:
> It is a simple function with three-layer for-loops. Loop 1 has 7
> iterations, loop 2 has 2 and loop 3, the inner most loop has 12
> iterations. The inner most loop only does one CMPYR (TI intrinsic for
> one complex multiply and rounding). Theoretically it should take
> 7*2*12=168 cycles assuming one CMPYR per cycle with software
> pipelining. I profiled my code with TI CCS v3.3 standard simulator and
> the total cycles for the three loops is 196 cycles, which is
> acceptable. But the cycle count for the function is 594. In the
> function body, there are a few lines of definition, simple offset
> computation and pointer instantiation, in addition to the loops. Where
> does these 594-196=398 cycles go?
Chip vendors' tools are by definition buggy, or at least that's what we lay users think. It does make sense to get a proper third-party toolchain. The logic here is that those folks make a living out of this and will try harder to make the thing usable. If you can't, play it safe and follow templates whenever you can. And of course, there's always the possibility that you're missing some important quirk.
-Momo
From: clock_mountain on 13 May 2010 03:36
On May 12, 6:57 pm, Vladimir Vassilevsky <nos...(a)nowhere.com> wrote:
> clock_mountain wrote:
> > It is a simple function with three-layer for-loops. Loop 1 has 7
> > iterations, loop 2 has 2 and loop 3, the inner most loop has 12
> > iterations. The inner most loop only does one CMPYR (TI intrinsic for
> > one complex multiply and rounding). Theoretically it should take
> > 7*2*12=168 cycles assuming one CMPYR per cycle with software
> > pipelining. I profiled my code with TI CCS v3.3 standard simulator and
> > the total cycles for the three loops is 196 cycles, which is
> > acceptable. But the cycle count for the function is 594. In the
> > function body, there are a few lines of definition, simple offset
> > computation and pointer instantiation, in addition to the loops. Where
> > does these 594-196=398 cycles go?
>
> Unroll your inner loop and do pointer arithmetics by hand instead of
> using arrays and indexes.
>
> Vladimir Vassilevsky
> DSP and Mixed Signal Design Consultant
> http://www.abvolt.com
Thanks for all your replies. With the "-o3" compiler option, the access counts for loop 2 and loop 3 are 0 and 7 respectively. This implies that both loop 2 and loop 3 are unrolled by the optimizing TI compiler. I think that's how I can get 168 complex multiplies (16 MSBs for the real part, 16 LSBs for the imaginary part) done in 196 cycles. This function takes 3 pointers and 2 integers as input arguments. The 3 pointers point to global buffers. Does that matter?
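To make the packed format mentioned above concrete, here is a portable sketch of a complex multiply on 32-bit words that hold the real part in the upper 16 bits and the imaginary part in the lower 16 bits. It is not a bit-exact model of TI's CMPYR: the Q15 scaling, the round-to-nearest step and the saturation are assumptions made for illustration, and all function names are hypothetical.

```c
#include <stdint.h>

/* Pack a complex value: real part in the 16 MSBs, imaginary part in the 16 LSBs. */
uint32_t pack_c16(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)re << 16) | (uint32_t)(uint16_t)im;
}

int16_t real_part(uint32_t p) { return (int16_t)(uint16_t)(p >> 16); }
int16_t imag_part(uint32_t p) { return (int16_t)(uint16_t)(p & 0xFFFFu); }

/* Clamp a wide intermediate result to the int16_t range. */
static int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Assumed Q15 model: (ar + j*ai) * (br + j*bi), rounded and saturated.
 * Relies on arithmetic right shift of negative values, which is what
 * mainstream compilers provide. */
uint32_t cmul_q15_model(uint32_t a, uint32_t b)
{
    int64_t ar = real_part(a), ai = imag_part(a);
    int64_t br = real_part(b), bi = imag_part(b);

    int64_t re = ar * br - ai * bi;   /* real part of the product */
    int64_t im = ar * bi + ai * br;   /* imaginary part of the product */

    re = (re + 0x4000) >> 15;         /* round to nearest, back to Q15 */
    im = (im + 0x4000) >> 15;

    return pack_c16(sat16(re), sat16(im));
}
```

For example, multiplying 0.5 by 0.5 in this model (0x4000 in both real parts) gives 0.25, i.e. 0x2000 in the upper half of the result word.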
From: Vladimir Vassilevsky on 13 May 2010 09:39
clock_mountain wrote:
> On May 12, 6:57 pm, Vladimir Vassilevsky <nos...(a)nowhere.com> wrote:
>
>>clock_mountain wrote:
>>
>>>It is a simple function with three-layer for-loops. Loop 1 has 7
>>>iterations, loop 2 has 2 and loop 3, the inner most loop has 12
>>>iterations. The inner most loop only does one CMPYR (TI intrinsic for
>>>one complex multiply and rounding). Theoretically it should take
>>>7*2*12=168 cycles assuming one CMPYR per cycle with software
>>>pipelining. I profiled my code with TI CCS v3.3 standard simulator and
>>>the total cycles for the three loops is 196 cycles, which is
>>>acceptable. But the cycle count for the function is 594. In the
>>>function body, there are a few lines of definition, simple offset
>>>computation and pointer instantiation, in addition to the loops. Where
>>>does these 594-196=398 cycles go?
>>
>>Unroll your inner loop and do pointer arithmetics by hand instead of
>>using arrays and indexes.
>
> Thanks to all your replies. With "-o3" compiler option, the access
> counts for loop2 and loop 3 are 0 and 7 respectively. It implies that
> both loop 2 and loop 3 are unrolled by the optimizing TI compiler. I
> think that's how i can get 168 complex multiplies (16MSB for real
> part, 16LSB for imaginary part) done by 196 cycles. This function
> takes 3 pointers and 2 integers as input arguments. The 3 pointers
> point to global buffers. Does that matter?
Now stop babbling and go do what you've been told.
VLV