From: Robert Myers on
On Apr 7, 1:10 pm, n...(a)cam.ac.uk wrote:
> In article <romu87-uk61....(a)ntp.tmsw.no>,
> Terje Mathisen  <"terje.mathisen at tmsw.no"> wrote:
>
>
>
>
>
> >Noob wrote:
>
> >> I imagine the following problem has been efficiently solved
> >> over a million times in the past.
>
> >> I need to rotate a picture clockwise 90 degrees.
>
> >> Conceptually, my picture is represented by a matrix
> >> (W columns, H rows) of pixels, with each pixel
> >> represented by 3 octets.
>
> >> pixel original_picture[W][H];
> >> pixel rotated_picture[H][W];
>
> >This is indeed a well-known problem, the key to make it fast is to
> >realize that it is very similar to a transpose operation!
>
> >I.e. first copy everything to the target array (in a single block move),
> >then transpose it, finally you reverse each of the rows.
>
> >The difference between clock and counter-clock-wise rotation is in doing
> >the row reverse either before or after the transpose. Try both since I
> >don't remeber which is which!
>
> >The key here is that the transpose operation is quite cache-friendly,
> >much better than your current naive code.
>
> For extra marks, produce a much more cache-friendly version, using
> blocks rather than rows ....
>
> You can even do that in place, and make it efficient.  But I suspect
> that you (Terje) know all of this!
>
What would happen if you changed the default storage order for arrays
of dimension two and greater, so that they were always stored in
blocks with some cache or node-friendly order? You could make it
nearly transparent in a language like c++ by defining appropriate
objects.

Robert.
From: George Neuner on
On Wed, 07 Apr 2010 12:05:03 -0500, Vladimir Vassilevsky
<nospam(a)nowhere.com> wrote:

>Rob Gaddi wrote:
>
>> On 4/7/2010 8:35 AM, Paul Carpenter wrote:
>>
>>> [snip]
>>> Everytime you use array[ index ] the complier creates an arithmetic
>>> pointer manipulation to calculate the actual pointer to the desired
>>> location.
>>
>> Is that really true? It seems like the sort of thing that -O3 ought to
>> be able to take care of.
>
>It depends. Compilers can optimize simple index arithmetics into
>manipulation with pointers. However they are goofing with increased
>level of loop nesting and more complicated index expressions. For that
>matter, pointer arithmetics is usually faster.

It depends on the compiler. Most C90 and later compilers do a pretty
good job optimizing array indexing and are not so easily confused by
high dimensionality or sub-array blocking. In many cases you will
actually confuse them and reduce performance if you try to do your own
pointer arithmetic.

That said, old versions of GCC (certainly anything prior to 3.x) were
notoriously bad at optimizing even 3-D array indexing. If your chip
requires using one of these old compilers, then you certainly should
do your own pointer arithmetic.

George
From: Ulf Samuelsson on
Noob skrev:
> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
> [ Followup-To: set to comp.programming ]
>
> Hello,
>
> I imagine the following problem has been efficiently solved
> over a million times in the past.
>
> I need to rotate a picture clockwise 90 degrees.

Modern Hardware like the AT91SAM9M10 will do this in a H/W accelerator.

Best Regards
Ulf Samuelsson




>
> Conceptually, my picture is represented by a matrix
> (W columns, H rows) of pixels, with each pixel
> represented by 3 octets.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
>
> At the implementation level, I'm dealing with 1D integer arrays.
>
> int src[W*H*3];
> int dest[W*H*3];
>
> Conceptually, the rotation operation comes down to
>
> rotated_picture[j][H-1-i] = original_picture[i][j]
>
> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
> Consider as an example,
> W = 1644
> H = 1164
> size = 5.74 MB
>
> On my target platform, the rotation takes 1.65 seconds.
>
> I'm trying to get this down under half a second.
>
> I'm using the GNU tool chain. For some weird reason, we
> compile everything -O0. The first thing I'll try is crank
> gcc's optimization level.
>
> I'm hoping gcc can perform some strength reduction, as the
> index calculation seems to be taking a non-negligible fraction
> of the total run-time.
>
> Changing the loop to
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> memcpy(dest+(H*j+H-i-1)*3, src, 3);
> src += 3;
> }
>
> brings a 5% improvement.
>
> I thought changing the memcpy call to 3 explicit assignments
> might help, but it actually slowed things down.
>
> Perhaps I need to perform some tiling magic... I don't think
> gcc performs automatic tiling?
>
> Comments and insight would be greatly appreciated.
>
> Regards.
From: Ulf Samuelsson on
Noob skrev:
> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
> [ Followup-To: set to comp.programming ]
>
> Hello,
>
> I imagine the following problem has been efficiently solved
> over a million times in the past.
>
> I need to rotate a picture clockwise 90 degrees.

Modern Hardware like the AT91SAM9M10 will do this in a H/W accelerator.

Best Regards
Ulf Samuelsson




>
> Conceptually, my picture is represented by a matrix
> (W columns, H rows) of pixels, with each pixel
> represented by 3 octets.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
>
> At the implementation level, I'm dealing with 1D integer arrays.
>
> int src[W*H*3];
> int dest[W*H*3];
>
> Conceptually, the rotation operation comes down to
>
> rotated_picture[j][H-1-i] = original_picture[i][j]
>
> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
> Consider as an example,
> W = 1644
> H = 1164
> size = 5.74 MB
>
> On my target platform, the rotation takes 1.65 seconds.
>
> I'm trying to get this down under half a second.
>
> I'm using the GNU tool chain. For some weird reason, we
> compile everything -O0. The first thing I'll try is crank
> gcc's optimization level.
>
> I'm hoping gcc can perform some strength reduction, as the
> index calculation seems to be taking a non-negligible fraction
> of the total run-time.
>
> Changing the loop to
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> memcpy(dest+(H*j+H-i-1)*3, src, 3);
> src += 3;
> }
>
> brings a 5% improvement.
>
> I thought changing the memcpy call to 3 explicit assignments
> might help, but it actually slowed things down.
>
> Perhaps I need to perform some tiling magic... I don't think
> gcc performs automatic tiling?
>
> Comments and insight would be greatly appreciated.
>
> Regards.
From: Thad Smith on
Noob wrote:
>
> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
> Consider as an example,
> W = 1644
> H = 1164
> size = 5.74 MB
>
> On my target platform, the rotation takes 1.65 seconds.
>
> I'm trying to get this down under half a second.

If I were tackling it, I would first check Ulf's suggestion to see if there is a
hardware solution. If not, look at the generated machine code and compare with
what you could hand code. If you can beat the compiler, either tease the code
to get better results, checking the compiler output as you go, or use an
assembly routine. I learn a lot by studying compiler output.

--
Thad
First  |  Prev  |  Next  |  Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Prev: #include "cpuid.os"
Next: aspect ratio algorithm needed.