Prev: Which is the most beautiful and memorable hardware structure in a ?CPU?
Next: Energy usage per application ?
From: Noob on 7 Apr 2010 06:17 [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] [ Followup-To: set to comp.programming ] Hello, I imagine the following problem has been efficiently solved over a million times in the past. I need to rotate a picture clockwise 90 degrees. Conceptually, my picture is represented by a matrix (W columns, H rows) of pixels, with each pixel represented by 3 octets. pixel original_picture[W][H]; pixel rotated_picture[H][W]; At the implementation level, I'm dealing with 1D integer arrays. int src[W*H*3]; int dest[W*H*3]; Conceptually, the rotation operation comes down to rotated_picture[j][H-1-i] = original_picture[i][j] My actual code is for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); Consider as an example, W = 1644 H = 1164 size = 5.74 MB On my target platform, the rotation takes 1.65 seconds. I'm trying to get this down under half a second. I'm using the GNU tool chain. For some weird reason, we compile everything -O0. The first thing I'll try is crank gcc's optimization level. I'm hoping gcc can perform some strength reduction, as the index calculation seems to be taking a non-negligible fraction of the total run-time. Changing the loop to for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; } brings a 5% improvement. I thought changing the memcpy call to 3 explicit assignments might help, but it actually slowed things down. Perhaps I need to perform some tiling magic... I don't think gcc performs automatic tiling? Comments and insight would be greatly appreciated. Regards.
From: Andrew Smallshaw on 7 Apr 2010 10:40 On 2010-04-07, Noob <root(a)127.0.0.1> wrote: > > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); I haven't looked at it in detail but that immediately caught my eye. Surely: memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int)); -- Andrew Smallshaw andrews(a)sdf.lonestar.org
From: Terje Mathisen "terje.mathisen at on 7 Apr 2010 11:59 Follow-up restored to include a group I read) Noob wrote: > [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > [ Followup-To: set to comp.programming ] > > Hello, > > I imagine the following problem has been efficiently solved > over a million times in the past. > > I need to rotate a picture clockwise 90 degrees. > > Conceptually, my picture is represented by a matrix > (W columns, H rows) of pixels, with each pixel > represented by 3 octets. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > This is indeed a well-known problem, the key to make it fast is to realize that it is very similar to a transpose operation! I.e. first copy everything to the target array (in a single block move), then transpose it, finally you reverse each of the rows. The difference between clock and counter-clock-wise rotation is in doing the row reverse either before or after the transpose. Try both since I don't remeber which is which! The key here is that the transpose operation is quite cache-friendly, much better than your current naive code. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: nmm1 on 7 Apr 2010 13:10 In article <romu87-uk61.ln1(a)ntp.tmsw.no>, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote: >Noob wrote: >> >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. >> >> Conceptually, my picture is represented by a matrix >> (W columns, H rows) of pixels, with each pixel >> represented by 3 octets. >> >> pixel original_picture[W][H]; >> pixel rotated_picture[H][W]; > >This is indeed a well-known problem, the key to make it fast is to >realize that it is very similar to a transpose operation! > >I.e. first copy everything to the target array (in a single block move), >then transpose it, finally you reverse each of the rows. > >The difference between clock and counter-clock-wise rotation is in doing >the row reverse either before or after the transpose. Try both since I >don't remeber which is which! > >The key here is that the transpose operation is quite cache-friendly, >much better than your current naive code. For extra marks, produce a much more cache-friendly version, using blocks rather than rows .... You can even do that in place, and make it efficient. But I suspect that you (Terje) know all of this! Regards, Nick Maclaren.
From: Robert Myers on 7 Apr 2010 13:32
On Apr 7, 1:10 pm, n...(a)cam.ac.uk wrote: > In article <romu87-uk61....(a)ntp.tmsw.no>, > Terje Mathisen <"terje.mathisen at tmsw.no"> wrote: > > > > > > >Noob wrote: > > >> I imagine the following problem has been efficiently solved > >> over a million times in the past. > > >> I need to rotate a picture clockwise 90 degrees. > > >> Conceptually, my picture is represented by a matrix > >> (W columns, H rows) of pixels, with each pixel > >> represented by 3 octets. > > >> pixel original_picture[W][H]; > >> pixel rotated_picture[H][W]; > > >This is indeed a well-known problem, the key to make it fast is to > >realize that it is very similar to a transpose operation! > > >I.e. first copy everything to the target array (in a single block move), > >then transpose it, finally you reverse each of the rows. > > >The difference between clock and counter-clock-wise rotation is in doing > >the row reverse either before or after the transpose. Try both since I > >don't remeber which is which! > > >The key here is that the transpose operation is quite cache-friendly, > >much better than your current naive code. > > For extra marks, produce a much more cache-friendly version, using > blocks rather than rows .... > > You can even do that in place, and make it efficient. But I suspect > that you (Terje) know all of this! > What would happen if you changed the default storage order for arrays of dimension two and greater, so that they were always stored in blocks with some cache or node-friendly order? You could make it nearly transparent in a language like c++ by defining appropriate objects. Robert. |