From: jonathan on 13 Feb 2010 19:42

On Feb 2, 8:52 am, Jean-Pierre Rosen <ro...(a)adalog.fr> wrote:
> Jerry a écrit :
> > I've never understood why Ada does not allow slicing in
> > multidimensional arrays. What are the safety issues involved? And
> > how is it safe to force the programmer into ad hoc methods?
>
> One-dimensional slices are simple and efficient. Multidimensional
> slices are a can of worms.
>
> I guess you are thinking about rectangular slices. But why stop
> there? A slice comprising the main diagonal and the diagonals above
> and below can be very useful for some calculations. Or a slice which
> is the triangular part of the upper half...
>
> AFAICT, Fortran-99 does provide this - and the syntax is so
> complicated that nobody uses it. And implementation is also a
> nightmare.
>
> When designing a programming language, you have to stop at some
> point. The ratio (cost of implementation) / usefulness is a good
> measure for this. I think the ratio was simply too high for this
> feature.
>
> --
> ---------------------------------------------------------
> J-P. Rosen (ro...(a)adalog.fr)
> Visit Adalog's web site at http://www.adalog.fr

I've never felt the need for two-dimensional (or higher-dimensional)
slicing. It's partly a performance issue: if you make the data storage
matrix as small as possible (i.e. exactly the size of the matrix you
are transforming), then you sometimes get a disastrous loss of
efficiency. If on the other hand you always make the data storage
matrix larger than necessary (so that your transformation is always
performed on a sub-matrix of the data storage matrix), then you always
have the option of avoiding these efficiency losses. And once you
write the transformation routine so that it operates on sub-matrices
of the data storage matrix, you usually don't have to slice or copy a
sub-matrix out of the data storage matrix in order to transform it.

Here are some Ada and Fortran examples of the problem on various sized
data storage arrays. I used some bits and pieces of routines from

   http://web.am.qub.ac.uk/users/j.parker/miscellany/

First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
The data storage matrix is M x M = (1024+Padding) x (1024+Padding)
Here is the running time in seconds of an iterative Jacobi
eigen-decomposition:

   2048x2048:  322 sec, gnat     (Padding=24)
   2048x2048: 1646 sec, gnat     (Padding=0)
   2048x2048: 1632 sec, gfortran (Padding=0)
   2048x2048: 1492 sec, ifort    (Padding=0)

The observed 500% slowdown in the absence of padded arrays is
unacceptable, even if it is a rare event (occurring only on 2**p sized
data arrays). In fact it's not all that rare ... more comments on that
below. (BTW, ifort is the Intel Fortran compiler, all optimizations at
max; gfortran is the GCC variant.)

By the way, I never bothered to write a paddable Fortran routine, but
here are some more timings near a power-of-2 array bound:

   1023x1023: 33.5 sec, gnat     (Padding=9)
   1023x1023: 42.4 sec, gfortran (Padding=0)
   1023x1023: 37.3 sec, ifort    (Padding=0)

   1022x1022: 33.5 sec, gnat     (Padding=10)
   1022x1022: 30.2 sec, gfortran (Padding=0)
   1022x1022: 28.3 sec, ifort    (Padding=0)

   1024x1024: 33.2 sec, gnat     (Padding=8)
   1024x1024: 96.0 sec, gnat     (Padding=0)
   1024x1024:  116 sec, gfortran (Padding=0)
   1024x1024:   43 sec, ifort    (Padding=0)
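To make the padding idea concrete, here is a minimal Ada sketch (the
names and sizes are hypothetical; this is not taken from the routines
used for the timings). The data storage matrix is declared slightly
larger than the matrix being transformed, so that the distance in
memory between consecutive rows is never an exact power of two:

   procedure Padding_Demo is
      N       : constant := 1024;        -- order of the matrix we transform
      Padding : constant := 8;           -- keeps the row stride off a power of 2
      M       : constant := N + Padding; -- order of the data storage matrix

      type Real       is digits 15;
      type Matrix     is array (0 .. M - 1, 0 .. M - 1) of Real;
      type Matrix_Ptr is access Matrix;

      A : constant Matrix_Ptr := new Matrix'(others => (others => 0.0));
   begin
      --  Work only on the leading N x N sub-matrix.  The padded rows
      --  and columns are never read or written; they merely change
      --  the memory stride between consecutive rows.
      for Row in 0 .. N - 1 loop
         for Col in 0 .. N - 1 loop
            A (Row, Col) := A (Row, Col) + 1.0;  -- stand-in for real work
         end loop;
      end loop;
   end Padding_Demo;

Roughly speaking: with Padding = 0 and 8-byte reals, the row stride of
an N = 1024 matrix is exactly 2**13 bytes, so walking down a column
keeps hitting the same few cache sets; the small padding breaks that
pattern.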
There is one puzzle here I don't have time to solve. Normally, a good
Fortran compiler will automatically pad the array for you ... I recall
that happening in the past. This time it seems to have slipped its
Fortran mind. The ifort man pages say that -O3 enables "Padding the
size of certain power-of-two arrays to allow more efficient cache
use." But I used -O3 and also -O3 -fast ... maybe I did something
wrong. The important lesson, though: compiler optimization policies
change with time, and of course vary from compiler to compiler. You
can't rely on them or even, amazingly, on man pages. It's much better
to write the program in a way that is insensitive to changing
optimization policies.

A web search for "cache thrashing" will reveal much depressing detail
on the subject. The efficiency problems discussed above occur as our
arrays become large and spill out of the L3 cache (6 MB in the present
case). Just to demonstrate that these problems show up on all sorts of
arrays, I did some runs in the 2000 to 3000 range, this time using a
Householder decomposition scavenged from the Golub singular value
decomposition. You can still find plenty of 500% slowdowns:

   2102x2102,  3.93 sec, gnat (Padding = 0)
   2102x2102,  3.19 sec, gnat (Padding = 8)

   2176x2176,  5.03 sec, gnat (Padding = 0)
   2176x2176,  3.42 sec, gnat (Padding = 8)

   2304x2304,  8.47 sec, gnat (Padding = 0)
   2304x2304,  4.52 sec, gnat (Padding = 8)

   2560x2560,  24.1 sec, gnat (Padding = 0)
   2560x2560,  5.42 sec, gnat (Padding = 8)

   3072x3072,  38.9 sec, gnat (Padding = 0)
   3072x3072,  7.76 sec, gnat (Padding = 8)

   3584x3584,  53.2 sec, gnat (Padding = 0)
   3584x3584,  11.5 sec, gnat (Padding = 8)

Finally, an example on one-dimensional arrays, using fast Fourier
transforms (FFTs). The standard, and most common, FFT is computed on a
power-of-2 length data set: 0 .. 2**p - 1. I timed computation of
these FFTs on arrays of length 2**p, and I compared this with
computation on arrays of length 2**p + Padding, where Padding = 24.
The computation on the padded arrays was faster. The ratios of running
times are:

   p = 10   ratio = .93
   p = 11   ratio = .88
   p = 12   ratio = .76
   p = 13   ratio = .79
   p = 14   ratio = .76
   p = 15   ratio = .75
   p = 16   ratio = .77
   p = 17   ratio = .84
   p = 18   ratio = .75
   p = 19   ratio = .45
   p = 20   ratio = .63
   p = 21   ratio = .69
   p = 22   ratio = .69
   p = 23   ratio = .67
   p = 24   ratio = .62
   p = 25   ratio = .62

So here the problem is more common, but smaller. (I still consider
these efficiency losses unacceptable, especially in a routine whose
reason for existence is efficiency.) The problem is still worse when
you take FFTs of two-dimensional arrays.

There is another (and entirely independent) reason I prefer routines
that perform their transformations on arbitrary sub-matrices (or on
arbitrary diagonal blocks) of the data storage matrix. After writing
my first linear algebra routine, I was very pleased with myself, but
it didn't really do what I wanted. I realized I needed to transform
the diagonal sub-blocks of a large matrix, and to do it iteratively on
arbitrarily sized diagonal sub-blocks. It was a very simple matter to
modify the code to do this, and the result was so convenient that I've
never considered doing it otherwise (see the sketch below).

Jonathan
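A minimal Ada sketch of that calling convention (the generic and the
names are hypothetical, not the posted routines): the transformation
receives the whole data storage matrix together with the bounds of the
diagonal sub-block it is to operate on, so nothing ever needs to be
sliced or copied:

   generic
      type Real   is digits <>;
      type Matrix is array (Integer range <>, Integer range <>) of Real;
   procedure Transform_Block
     (A                        : in out Matrix;
      First_Index, Final_Index : in     Integer);
   --  Transforms the diagonal sub-block
   --  A (First_Index .. Final_Index, First_Index .. Final_Index)
   --  of a (possibly padded) data storage matrix A.

   procedure Transform_Block
     (A                        : in out Matrix;
      First_Index, Final_Index : in     Integer) is
   begin
      for Row in First_Index .. Final_Index loop
         for Col in First_Index .. Final_Index loop
            A (Row, Col) := 2.0 * A (Row, Col);  -- stand-in for the real work
         end loop;
      end loop;
   end Transform_Block;

Iterating over arbitrarily sized diagonal sub-blocks is then just a
matter of varying First_Index and Final_Index between calls.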
From: Hibou57 (Yannick Duchêne) on 13 Feb 2010 20:54

On Sun, 14 Feb 2010 01:42:13 +0100, jonathan
<johnscpg(a)googlemail.com> wrote:
> First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
> The data storage matrix is M x M = (1024+Padding) x (1024+Padding)
> Here is the running time in seconds of an iterative Jacobi
> eigen-decomposition:
>
>    2048x2048:  322 sec, gnat     (Padding=24)
>    2048x2048: 1646 sec, gnat     (Padding=0)
>    2048x2048: 1632 sec, gfortran (Padding=0)
>    2048x2048: 1492 sec, ifort    (Padding=0)
>
> The observed 500% slowdown in the absence of padded arrays is
> unacceptable, even if it is a rare event (occurring only on 2**p
> sized data arrays). In fact it's not all that rare ... more comments
> on that below. (BTW, ifort is the Intel Fortran compiler, all
> optimizations at max; gfortran is the GCC variant.)

So, in the end, this is mostly about representation clauses. Is that
it?

I don't know if you already know this document (as I remember, I
picked it up from some thread on comp.lang.ada); I've talked about it
on fr.c.l.a:

http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

I had pointed at frames #17, #18, #19 and #20, which contain a good
source of inspiration. I hope this helps you figure out a path.

You've posted a long list of benchmarks and observations. I have not
looked at everything, but I hope to take a closer look at it later.

--
No-no, this isn't an oops ...or I hope (TM) - Don't blame me... I'm
just not lucky
From: jonathan on 14 Feb 2010 11:16

On Feb 14, 1:54 am, Hibou57 (Yannick Duchêne)
<yannick_duch...(a)yahoo.fr> wrote:
> On Sun, 14 Feb 2010 01:42:13 +0100, jonathan
> <johns...(a)googlemail.com> wrote:

First, a correction. I wrote:

   First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
   The data storage matrix is M x M = (1024+Padding) x (1024+Padding)

which should of course be:

   First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
   The data storage matrix is M x M = (2048+Padding) x (2048+Padding)

> I don't know if you already know this document (as I remember, I
> picked it up from some thread on comp.lang.ada); I've talked about
> it on fr.c.l.a:
> http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls...
> I had pointed at frames #17, #18, #19 and #20, which contain a good
> source of inspiration. I hope this helps you figure out a path.

Yes, I remembered this, probably from an old post of yours. I wanted
to cite it when I posted earlier, but I could not find the site. This
is not something you forget quickly (frames 17 and 18):

   1980: RAM latency ~ 1 cycle
   2009: RAM latency ~ 400+ cycles

It's the heart of the matter, and it is just getting worse. It helps
convince me, anyway, that I did not waste time on an unimportant
matter! In numerical linear algebra the usual solution is to
restructure matrices as a collection of blocks (sketched below),
though that has a few problems of its own.

Minor footnote: I did some tests on Intel's new Nehalem CPUs. Vastly
improved performance on these multi-megabyte arrays; the problem is
not cured, though. I don't know enough to say more about it.

Thanks for the reminder.

Jonathan
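For illustration, a minimal Ada sketch of the blocked layout mentioned
above (my own construction; block and matrix sizes are hypothetical).
Each small block is stored contiguously, so the inner loops of a
transformation can stay within a cache-friendly region regardless of
the logical matrix size:

   procedure Blocked_Demo is
      Block_Size : constant := 64;  -- hypothetical; tuned to the cache
      N_Blocks   : constant := 32;  -- 32 * 64 = 2048 logical rows/columns

      type Real  is digits 15;
      type Block is
         array (0 .. Block_Size - 1, 0 .. Block_Size - 1) of Real;
      type Block_Grid is
         array (0 .. N_Blocks - 1, 0 .. N_Blocks - 1) of Block;
      type Grid_Ptr is access Block_Grid;

      A : constant Grid_Ptr :=
        new Block_Grid'(others => (others => (others => (others => 0.0))));
   begin
      --  Element (I, J) of the logical 2048 x 2048 matrix lives at
      --  A (I / Block_Size, J / Block_Size)
      --    (I mod Block_Size, J mod Block_Size).
      A (0, 0) (1, 2) := 3.14;
   end Blocked_Demo;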
From: David Thompson on 16 Feb 2010 01:51
On Tue, 02 Feb 2010 09:52:31 +0100, Jean-Pierre Rosen
<rosen(a)adalog.fr> wrote:
> Jerry a écrit :
> > I've never understood why Ada does not allow slicing in
> > multidimensional arrays. What are the safety issues involved? And
> > how is it safe to force the programmer into ad hoc methods?
>
> One-dimensional slices are simple and efficient. Multidimensional
> slices are a can of worms.
>
> I guess you are thinking about rectangular slices. But why stop
> there? A slice comprising the main diagonal and the diagonals above
> and below can be very useful for some calculations. Or a slice which
> is the triangular part of the upper half...
>
> AFAICT, Fortran-99 does provide this - and the syntax is so
> complicated that nobody uses it. And implementation is also a
> nightmare.

Fortran 90 (and later) has rectangular, but optionally
non-unit-stride, slices; X(1:5:4,2:6:2) is X(1,2) X(1,4) X(1,6) X(5,2)
X(5,4) X(5,6). (Fortran arrays are column-major.) And although Fortran
treats a string -- fixed length only -- as a different type than an
array of characters, you can use corresponding substrings of the
elements of an array of strings in a way that is effectively also
rectangular.

It also has 'vector subscripting': X(Y) accesses X(Y(1)) X(Y(2)) ...
X(Y(N)) -- but such a slice cannot be used as a modifiable actual
argument or POINTER target, making it not really a first-class
object/variable.

A major new 'coarray' feature, which essentially allows distribution
across parallel processing systems (vaguely like an in-language
OpenMP, although I hear the standardese must be rather contorted to
avoid impermissibly specifying an implementation), was proposed for
what was scheduled to be F08, but it proved contentious enough that
AFAIK it still hasn't been finalized.

*PL/I* has the X DEFINED Y (iSUB) syntax, which allows things like a
diagonal (but AFAICS not more than one at a time).

> When designing a programming language, you have to stop at some
> point. The ratio (cost of implementation) / usefulness is a good
> measure for this. I think the ratio was simply too high for this
> feature.

The F90 features in this area, at least, weren't too much of a
problem, judging from the reports of visibly intelligent and
apparently informed people in c.l.f. Those implementors had the
advantage, though, that F90 was originally scheduled for (I believe)
'87, and even before that there had been experience with the
nonstandard but fairly widespread HPF "High Performance Fortran"
extensions. In contrast F03, which adds features mostly in the areas
of OO and 'parameterized' types somewhat like Ada's discriminated
ones, has taken longer, although most vendors are now reportedly
getting close.
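For comparison with the Fortran 90 syntax above: Ada has no built-in
equivalent of X(1:5:4,2:6:2), but a rectangular strided slice is
straightforward to extract by hand. A sketch (all names are my own
invention):

   generic
      type Real   is digits <>;
      type Matrix is array (Integer range <>, Integer range <>) of Real;
   function Strided_Slice
     (X                   : Matrix;
      Row_First, Row_Last : Integer;
      Row_Stride          : Positive;
      Col_First, Col_Last : Integer;
      Col_Stride          : Positive) return Matrix;

   function Strided_Slice
     (X                   : Matrix;
      Row_First, Row_Last : Integer;
      Row_Stride          : Positive;
      Col_First, Col_Last : Integer;
      Col_Stride          : Positive) return Matrix
   is
      N_Rows : constant Integer := (Row_Last - Row_First) / Row_Stride + 1;
      N_Cols : constant Integer := (Col_Last - Col_First) / Col_Stride + 1;
      Result : Matrix (1 .. N_Rows, 1 .. N_Cols);
   begin
      for I in 1 .. N_Rows loop
         for J in 1 .. N_Cols loop
            Result (I, J) :=
              X (Row_First + (I - 1) * Row_Stride,
                 Col_First + (J - 1) * Col_Stride);
         end loop;
      end loop;
      return Result;
   end Strided_Slice;

An instantiation called as Strided_Slice (X, 1, 5, 4, 2, 6, 2) returns
the six elements X(1,2) X(1,4) X(1,6) X(5,2) X(5,4) X(5,6) as a fresh
2 x 3 matrix. Note that this is a copy, whereas the Fortran section
designates the original elements (it can appear on the left side of an
assignment); that view semantics is exactly the part that is hard to
retrofit into Ada.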