From: Nick Maclaren on 26 Oct 2006 06:39 In article <1161856490.438501.274930(a)b28g2000cwb.googlegroups.com>, "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes: |> |> For n<=4, a matrix will typically fit into a cache line. Often, the |> code can be rejiggered to cache align the matrix and prefetch cache |> lines, although how far in advance of using a matrix its prefetch can |> be done varies by the problem. Not quite. A 4x4 matrix of 64-bit reals is 128 bytes. Aligning the matrix gives at most a factor of two gain. Prefetching is independent of whether the data are scalars or small matrices. Regards, Nick Maclaren.
From: ranjit_mathews@yahoo.com on 26 Oct 2006 10:48 Nick Maclaren wrote: > In article <1161856490.438501.274930(a)b28g2000cwb.googlegroups.com>, > "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes: > |> > |> For n<=4, a matrix will typically fit into a cache line. Often, the > |> code can be rejiggered to cache align the matrix and prefetch cache > |> lines, although how far in advance of using a matrix its prefetch can > |> be done varies by the problem. > > Not quite. A 4x4 matrix of 64-bit reals is 128 bytes. IA64, MIPS 4x00, Power4, and the SUN Fireplane Interconnect have 128 byte cache lines. > Aligning the > matrix gives at most a factor of two gain. Prefetching is independent > of whether the data are scalars or small matrices. An operation on a pair of matrices takes longer than an operation on scalars. So, prefetching before the former operation is prefetching well in advance of use whereas prefetching before the latter operation is prefetching little in advance of use.
From: Nick Maclaren on 26 Oct 2006 13:03 In article <1161874112.453687.208280(a)m7g2000cwm.googlegroups.com>, "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes: |> |> An operation on a pair of matrices takes longer than an operation on |> scalars. So, prefetching before the former operation is prefetching |> well in advance of use whereas prefetching before the latter operation |> is prefetching little in advance of use. Er, no. On a modern system, the time taken to perform even a 4x4 matrix multiply is much less than the time spent fetching the data from memory, so that the time is dominated by the delay in fetching the first values. Whether the matrix is in one cache line or four, the other values will be there when needed. The prefetch has to be the same amount ahead in both cases to eliminate the delay due to the fetch. Regards, Nick Maclaren.
From: ranjit_mathews@yahoo.com on 26 Oct 2006 16:12 Nick Maclaren wrote: > In article <1161874112.453687.208280(a)m7g2000cwm.googlegroups.com>, > "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes: > |> > |> An operation on a pair of matrices takes longer than an operation on > |> scalars. So, prefetching before the former operation is prefetching > |> well in advance of use whereas prefetching before the latter operation > |> is prefetching little in advance of use. > > Er, no. > > On a modern system, the time taken to perform even a 4x4 matrix > multiply is much less than the time spent fetching the data from > memory, so that the time is dominated by the delay in fetching the > first values. Whether the matrix is in one cache line or four, the > other values will be there when needed. Measure the timing of this C++ code on x86-64 with (1) a prefetch member routine that prefetches, and (2) a prefetch member routine that does nothing. class matrix { // fill in the details .... }; typedef matrix(4,4) mat4; #define N (1024*1024); ..... mat4 a[N], b, c[N]; // fill in matrixes with random 64 bit floats ..... for (i=0; i<N; i++) { if (i<N-1) a[i+1].prefetch(); c[i] = a[i]*b; } > The prefetch has to be the same amount ahead in both cases to > eliminate the delay due to the fetch. The latency might not be eliminated but would be reduced to a greater extent in the case of matrix multiplication than in the case of scalar multiplication. Try the above example with scalars. There would be little difference with and without prefetch. In the case of the matrixes, prefetching would make some difference. > Regards, > Nick Maclaren.
From: ranjit_mathews@yahoo.com on 26 Oct 2006 16:24
ranjit_mathews(a)yahoo.com wrote: > Nick Maclaren wrote: > > In article <1161874112.453687.208280(a)m7g2000cwm.googlegroups.com>, > > "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes: > > |> > > |> An operation on a pair of matrices takes longer than an operation on > > |> scalars. So, prefetching before the former operation is prefetching > > |> well in advance of use whereas prefetching before the latter operation > > |> is prefetching little in advance of use. > > > > Er, no. > > > > On a modern system, the time taken to perform even a 4x4 matrix > > multiply is much less than the time spent fetching the data from > > memory, so that the time is dominated by the delay in fetching the > > first values. Whether the matrix is in one cache line or four, the > > other values will be there when needed. > > Measure the timing of this C++ code on x86-64 with > (1) a prefetch member routine that prefetches, and > (2) a prefetch member routine that does nothing. On x86-64, if you do an unaligned implementation, your prefetch routine would have to prefetch the first, last and middle 32 bit word of the storage used for the matrix. > class matrix { > // fill in the details > ... > }; > > typedef matrix(4,4) mat4; > #define N (1024*1024); > > .... > mat4 a[N], b, c[N]; > // fill in matrixes with random 64 bit floats > .... > for (i=0; i<N; i++) { > if (i<N-1) a[i+1].prefetch(); > c[i] = a[i]*b; > } > > > The prefetch has to be the same amount ahead in both cases to > > eliminate the delay due to the fetch. > > The latency might not be eliminated but would be reduced to a greater > extent in the case of matrix multiplication than in the case of scalar > multiplication. Try the above example with scalars. There would be > little difference with and without prefetch. In the case of the > matrixes, prefetching would make some difference. > > > Regards, > > Nick Maclaren. |