From: Matt J on
"Juliette Salexa" <juliette.physicist(a)gmail.com> wrote in message <hpghsb$in0$1(a)fred.mathworks.com>...

>
> The majority of the time was spent creating the matrix A, and almost zero time doing the summation.
===============

Apples and oranges. sprand is written in M-code, whereas the summation is executed almost entirely in optimized C-code.
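
To see the split for yourself, you can time the two phases separately. The original test code isn't shown in this thread, so the size and density below are assumptions for illustration only, and the timings will vary by machine and MATLAB version:

```matlab
% Rough timing sketch; n and density are illustrative assumptions.
n = 2000;
density = 0.01;

tic
A = sprand(n, n, density);   % sprand is implemented as an M-file
tCreate = toc;

tic
s = full(sum(A(:)));         % sum is a built-in; full() turns the sparse scalar into a double
tSum = toc;

fprintf('create: %.4g s, sum: %.4g s\n', tCreate, tSum);
```

The gap mostly reflects how each function is implemented, not the intrinsic cost of generating entries versus adding them.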
From: Matt J on
"Juliette Salexa" <juliette.physicist(a)gmail.com> wrote in message <hpghsb$in0$1(a)fred.mathworks.com>...

>
> I would suspect (just intuitively) that these computations would be more expensive than the data transfers between RAM and CPU.
=====

Even if that's true, in the real world you would have no choice but to store a precomputed matrix on a hard drive rather than in RAM. So what you really need to analyze is whether the time to compute M(i,j) outweighs the time to transfer it from disk, not from RAM.

If it does, it might be worthwhile to precompute your M and store it on the hard drive after all. Note also that this would not necessarily rule out using MATLAB for the matrix-vector multiplication: a MATLAB workspace variable can be backed by data on disk using the memmapfile function. You could try it, I suppose...
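
Something along these lines is what I have in mind. It is only a sketch: the file name, the small size n, and the panel width blk are made up for illustration, and I write a random test matrix to disk just so there is something to map.

```matlab
% Sketch: keep a matrix on disk and multiply it by a vector panel by panel.
% n, blk, and the temporary file are illustrative assumptions.
n = 200;                       % small stand-in for the real dimension
blk = 50;                      % columns per panel (chosen to divide n here)
fname = [tempname '.bin'];

% Write a test matrix to disk as raw doubles in column-major order.
M = rand(n);
fid = fopen(fname, 'w');
fwrite(fid, M, 'double');
fclose(fid);

% Map the file; m.Data.M then indexes like an n-by-n array backed by the file.
m = memmapfile(fname, 'Format', {'double', [n n], 'M'}, 'Repeat', 1);

v = rand(n, 1);
y = zeros(n, 1);
for j0 = 1:blk:n
    cols = j0:(j0 + blk - 1);
    y = y + m.Data.M(:, cols) * v(cols);   % touches only this panel of the file
end

clear m                        % release the mapping before deleting the file
delete(fname);
```

Whether this beats recomputing M(i,j) depends on your disk and on how expensive each element is, so it would have to be timed on the real problem.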

Typically, though, for matrices of this size (100 GB), people don't expect precomputation to be optimal and opt for on-the-fly computation of the M(i,j). However, I don't know how thoroughly this assessment is ever made.
From: Juliette Salexa on
"Matt J " <mattjacREMOVE(a)THISieee.spam> wrote in message <hpj3ce$ihc$1(a)fred.mathworks.com>...
> "Juliette Salexa" <juliette.physicist(a)gmail.com> wrote in message <hpghsb$in0$1(a)fred.mathworks.com>...
>
> > in the real world, you would have no choice but to store the matrix on hard drive, if you were to precompute it, as opposed to RAM. So, what you really need to be analyzing is whether the computation time of M(i,j) outweighs data transfer time from disk, rather than with data transfer from RAM.

Thanks Matt J,

I've calculated the RAM requirements, and for the cluster that I'm using, storing the matrix won't be a problem.

> Apples and oranges. sprand is written in M-code, whereas the summation is executed almost entirely in optimized C-code

Thanks for pointing that out. I realize now that it was a bad example, since the summation had an unfair advantage. I'm also starting to see that what I asked originally is not easy to analyze and needs to be examined on a case-by-case basis.
From: vortse a on
If I were you, I would write the for loop to compute on the fly a submatrix of M sized to fit in RAM, multiply that submatrix by the corresponding portion of V, and add the result to your sum. This way you drastically reduce the number of loop iterations, and you can tune the submatrix size for performance. Doesn't the computation of the elements of M get a speedup from vectorisation as well?
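
A minimal sketch of what I mean is below. I don't know your actual formula for M(i,j), so elemFun is a hypothetical placeholder (a Hilbert-type entry), and n and blk are illustrative values you would replace and tune:

```matlab
% Hypothetical element rule standing in for the real M(i,j); it is written
% to accept index arrays so a whole panel can be built in one call.
elemFun = @(I, J) 1 ./ (I + J - 1);

n = 1000;                 % stand-in dimension
blk = 128;                % panel width: the knob to tune for speed and RAM
v = (1:n).';
y = zeros(n, 1);
rows = (1:n).';

for j0 = 1:blk:n
    cols = j0:min(j0 + blk - 1, n);    % the last panel may be narrower
    [J, I] = meshgrid(cols, rows);     % column and row index grids for this panel
    Mpanel = elemFun(I, J);            % n-by-numel(cols) slice of M
    y = y + Mpanel * v(cols);          % accumulate this panel's contribution
end
```

The loop now runs ceil(n/blk) times instead of once per element, and only one panel of M is ever in memory at a time.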
From: Juliette Salexa on
"vortse a" <sonoffeanor-remove(a)yahoo.com> wrote in message <hpsegb$975$1(a)fred.mathworks.com>...
> If I were you, I would write the for loop to compute on the fly a submatrix of M sized to fit in RAM, multiply that submatrix by the corresponding portion of V, and add the result to your sum. This way you drastically reduce the number of loop iterations, and you can tune the submatrix size for performance. Doesn't the computation of the elements of M get a speedup from vectorisation as well?

Hello, thank you for the suggestion,

Using submatrices seems like a very sensible thing to do.

As for whether the computations that create the matrix M can be sped up by vectorization: that's another problem I've been thinking about, and it is probably even bigger than the question I posted. It would involve constructing a tensor with as many elements as there are iterations of the for loop, whose rank is somewhere around log base 4 of that number ...

In theory I expect this approach to be faster than using a nested for loop (a for loop inside each iteration of the for loop for the matrix-vector multiplication), because very many iterations could be taken care of at once. One can imagine constructing M from the outer product of two vectors, which would be faster than calculating the elements of M with a for loop. But I think it would depend on how well the tensor package that I'm using is coded: I'm not sure whether its overhead would outweigh the cost of calculating the elements in a for loop.
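
To make the outer-product picture concrete, here is a toy comparison. The vectors a and b are made-up stand-ins, not my actual data, and the timings will differ from machine to machine:

```matlab
% Toy comparison: build M(i,j) = a(i)*b(j) with a nested loop and with an
% outer product. a and b are illustrative assumptions.
a = (1:300).';
b = linspace(0, 1, 400).';

% Nested for loop: one scalar multiply per element.
Mloop = zeros(numel(a), numel(b));
tic
for i = 1:numel(a)
    for j = 1:numel(b)
        Mloop(i, j) = a(i) * b(j);
    end
end
tLoop = toc;

% Outer product: the whole matrix in a single vectorised call.
tic
Mouter = a * b.';
tOuter = toc;

fprintf('loop: %.4g s, outer product: %.4g s\n', tLoop, tOuter);
```

The real question for me is whether the tensor version keeps this kind of advantage once the package's own overhead is included.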

Thanks,
Juliette.