From: Adam Thibideau on
Hi there,

I have the following code to count how many points in a time series I will have when I average my input data. My problem is that my code takes way too long. Any suggestions on how to speed things up would be great because my input is about 60 million rows of csv data. Thanks in advance!

function x = textscanandintervalize(nrows)
t = cputime;

fid = fopen('163705359.csv');

i = 1;
aaplCount = 0;
ibmCount = 0;
dellCount = 0;

% Read the first chunk; the header line exists only at the top of the file.
textmatrix = textscan(fid,'%s%d%s%f32',nrows,'Delimiter',',','HeaderLines',1);
SymbolC = textmatrix{1};
YearV = textmatrix{2};
TimeC = textmatrix{3};
lengthofMatrix = length(SymbolC);

while(i<=lengthofMatrix)
    symb = SymbolC(i);

    tm={
        []
        double(YearV(i))
        char(TimeC(i))
        };

    currentDate = datevec(sprintf('%d %s',tm{2,1},tm{3,1}),'yyyymmdd HH:MM:SS');
    currentHour = currentDate(4);
    currentMin = currentDate(5);
    % Start of the 5-minute interval that contains the current row
    floorDate = currentDate;
    floorDate(5) = currentDate(5) - mod(currentDate(5),5);
    floorDate(6) = 0;
    symbol = symb;

    % Advance while the rows stay in the same symbol and 5-minute interval
    while(isequal(symbol,symb) && isequal(currentHour,floorDate(4)) && isequal((currentMin - mod(currentMin,5)),floorDate(5)))
        i = i+1;
        if(i > lengthofMatrix)
            % Chunk exhausted: read the next one. No 'HeaderLines' here,
            % otherwise one data row would be skipped per chunk.
            textmatrix = textscan(fid,'%s%d%s%f32',nrows,'Delimiter',',');
            SymbolC = textmatrix{1};
            YearV = textmatrix{2};
            TimeC = textmatrix{3};
            lengthofMatrix = length(SymbolC);
            currentDate = datevec(sprintf('%d %s',tm{2,1},tm{3,1}),'yyyymmdd HH:MM:SS');
            currentHour = currentDate(4);
            currentMin = currentDate(5);
            floorDate = currentDate;
            floorDate(5) = currentDate(5) - mod(currentDate(5),5);
            floorDate(6) = 0;
            i = 1;
        end
        if(i > lengthofMatrix)
            break
        end
        symbol = SymbolC(i);
        tm={
            []
            double(YearV(i))
            char(TimeC(i))
            };
        currentDate = datevec(sprintf('%d %s',tm{2,1},tm{3,1}),'yyyymmdd HH:MM:SS');
        currentHour = currentDate(4);
        currentMin = currentDate(5);
    end

    % Semicolons suppress printing the counter on every increment
    if(strcmp(symbol,'AAPL'))
        aaplCount = aaplCount + 1;
    elseif(strcmp(symbol,'IBM'))
        ibmCount = ibmCount + 1;
    elseif(strcmp(symbol,'DELL'))
        dellCount = dellCount + 1;
    end

end

fclose(fid);

disp(aaplCount);
disp(ibmCount);
disp(dellCount);
x = zeros(1);
t = cputime - t
end
From: Adam Thibideau on
Please help me! This is taking forever!

Thanks!
From: Walter Roberson on
Adam Thibideau wrote:

> I have the following code to count how many points in a time series i
> will have when I average my input data. My problem is that my code
> takes way too long. Any suggestions on how to speed things up would be
> great because my input is about 60 million rows of csv data. Thanks in
> advance!

Take a subset of the input file, run the program on that subset, using
the profiler to measure the performance. Look at the results of the
profiler to determine what is taking the most time and concentrate on
improving the performance of that.

Hints:

- parsing the year in as a decimal number and then sprintf()'ing that
decimal back into a string again is a waste of time: you might as well
leave it as a string

- your code assumes that the date will never jump forward into the same
5 minute slot; if you are willing to make that assumption then nearly
all of your date processing is a waste of time and you can make do with
string comparisons of the hour text fields and marginally more
sophisticated string comparisons of the minute text fields.
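
A rough sketch of what the second hint could look like in practice, assuming the time field always has the fixed form 'HH:MM:SS'. The helper name slotOf and the sample rows are made up for illustration; they are not from the original post. Like the original loop, this looks only at the symbol, hour and minute, not at the date:

```matlab
% Made-up rows standing in for one textscan chunk
SymbolC = {'AAPL'; 'AAPL'; 'AAPL'; 'IBM'; 'IBM'};
TimeC   = {'09:30:01'; '09:34:59'; '09:35:00'; '09:35:02'; '09:41:10'};

% Hypothetical helper: start of the 5-minute slot, in minutes since
% midnight, taken directly from the characters (no datevec, no sprintf)
slotOf = @(T) 60*((T(1)-'0')*10 + (T(2)-'0')) + ...
              5*floor(((T(4)-'0')*10 + (T(5)-'0'))/5);

slots = cellfun(slotOf, TimeC);

% A new averaged point starts at row 1 and wherever the symbol or slot changes
isNew = [true; ~strcmp(SymbolC(2:end), SymbolC(1:end-1)) | diff(slots) ~= 0];

aaplCount = sum(isNew & strcmp(SymbolC, 'AAPL'));
ibmCount  = sum(isNew & strcmp(SymbolC, 'IBM'));
```

When the file is read in chunks, the last symbol and slot of one chunk would still have to be compared against the first row of the next chunk, which this sketch leaves out.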
From: Jan Simon on
Dear Adam,


> tm={
> []
> double(YearV(i))
> char(TimeC(i))
> };
>
> currentDate = datevec(sprintf('%d %s', tm{2,1}, tm{3,1}), 'yyyymmdd HH:MM:SS');

There is absolutely no need to create the cell "tm". This was just an example to demonstrate the usage of SPRINTF to create a valid date string:
http://www.mathworks.com/matlabcentral/newsreader/view_thread/286603

The above lines can be simplified to:
currentDate = datevec(sprintf('%d %s', YearV(i), TimeC{i}),'yyyymmdd HH:MM:SS');

But I already showed you a faster method to get the date vector. Because DATEVEC wastes the most time in your program, it would be a good idea to use it:
Year = floor(D / 10000);
Month = rem(floor(D / 100), 100);
Day = rem(D, 100);
Time = sscanf(T, '%d:%d:%d');
V = [Year, Month, Day, reshape(Time, 1, 3)];

Kind regards, Jan
From: Adam Thibideau on
"Jan Simon" <matlab.THIS_YEAR(a)nMINUSsimon.de> wrote in message <i29sj2$gc8$1(a)fred.mathworks.com>...
> Dear Adam,
>
>
> > tm={
> > []
> > double(YearV(i))
> > char(TimeC(i))
> > };
> >
> > currentDate = datevec(sprintf('%d %s', tm{2,1}, tm{3,1}), 'yyyymmdd HH:MM:SS');
>
> There is absolutely no need to create the cell "tm". This was just an example to demonstrate the usage of SPRINTF to create a valid date string:
> http://www.mathworks.com/matlabcentral/newsreader/view_thread/286603
>
> The above lines can be simplified to:
> currentDate = datevec(sprintf('%d %s', YearV(i), TimeC{i}),'yyyymmdd HH:MM:SS');
>
> But I already showed you a faster method to get the date vector. Because DATEVEC wastes the most time in your program, it would be a good idea to use it:
> Year = floor(D / 10000); <<<----------------- WHAT IS D??
> Month = rem(floor(D / 100), 100);
> Day = rem(D, 100);
> Time = sscanf(T, '%d:%d:%d'); <<<---------WHAT IS T???
> V = [Year, Month, Day, reshape(Time, 1, 3)];
>
> Kind regards, Jan

Jan, thanks for these suggestions. I would like to use them but I am a little confused. What are D and T?
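
Reading Jan's snippet against the original code, D most plausibly stands for the numeric yyyymmdd date of one row (YearV(i)) and T for that row's time string (TimeC{i}). That is an inference from the datevec call it replaces, not something stated in the thread. A minimal sketch under that assumption, with made-up example values:

```matlab
% Assumed meaning of the two inputs (example values, not from the data file)
D = 20100722;      % stands in for double(YearV(i)), date as yyyymmdd
T = '09:32:15';    % stands in for TimeC{i}, time as 'HH:MM:SS'

Year  = floor(D / 10000);           % leading four digits
Month = rem(floor(D / 100), 100);   % middle two digits
Day   = rem(D, 100);                % last two digits
Time  = sscanf(T, '%d:%d:%d');      % 3x1 column: hour, minute, second

% Same layout as the output of datevec: [year month day hour minute second]
V = [Year, Month, Day, reshape(Time, 1, 3)];
```

Inside the loop, V would take the place of currentDate, so the rest of the code (currentDate(4), currentDate(5)) could stay as it is.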