From: dpb on
Ayotunde wrote:
> Basically i've got data consisting of observations on the S&P index for
> 6 years which for example for year 1999, the csv looks like
> SYMBOL,DATE,TIME,PRICE,SIZE,G127,CORR,COND,EX
>> SPY,19980102,09:31:41,97.3125,53500,0,0,,A
>> SPY,19980102,09:31:43,97.3125,100,0,0,,M
>> SPY,19980102,09:31:43,97.3125,200,0,0,,M
>> SPY,19980102,09:31:44,97.3125,4800,0,0,,P
>> SPY,19980102,09:31:44,97.3125,100,0,0,,M
....
>> SPY,19980102,09:31:44,97.3125,200,0,0,,M
>> SPY,19980102,09:31:44,97.3125,1000,0,0,,M
>> SPY,19980102,09:31:45,97.3125,900,0,0,,M
>> SPY,19980102,09:31:45,97.3125,300,0,0,,M
>> SPY,19980102,09:31:45,97.3125,200,0,0,,M
>> SPY,19980102,09:31:45,97.3125,300,0,0,,M
>> SPY,19980102,09:31:52,97.3125,200,0,0,,T
>> SPY,19980102,09:32:03,97.4375,1000,0,0,,A
>> SPY,19980102,09:32:11,97.4375,1000,0,0,,A
>> SPY,19980102,09:32:20,97.3125,100,0,0,Z,P
>> SPY,19980102,09:32:21,97.3125,200,0,0,,T
....

> my question is that i am trying to replicate the results a paper
> (article) and they have used 5 minute observations however as is visible
> from the example of the csv, my data is more frequent than every 5
> minutes. ...

Well, I don't know what the original paper's data stream looked like to
start with, either, but you don't have data in the sample dataset at
5-minute intervals. You have both replications at the same time stamp
and then intervals of (apparently) asynchronous time stamps at varying
intervals from 1 to as much as 9 minutes between samples in just this
short set.

So, there are some 5-minute intervals for which you have no data at all.

Methinks you need to look at the data source more carefully or
reevaluate the analysis you're comparing to before worrying to much
about the details of selection.

--
From: ImageAnalyst on
And finding the increment for 5 minutes is pretty easy empirically:
for k = 729757.397001 : .00001 : 729757.400800
deltaTime = k - 729757.397001;
fprintf(1, 'k=%f, time=%s, delta time = %f\n',...
k, datestr(k), deltaTime)
end

timeIncrement = 0.00348;
resampledTimes = 729757.397002315 : timeIncrement : 729758.39780;

for k = 1 : length(resampledTimes)
fprintf(1, 'k=%f, time=%s\n', k, datestr(resampledTimes(k)))
end

Looks to be about 0.00348 corresponds to about a 5 minute difference.
So can't you just pass in your first column (times) as your x, your
second column (the prices, with duplicates deleted) as y, and
resampledTimes as xi into interp1 to get your new interpolated
values? They'll be the interpolated values taken every 5 minutes.
From: us on
ImageAnalyst <imageanalyst(a)mailinator.com> wrote in message <f0aa7e5f-3e85-45d7-988a-8f23cab72ffa(a)p17g2000vbe.googlegroups.com>...
> And finding the increment for 5 minutes is pretty easy empirically:
> for k = 729757.397001 : .00001 : 729757.400800
> deltaTime = k - 729757.397001;
> fprintf(1, 'k=%f, time=%s, delta time = %f\n',...
> k, datestr(k), deltaTime)
> end
>
> timeIncrement = 0.00348;
> resampledTimes = 729757.397002315 : timeIncrement : 729758.39780;
>
> for k = 1 : length(resampledTimes)
> fprintf(1, 'k=%f, time=%s\n', k, datestr(resampledTimes(k)))
> end
>
> Looks to be about 0.00348 corresponds to about a 5 minute difference.
> So can't you just pass in your first column (times) as your x, your
> second column (the prices, with duplicates deleted) as y, and
> resampledTimes as xi into interp1 to get your new interpolated
> values? They'll be the interpolated values taken every 5 minutes.

a minute in ML time space

t0=now;
ts=t0+(0:9)/(24*60); % <- 1/hr/min
datestr(ts)
%{
16-May-2010 17:24:58
16-May-2010 17:25:58
16-May-2010 17:26:58
16-May-2010 17:27:58
16-May-2010 17:28:58
16-May-2010 17:29:58
16-May-2010 17:30:58
16-May-2010 17:31:58
16-May-2010 17:32:58
16-May-2010 17:33:58
%}

us
From: Ayotunde on
ImageAnalyst <imageanalyst(a)mailinator.com> wrote in message <f0aa7e5f-3e85-45d7-988a-8f23cab72ffa(a)p17g2000vbe.googlegroups.com>...
> And finding the increment for 5 minutes is pretty easy empirically:
> for k = 729757.397001 : .00001 : 729757.400800
> deltaTime = k - 729757.397001;
> fprintf(1, 'k=%f, time=%s, delta time = %f\n',...
> k, datestr(k), deltaTime)
> end
>
> timeIncrement = 0.00348;
> resampledTimes = 729757.397002315 : timeIncrement : 729758.39780;
>
> for k = 1 : length(resampledTimes)
> fprintf(1, 'k=%f, time=%s\n', k, datestr(resampledTimes(k)))
> end
>
> Looks to be about 0.00348 corresponds to about a 5 minute difference.
> So can't you just pass in your first column (times) as your x, your
> second column (the prices, with duplicates deleted) as y, and
> resampledTimes as xi into interp1 to get your new interpolated
> values? They'll be the interpolated values taken every 5 minutes.

hey ImageAnalyst, i seem to have more success with your method than with us's histc method but is there a way of limiting the resampled times (e.g 9.30-16.20). Also i get an error when trying to use interpl because even though i have deleted the duplicated rows, there ares till some duplicates in the time column as in the data i have prices seem to change once in while even though time hasnt changed. This means that interpl gives me this error
prices=interp1(q1,q2,resampledtimes);
??? Error using ==> interp1 at 261
The values of X should be distinct.
From: ImageAnalyst on
Yes you can determine the starting and ending times. Just figure out
what crazy long floating point numbers correspond to the times you
want to limit it to, and just pass in that subarray. Extract that out
to a brand new variable if you want. Then pass it in to interp1().

If you have two different prices for the same time then you need to
get rid of one of them. You could just go through with a for loop
although there probably is a more clever vectorized way to do it using
diff() on the times column (and then looking for zeros), such as this
little example:

a=[1 2;...
3 4;...
3 6; ...
4 5]
d = diff(a(:,1)) == 0
a(d,:) = []