From: Jerome on
Thank you in advance for your help.

I am attempting to script my own two-sample Kolmogorov-Smirnov test. To do this test I need to create empirical cdf values of the data. I am attempting to check my work against the Statistics Toolbox as I go along. I am going off the general definition that an empirical cdf is

p(x) = Pr{X<=x}

where p(x) is the empirical cdf, X is an arbitrary future value, x is the data, and Pr{} means probability. The following is my attempt at creating an empirical cdf function given some vector of data x:

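uniqueVals = unique(x);              % sorted distinct values of the data
nPts = numel(x);                     % total number of observations
cdf = zeros(size(uniqueVals));       % preallocate the result
for i = 1:numel(uniqueVals)
    % fraction of observations less than or equal to the i-th distinct value
    cdf(i) = sum(x <= uniqueVals(i)) / nPts;
end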

This matches what ecdf() generates except that I am one element short. ecdf() generates a zero value as the first element of its output vector F, and repeats the lowest value of x as the first element of its output vector X (using the function script notation), such that there are length(x)+1 elements. I am completely confused about why it does this. Can somebody explain why this first element is generated?
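
To make the discrepancy concrete, here is a minimal sketch comparing the hand-rolled result with ecdf() on a made-up sample (the data and the name Fmanual are illustrative only, and ecdf() requires the Statistics Toolbox). Note that with tied values, as in this sample, the output has one more element than the number of distinct values rather than length(x)+1.

```matlab
x = [3; 1; 4; 1; 5; 9; 2; 6];        % made-up sample containing one tie
uniqueVals = unique(x);              % sorted distinct values

% Hand-rolled empirical cdf, one value per distinct data point
Fmanual = arrayfun(@(v) sum(x <= v), uniqueVals) / numel(x);

[F, X] = ecdf(x);                    % toolbox result

% ecdf's X is the distinct values with the minimum repeated up front
sameX = isequal(X, [uniqueVals(1); uniqueVals]);

% ecdf's F is the hand-rolled values with a leading zero
% (compared with a tolerance, since the two are computed differently)
maxDiff = max(abs(F - [0; Fmanual]));
```

If the comparison holds, sameX is true and maxDiff is zero or within rounding error of it.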

Cheers,
Jerome
From: Jerome on
Here is a possible answer from a fellow student. He thinks that there are length(x)+1 elements because we want the ability to differentiate the cdf function to get back to our original data x since the cdf is really the integral of the data x. Since we lose a data point when we take the derivative ( cdf(i+1) - cdf(i) ), we don't want to lose an original data point, so ecdf() tacks on an additional element.

I'm not satisfied with this answer though because I don't get my first original data point back from the first derivative step cdf(2)-cdf(1) as this always equals zero with the way that ecdf() generates its result.

Still confused.

Jerome
From: Tom Lane on
> Here is a possible answer from a fellow student. He thinks that there are
> length(x)+1 elements because we want the ability to differentiate the cdf
> function to get back to our original data x since the cdf is really the
> integral of the data x. Since we lose a data point when we take the
> derivative ( cdf(i+1) - cdf(i) ), we don't want to lose an original data
> point, so ecdf() tacks on an additional element.
>
> I'm not satisfied with this answer though because I don't get my first
> original data point back from the first derivative step cdf(2)-cdf(1) as
> this always equals zero with the way that ecdf() generates its result.

Jerome, the answer is so simple as to be uninteresting. It's just done so
that the stairs function will plot it nicely. Plot the results with stairs,
and you'll see a picture representing the customary definition of the ecdf.
Plot them with the extra point removed, and you'll see it doesn't look
right.
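
A minimal sketch of that comparison, assuming a made-up sample and the Statistics Toolbox for ecdf() (the subplot layout and titles are only one way to set it up):

```matlab
x = [3; 1; 4; 1; 5; 9; 2; 6];        % made-up sample, for illustration only
[F, X] = ecdf(x);                    % F(1) is 0 and X(1) repeats min(x)

subplot(1, 2, 1)
stairs(X, F)                         % leading point included
title('ecdf output as returned')

subplot(1, 2, 2)
stairs(X(2:end), F(2:end))           % leading point removed
title('first point removed')
```

In the first panel the curve rises from 0 at the smallest data value. In the second panel the curve starts at the height of the first jump, so the step up from zero is missing.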

Your buddy was onto something, though. If you diff the first output, you get
the jump in the ecdf at each data point. This is just 1/n for a complete
sample of size n, but it's more involved when there is censoring. You're
right that diff-ing the second output won't get you much.
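
As a rough illustration of the diff idea for a complete (uncensored) sample, again with made-up data and assuming the Statistics Toolbox: each jump is the number of observations at that value divided by n, which reduces to 1/n when all values are distinct.

```matlab
x = [3; 1; 4; 1; 5; 9; 2; 6];        % made-up sample; the value 1 appears twice
n = numel(x);
[F, X] = ecdf(x);

jumps = diff(F);                     % jump in the ecdf at each distinct value
counts = round(jumps * n);           % observations at each value (rounded to absorb floating-point error)

% Pair each distinct value with its jump and its count
result = [X(2:end), jumps, counts];

dX = diff(X);                        % differences of the data values, not the data themselves
```

Here dX(1) is zero because the first two elements of X are identical, which is why diff-ing the second output does not recover the original data.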

-- Tom


From: Jerome on
"Tom Lane" <tlane(a)mathworks.com> wrote in message <ht98ah$qlc$1(a)fred.mathworks.com>...
> Your buddy was onto something, though. If you diff the first output, you get
> the jump in the ecdf at each data point. This is just 1/n for a complete
> sample of size n, but it's more involved when there is censoring. You're
> right that diff-ing the second output won't get you much.
>
> -- Tom
>
Thanks for your reply Tom,

It seems odd to generate additional data solely to accommodate a plotting function. Sorry to be critical, but this actually seems crazy. I am almost inclined to call it a bug, yet it is clearly intentional.

As for differentiating the ecdf, it seems like the first value would be undefined at 0/0 since the first two X values are identical as well. While I like the idea, it doesn't make sense to me given the output of ecdf(), although I'm probably missing something.

Cheers,
Jerome