From: Aaron Bramson on
Hello everybody and thank you,

This has been very helpful, and now the two-sided K-S test for Mathematica
is online for everybody to enjoy.

I have implemented the new code from Andy and from Ray on my data set, and
the code from Ray works out better for me. Though I don't have the skill to
decipher what that "ugly" code is doing, I've verified several results, so
I'm using those exact p-values. I'm going to build a table of the p-values
from these tests (which is then made into a plot over time, with the test
being performed on the individual-trial data streams of two cohorts at each
time step).
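
That tabulation step could be sketched roughly as follows (the names
cohortA, cohortB, and ksPValue are hypothetical stand-ins for however the
data and the chosen test function end up stored):

```mathematica
(* Hypothetical layout: cohortA[[t]] and cohortB[[t]] are the lists of *)
(* individual-trial values for the two cohorts at time step t;         *)
(* ksPValue stands for whichever two-sample K-S p-value function is    *)
(* used (e.g., the second element of Ray's result).                    *)
pTable = Table[{t, ksPValue[cohortA[[t]], cohortB[[t]]]},
   {t, Length[cohortA]}];

ListLinePlot[pTable, AxesLabel -> {"time step", "p-value"},
 PlotRange -> {0, 1}]
```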

I have one last question, or maybe it's a request. In Ray's code, if I put
in two data sets in which all the points have the same value (e.g. all
zero), the result is not a K-S statistic of 0 and a p-value of 1, but rather
{-\[Infinity], 0.}. That doesn't seem like the right answer (and in any
case not the answer that I expect or can use), so this input combination
doesn't work with how the technique calculates the stats. So I'd like to
request a small change to the code Ray provided so that if the inputs are
all identical the output is {0, 1} instead of {-\[Infinity], 0.}. I could do
this post facto with a replacement rule, but it would probably be better and
faster to do it in the original calculation. With THAT code, though, I don't
know where to make the appropriate changes.
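
For the record, the post-facto version would just be a one-line replacement
rule applied to whatever the test function returns (a sketch; fixDegenerate
is a made-up name):

```mathematica
(* Map the degenerate result onto {statistic, p-value} = {0, 1.}, *)
(* leaving any other result untouched.                            *)
fixDegenerate[result_] := result /. {-Infinity, 0.} -> {0, 1.}

fixDegenerate[{-Infinity, 0.}]  (* gives {0, 1.} *)
```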

Again, thanks everybody for your help.

Best,
Aaron


p.s. I may end up using the Kuiper test and I might therefore have a similar
question about implementing that in Mathematica very soon.
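
If it helps with the p.s.: the two-sample Kuiper statistic is V = D+ + D-,
the sum of the largest positive and largest negative differences between the
two empirical CDFs, so computing the statistic (though not its p-value) is
only a small variation on the K-S code in this thread. A sketch, reusing the
empiricalCDF helper from Andy's post:

```mathematica
empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]

(* Two-sample Kuiper statistic: V = D+ + D-, evaluated at every *)
(* distinct pooled value. Both maxima are >= 0 because the two  *)
(* empirical CDFs agree (equal 1) at the largest pooled value.  *)
Kuiper2Sample[data1_, data2_] :=
 Block[{udat = Union[Flatten[{data1, data2}]], e1, e2, d},
  e1 = empiricalCDF[data1, #] & /@ udat;
  e2 = empiricalCDF[data2, #] & /@ udat;
  d = e1 - e2;
  N[Max[d] + Max[-d]]]
```

The same permutation/bootstrap scheme (RandomSample over the pooled data)
should carry over for estimating its p-value.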



On Fri, Jul 23, 2010 at 7:09 AM, Andy Ross <andyr(a)wolfram.com> wrote:

> Andy Ross wrote:
> > Bill Rowe wrote:
> >> On 7/20/10 at 3:41 AM, darreng(a)wolfram.com (Darren Glosemeyer) wrote:
> >>
> >>> Here is some code written by Andy Ross at Wolfram for the two
> >>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
> >>> the test statistic, and KSBootstrapPValue provides a bootstrap
> >>> estimate of the p-value given the two data sets, the number of
> >>> simulations for the estimate and the test statistic.
> >>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
> >>> &]]/Length[data]
> >>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
> >>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> >>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> >>> n2 = Length[data2], T},
> >>> e1 = empiricalCDF[sd1, #] & /@ udat;
> >>> e2 = empiricalCDF[sd2, #] & /@ udat;
> >>> T = Max[Abs[e1 - e2]];
> >>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >>> ]
> >> After looking at your code above I realized I posted a very bad
> >> solution to this problem. But, it looks to me like there is a
> >> problem with this code. The returned result
> >>
> >> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >>
> >> seems to have an extra factor in it. Specifically 1/Sqrt[n1].
> >> Since n1 is the number of samples in the first data set,
> >> including this factor means you will get a different result by
> >> interchanging the order of the arguments to the function when
> >> the number of samples in each data set is different. Since the
> >> KS statistic is based on the maximum difference between the
> >> empirical CDFs, the order in which the data sets are used in the
> >> function should not matter.
> >>
> >
> > You are absolutely correct. The factor should be removed. I believe it
> > is a remnant of an incomplete copy and paste.
> >
> > -Andy
>
> I've corrected the error in my code from before. The p-value
> computation was giving low estimates because I was using RandomChoice
> rather than RandomSample. I believe this should do the job (though
> rather slowly).
>
> empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]
>
> splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}
>
> KolmogorovSmirnov2Sample[data1_, data2_] :=
> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> n2 = Length[data2], T},
> e1 = empiricalCDF[sd1, #] & /@ udat;
> e2 = empiricalCDF[sd2, #] & /@ udat;
> T = Max[Abs[e1 - e2]];
> (Sqrt[(n1*n2)/(n1 + n2)]) T // N
> ]
>
> KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
> Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
> dfts = ConstantArray[0, MCSamp];
> Do[
> dfts[[i]] =
> KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1]
> , {i, MCSamp}
> ];
> Length[Select[dfts, # >= T &]]/MCSamp // N
> ]
>
> Example:
>
> data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
> -0.68592, -0.0461418, -0.29906, 0.209381};
>
> data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
> -0.0812646, -1.1592, 0.138245, -0.314907};
>
> In[41]:= KolmogorovSmirnov2Sample[data1, data2]
>
> Out[41]= 0.67082
>
> Using 1000 bootstrap samples...
>
> In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]
>
> Out[42]= 0.791
>
> -Andy
>
From: Ray Koopman on
ks2a[y1_, y2_] := Block[{n1 = Length@y1, n2 = Length@y2,
  pool = Sort@Join[y1, y2], x, n, u},
 If[Equal@@pool, {0, 1.},
  (* x = n1*n2*D, where D is the max difference between the two *)
  (* empirical CDFs, checked at each distinct pooled value      *)
  {x = Max@Abs[n2*Tr@UnitStep[y1-#] - n1*Tr@UnitStep[y2-#]& /@
      Rest@Union@pool],
   n = n1 + n2; u = Table[0, {n2+1}];
   (* count, by dynamic programming over lattice paths, the orderings *)
   (* of the pooled data whose running deviation stays below x        *)
   Do[Which[
     i+j == 0, u[[j+1]] = 1,
     i+j < n && pool[[i+j]] < pool[[i+j+1]] && Abs[n2*i - n1*j] >= x,
      u[[j+1]] = 0,
     i == 0, u[[j+1]] = u[[j]],
     j > 0, u[[j+1]] += u[[j]]], {i, 0, n1}, {j, 0, n2}];
   (* exact permutation p-value: 1 minus the fraction of orderings *)
   (* that never reach a deviation as large as the observed one    *)
   N[1 - Last@u/Multinomial[n1, n2]]}]]

ks2a[{1,1,1},{1,1,1,1}]

{0,1.}

----- Aaron Bramson <aaronbramson(a)gmail.com> wrote: