Kolmogorov-Smirnov 2-sample test [Mathematica]

Prev: A ODE I need to solve
Next: Problems with Workbench Debugger Breakpoints

From: Bill Rowe on 21 Jul 2010 07:11

On 7/20/10 at 3:41 AM, darreng(a)wolfram.com (Darren Glosemeyer) wrote:

>Here is some code written by Andy Ross at Wolfram for the two
>sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>the test statistic, and KSBootstrapPValue provides a bootstrap
>estimate of the p-value given the two data sets, the number of
>simulations for the estimate and the test statistic.

>In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
>&]]/Length[data]

>In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>n2 = Length[data2], T},
>e1 = empiricalCDF[sd1, #] & /@ udat;
>e2 = empiricalCDF[sd2, #] & /@ udat;
>T = Max[Abs[e1 - e2]];
>(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>]

After looking at your code above I realized I posted a very bad
solution to this problem. But, it looks to me like there is a
problem with this code. The returned result

(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T

seems to have a extra factor in it. Specifically 1/Sqrt[n1].
Since n1 is the number of samples in the first data set,
including this factor means you will get a different result by
interchanging the order of the arguments to the function when
the number of samples in each data set is different. Since the
KS statistic is based on the maximum difference between the
empirical CDFs, the order in which the data sets are used in the
function should not matter.

From: Andy Ross on 22 Jul 2010 05:42

Bill Rowe wrote:
> On 7/20/10 at 3:41 AM, darreng(a)wolfram.com (Darren Glosemeyer) wrote:
>
>> Here is some code written by Andy Ross at Wolfram for the two
>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>> the test statistic, and KSBootstrapPValue provides a bootstrap
>> estimate of the p-value given the two data sets, the number of
>> simulations for the estimate and the test statistic.
>
>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
>> &]]/Length[data]
>
>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>> n2 = Length[data2], T},
>> e1 = empiricalCDF[sd1, #] & /@ udat;
>> e2 = empiricalCDF[sd2, #] & /@ udat;
>> T = Max[Abs[e1 - e2]];
>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>> ]
>
> After looking at your code above I realized I posted a very bad
> solution to this problem. But, it looks to me like there is a
> problem with this code. The returned result
>
> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>
> seems to have a extra factor in it. Specifically 1/Sqrt[n1].
> Since n1 is the number of samples in the first data set,
> including this factor means you will get a different result by
> interchanging the order of the arguments to the function when
> the number of samples in each data set is different. Since the
> KS statistic is based on the maximum difference between the
> empirical CDFs, the order in which the data sets are used in the
> function should not matter.
>

You are absolutely correct. The factor should be removed. I believe it
is a remnant of an incomplete copy and paste.

-Andy

From: thomas on 22 Jul 2010 05:44

For what it's worth, here is the code I am using for the KS test. I
got the algorithm from R, and then translated it into Mathematica. I
do get different p-values than in the code by Darren, though. Maybe
the aficionados out there can figure out where the mistake might be.
Note: "data" should contain the two lists, for example

(* create two data sets *)
data1 = RandomReal[NormalDistribution[], 15];
data2 = RandomReal[NormalDistribution[], 15];
data={data1,data2};

kolmogorovTest[data_, modus_: "two-sided"] :=
Module[{untiedData, z, testStatistic, p},

untiedData = data;
With[{badguys = Select[#, #[[2]] > 1 &] & /@ Tally /@ untiedData},
Do[(untiedData[[i,
Flatten[Position[untiedData[[i]], #[[1]]]]]] = #[[1]] + (
Range[0, #[[2]] - 1] - (#[[2]] - 1)/2)/(10 #[[2]])) & /@
badguys[[i]], {i, 2}];
];(*replaces ties (same values, e.g. {x,x,x}) with {x-1/30,x,x+1/
30}*)

z = Accumulate[(1 - #)/Length(a)untiedData[[1]] + -#/
Length(a)untiedData[[2]] &@
UnitStep[Ordering[Join @@ untiedData] - Length(a)untiedData[[1]]]];

p = Which[
modus == "two-sided",
N(a)With[{u =
Max[Abs[z]] Sqrt[Times @@ Length /@ data/
Plus @@ Length /@ data]},
\[Piecewise] {
{1, u < .2},
{1 -
Sqrt[2 Pi]/
u (E^(-((25 \[Pi]^2)/(8 u^2))) +
E^(-((9 \[Pi]^2)/(8 u^2))) + E^(-(\[Pi]^2/(8 u^2)))),
u < .755},
{Total[
Take[{2 E^(-2 u^2), -2 E^(-8 u^2),
2 E^(-18 u^2), -2 E^(-32 u^2)}, Max[1, Round[3/u]]]],
u < 6.8116},
{0, u >= 6.8116}
}
],
modus == "greater",
N(a)Exp[-2 Max[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data],
modus == "less",
N(a)Exp[-2 Min[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data],
True,
Return["Wrong modus (has to be \"two-sided\" (default), \"greater\
\", or \"less\")."]];

Return[p]
]

On Jul 21, 1:11 pm, Bill Rowe <readn...(a)sbcglobal.net> wrote:
> On 7/20/10 at 3:41 AM, darr...(a)wolfram.com (Darren Glosemeyer) wrote:
>
>
>
>
>
> >Here is some code written by Andy Ross at Wolfram for the two
> >sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
> >the test statistic, and KSBootstrapPValue provides a bootstrap
> >estimate of the p-value given the two data sets, the number of
> >simulations for the estimate and the test statistic.
> >In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
> >&]]/Length[data]
> >In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
> >Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> >udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> >n2 = Length[data2], T},
> >e1 = empiricalCDF[sd1, #] & /@ udat;
> >e2 = empiricalCDF[sd2, #] & /@ udat;
> >T = Max[Abs[e1 - e2]];
> >(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >]
>
> After looking at your code above I realized I posted a very bad
> solution to this problem. But, it looks to me like there is a
> problem with this code. The returned result
>
> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>
> seems to have a extra factor in it. Specifically 1/Sqrt[n1].
> Since n1 is the number of samples in the first data set,
> including this factor means you will get a different result by
> interchanging the order of the arguments to the function when
> the number of samples in each data set is different. Since the
> KS statistic is based on the maximum difference between the
> empirical CDFs, the order in which the data sets are used in the
> function should not matter.

From: Andy Ross on 23 Jul 2010 07:09

Andy Ross wrote:
> Bill Rowe wrote:
>> On 7/20/10 at 3:41 AM, darreng(a)wolfram.com (Darren Glosemeyer) wrote:
>>
>>> Here is some code written by Andy Ross at Wolfram for the two
>>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>>> the test statistic, and KSBootstrapPValue provides a bootstrap
>>> estimate of the p-value given the two data sets, the number of
>>> simulations for the estimate and the test statistic.
>>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
>>> &]]/Length[data]
>>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>>> n2 = Length[data2], T},
>>> e1 = empiricalCDF[sd1, #] & /@ udat;
>>> e2 = empiricalCDF[sd2, #] & /@ udat;
>>> T = Max[Abs[e1 - e2]];
>>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>>> ]
>> After looking at your code above I realized I posted a very bad
>> solution to this problem. But, it looks to me like there is a
>> problem with this code. The returned result
>>
>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>>
>> seems to have a extra factor in it. Specifically 1/Sqrt[n1].
>> Since n1 is the number of samples in the first data set,
>> including this factor means you will get a different result by
>> interchanging the order of the arguments to the function when
>> the number of samples in each data set is different. Since the
>> KS statistic is based on the maximum difference between the
>> empirical CDFs, the order in which the data sets are used in the
>> function should not matter.
>>
>
> You are absolutely correct. The factor should be removed. I believe it
> is a remnant of an incomplete copy and paste.
>
> -Andy

I've corrected the error in my code from before. The p-value
computation was giving low estimates because I was using RandomChoice
rather than RandomSample. I believe this should do the job (though
rather slowly).

empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]

splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}

KolmogorovSmirnov2Sample[data1_, data2_] :=
Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
n2 = Length[data2], T},
e1 = empiricalCDF[sd1, #] & /@ udat;
e2 = empiricalCDF[sd2, #] & /@ udat;
T = Max[Abs[e1 - e2]];
(Sqrt[(n1*n2)/(n1 + n2)]) T // N
]

KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
dfts = ConstantArray[0, MCSamp];
Do[
dfts[[i]] =
KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1]
, {i, MCSamp}
];
Length[Select[dfts, # >= T &]]/MCSamp // N
]

Example:

data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
-0.68592, -0.0461418, -0.29906, 0.209381};

data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
-0.0812646, -1.1592, 0.138245, -0.314907};

In[41]:= KolmogorovSmirnov2Sample[data1, data2]

Out[41]= 0.67082

Using 1000 bootstrap samples...

In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]

Out[42]= 0.791

-Andy

From: Ray Koopman on 24 Jul 2010 05:07

On Jul 23, 4:09 am, Andy Ross <an...(a)wolfram.com> wrote:
>
> I've corrected the error in my code from before. The p-value
> computation was giving low estimates because I was using RandomChoice
> rather than RandomSample. I believe this should do the job (though
> rather slowly).
>
> empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]
>
> splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}
>
> KolmogorovSmirnov2Sample[data1_, data2_] :=
> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> n2 = Length[data2], T},
> e1 = empiricalCDF[sd1, #] & /@ udat;
> e2 = empiricalCDF[sd2, #] & /@ udat;
> T = Max[Abs[e1 - e2]];
> (Sqrt[(n1*n2)/(n1 + n2)]) T // N
> ]
>
> KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
> Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
> dfts = ConstantArray[0, MCSamp];
> Do[
> dfts[[i]] =
> KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1]
> , {i, MCSamp}
> ];
> Length[Select[dfts, # >= T &]]/MCSamp // N
> ]
>
> Example:
>
> data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
> -0.68592, -0.0461418, -0.29906, 0.209381};
>
> data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
> -0.0812646, -1.1592, 0.138245, -0.314907};
>
> In[41]:= KolmogorovSmirnov2Sample[data1, data2]
>
> Out[41]= 0.67082
>
> Using 1000 bootstrap samples...
>
> In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]
>
> Out[42]= 0.791
>
> -Andy

The ugly procedural code in ks2 returns the exact p-value very
quickly:

AbsoluteTiming(a)ks2[data1,data2]

{0.005842 Second,{30,0.78693}}

First | Prev | Next | Last
Pages: 1 2 3
Prev: A ODE I need to solve
Next: Problems with Workbench Debugger Breakpoints