From: Viktor Martyanov on
Hello,

I am using kstest2 to compare the expected and observed distributions for two different samples.

For sample1, these are 10 expected and observed values:

exp1 = [52091;52183;51507;50879;50765;49574;49197;48476;47754;48004];
obs1 = [2;6;35;22;19;4;2;2;0;0];

For sample2, these are 10 expected and observed values:

exp2 = [47419;41725;36808;32752;29025;25401;22519;20004;17746;15639];
obs2 = [0;2;1;0;0;1;2;0;0;0];

While exp1 and exp2 are very similar, obs1 and obs2 seem to be very different. However, when I use kstest2, I am getting the same p-value of 1.9E-05 for both comparisons. What am I missing here?

Thank you,

Viktor
From: Peter Perkins on
Viktor, perhaps the first thing you're missing is that KSTEST2 is for two-sample problems. You have two samples, but based on your first sentence, you are apparently doing two one-sample tests. It's not clear where your "expected values" come from, but based solely on your description, it seems that exp1 and obs1 cannot be described as independent random samples. If you are testing a random sample against a known distribution, you should be using KSTEST.
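
To make the distinction concrete, here is a minimal sketch of the two calls side by side. It assumes the Statistics Toolbox is installed; the samples x and y are synthetic placeholders (hypothetical, not the data from this thread), and the reference distribution for the one-sample test is simply KSTEST's default, the standard normal.

```matlab
% Sketch only: requires the Statistics Toolbox. The data here are synthetic
% placeholders (hypothetical), not the values from this thread.
rng(0);                          % fix the seed so the run is repeatable
x = randn(100, 1);               % sample to be tested
y = 0.5 + randn(120, 1);         % a second, independent sample

% One-sample test: is x a random sample from a fully specified distribution?
% With no further arguments, KSTEST compares against the standard normal.
[hOne, pOne, statOne] = kstest(x);

% Two-sample test: do x and y come from the same (unspecified) distribution?
[hTwo, pTwo, statTwo] = kstest2(x, y);

fprintf('kstest : h = %d, p = %.3g, D = %.3f\n', hOne, pOne, statOne);
fprintf('kstest2: h = %d, p = %.3g, D = %.3f\n', hTwo, pTwo, statTwo);
```

The third output of each function is the K-S statistic itself, which is usually worth inspecting alongside the p-value.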

That being said, the real issue in using KSTEST2 incorrectly is that the p-values will not be correct. If you are using the p-values simply as a ranking, and not interpreting them as real p-values, i.e., as tail probabilities under the null hypothesis, then you might be OK. Might.

As for your real question, look at the K-S statistic. It's 1 in both cases, the largest it can possibly be. Thus, the "p-values" are identical. That's not surprising: I doubt you'd need a statistical test to decide that obs1 is almost certainly not a random sample from a distribution that could have generated exp1.
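
The point can be checked by hand without any toolbox. The sketch below computes the two-sample K-S statistic as the largest gap between the two empirical CDFs, using the ten-value vectors from the original post. The helper names ecdfAt and ksStat are made up for this illustration, and the p-value line uses a standard asymptotic approximation; that kstest2 uses this or a similar formula is an assumption here, not something stated in the thread.

```matlab
exp1 = [52091;52183;51507;50879;50765;49574;49197;48476;47754;48004];
obs1 = [2;6;35;22;19;4;2;2;0;0];
exp2 = [47419;41725;36808;32752;29025;25401;22519;20004;17746;15639];
obs2 = [0;2;1;0;0;1;2;0;0;0];

% Empirical CDF of sample s evaluated at each point of t (hypothetical helper).
ecdfAt = @(s, t) mean(bsxfun(@le, s(:), t(:)'), 1);

% Two-sample K-S statistic: largest ECDF gap over the pooled data points.
ksStat = @(x, y) max(abs(ecdfAt(x, [x; y]) - ecdfAt(y, [x; y])));

d1 = ksStat(obs1, exp1);   % 1: every obs1 value lies below every exp1 value
d2 = ksStat(obs2, exp2);   % 1: same complete separation

% Asymptotic p-value (assumed approximation). It depends only on the
% statistic and the two sample sizes, so d1 == d2 gives identical p-values.
n      = 10 * 10 / (10 + 10);
lambda = (sqrt(n) + 0.12 + 0.11 / sqrt(n)) * d1;
j      = (1:101)';
p      = 2 * sum((-1).^(j-1) .* exp(-2 * lambda^2 * j.^2));

fprintf('d1 = %g, d2 = %g, p = %.2g\n', d1, d2, p);   % p is about 1.9e-05
```

Once the two samples in a pair do not overlap at all, the statistic is pinned at 1 regardless of how the values are spread within each sample, so the test cannot distinguish the two comparisons.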

Hope this helps.
From: Viktor Martyanov on
Hi Peter,

Thank you for the reply.

I will try to clarify some of the points here and will also provide the complete datasets.

Here are four columns of data corresponding to four different distributions of values:

Sample A Sample B Sample C Sample D
2 52091 0 47419
6 52183 2 41725
35 51507 1 36808
22 50879 0 32752
19 50765 0 29025
4 49574 1 25401
2 49197 2 22519
2 48476 0 20004
0 47754 0 17746
0 48004 0 15639
0 47192 0 13674
0 46094 0 12097
0 45499 1 10787
0 44634 0 9586
0 44143 0 8395
0 43596 0 7249
0 43086 0 6743
0 42169 0 5785
0 41685 2 5169
0 41435 1 4631
0 40261 0 4040
0 40032 0 3646
0 38346 0 3244
0 38194 1 3015
0 38257 0 2600
1 37326 0 2199
0 36876 0 2039
1 36498 0 1717
0 35987 3 1558
0 34711 0 1355
0 34492 0 1325
0 33952 0 1187
0 33629 0 1041
0 32559 0 949
0 32029 0 823
0 31248 1 757
0 30820 0 712
0 30441 0 617
0 29618 1 548
0 29357 0 531
0 28564 0 471
0 27729 0 442
0 27216 0 416
0 26795 0 356
0 26105 0 302
0 25631 1 268
0 24825 0 266
0 24801 2 236
0 23558 0 239
0 23039 0 216
0 22522 0 225
0 21739 1 188
0 21113 0 210
0 20519 0 183
0 19938 0 138
0 19300 0 143
0 19240 0 134
0 18182 0 118
0 17331 0 112
0 16909 0 110
0 16550 0 125
0 15774 0 108
0 15094 0 107
0 14830 0 81
0 13965 0 92
0 13505 0 76
0 12873 0 71
0 12024 0 79
0 11446 0 62
0 10777 0 78
0 10168 0 69
0 9611 0 73
0 9246 0 52
0 8443 0 53
0 7889 0 68
0 7270 0 65
0 6582 0 52
0 6461 0 58
0 3062 0 22
0 0 0 0

For each pair (Sample A and Sample B; Sample C and Sample D) I want to compare the two respective samples to each other and find out if they have the same distribution.

Using kstest2 and doing

[h, p] = kstest2(A, B);
[h, p] = kstest2(C, D);

I am getting the same p-value of 8.6E-36 in both cases. So I think I am using kstest2 to compare two different population samples, which seem to have two different distributions, and I am still not sure why the p-value is the same.

Thank you,

Viktor