stats/probability question [Math]

Prev: (1) + (1+1/4) + (1+1/4+1/9) + ...= gamma(-1) ?
Next: Transform -- the game

From: Rob Johnson on 19 Jul 2010 06:01

In article <Y5b0o.12935$OU6.12052(a)newsfe20.iad>,
"Michael Robinson" <kellrobinson(a)yahoo.com> wrote:
>"Ray Vickson" <RGVickson(a)shaw.ca> wrote in message
>news:230f2988-b158-4cd4-8219-7be34e488f20(a)k1g2000prl.googlegroups.com...
>>On Jul 16, 2:29 pm, Robert Israel
>><isr...(a)math.MyUniversitysInitials.ca> wrote:
>>> Ray Vickson <RGVick...(a)shaw.ca> writes:
>>> > In this case you are told that exactly 550 students support candidate
>>> > A. Now, if 100 of the 1000 show up, AND IF THE SELECTION OF THE 100 IS
>>> > RANDOM, then the number (in 100) voting for A has the so-called
>>> > *hypergeometric distribution*. In general, in a population of size N
>>> > with N1 of type 1 and N2 of type 2 (N1 +N2 = N), for a random sample
>>> > of size n the number X of type 1 in the sample is hypergeometric: Pr{X
>>> > = k} = C(N1,k)*C(N2,n-k)/C(N,n), where C(a,b) = binomial
>>> > coefficient
>>> > "a choose b" = a!/[b!*(a-b)!]. For N1 = 550, N2 = 450 and n =
>>> > 100
>>> > w>>> > e
>>> > have P(k) = Pr{k suppport A} = C(550,k)*C(450,100-k)/C(1000,100),
>>> > and
>>> > you want to compute sum[P(k),k=0.. 49]. The book wants you to
>>> > simulate, but direct computation is easier, especially if you use the
>>> > binomial approximation to the hypergeometric (which should be OK
>>> > because n = 100 is small compared with N = 1000 and the point of
>>> > interest (k = 49) is near the middle of the range 0..100). The
>>> > binomial would be exact for "sampling with replacement", where we
>>> > select 100 students randomly, one-by-one, so the same student can, by
>>> > chance, be selected more than once. Since there are 1000 students and
>>> > we are just selecting 100 there is not much chance of having a
>>> > "duplicate" in the sample,
>>>
>>> On the contrary, the probability of having at least one duplicate in the
>>> sample is very high: 1 - (1000!/900!)/1000^100 = .9940410734
>>> approximately.
>>> But there are probably not very many duplicates, so the binomial
>>> approximation
>>> is not
>>> too bad (still, it's not very good, as noted in my previous posting).
>>> --
>>> Robert Israel isr...(a)math.MyUniversitysInitials.ca
>>> Department of Mathematics http://www.math.ubc.ca/~israel
>>> University of British Columbia Vancouver, BC, Canada
>>
>>Well, here are results for various "large" N, showing Pr{most popular
>>candidate loses} = Pr{Votes <= 49} for the hypergeometric and binomial
>>cases (from Maple 9.5):
>>N hypergeom binomial
>>1000 1.220852e-01 1.345762e-01
>>1500 1.263773e-01 1.345762e-01
>>2000 1.284742e-01 1.345762e-01
>>2500 1.297170e-01 1.345762e-01
>>3000 1.305392e-01 1.345762e-01
>>3500 1.311235e-01 1.345762e-01
>>4000 1.315600e-01 1.345762e-01
>>
>>When N is large enough that both .55*N and .45*N are, say, more than
>>10 times as large as the sample size n = 100, the binomial and
>>hypergeometric cases are the same to about two decimal places.
>>
>>R.G. Vickson
>
>The relative error is on the order of sample divided by population.
>E.g., for population 1000:
>0.1346-0.1221=0.125
>(10/1000)(0.1221) = .0122
>It gets more accurate with bigger numbers.

Indeed. To make this precise, we can write the summand as

C(9n/20,k) C(11n/20,100-k)
--------------------------
C(n,100)

= (9n/20)(9n/20 - 1)(9n/20 - 2)...(9n/20 - (k-1))

* (11n/20)(11n/20 - 1)(11n/20 - 2)...(11n/20 - (99-k))

/ (n)(n-1)(n-2)...(n-99)

* C(100,k)

= (9/20)(9/20 - 1/n)(9/20 - 2/n)...(9/20 - (k-1)/n)

* (11/20)(11/20 - 1/n)(11n/20 - 2/n)...(11n/20 - (99-k)/n)

/ (1)(1-1/n)(1-2/n)...(1-99/n)

* C(100,k) [1]

Examining [1], we can see why, in the limit as n tends to infinity,
the hypergeometric term approaches the binomial term

(9/20)^k (11/20)^{100-k} C(100,k) [2]

We can use Newton's Identities

<http://en.wikipedia.org/wiki/Newton%27s_identities>

to get an asymptotic expansion for [1]. That is, writing the three
extended product terms asymptotically in 1/n, we get

(9/20)(9/20 - 1/n)(9/20 - 2/n)...(9/20 - (k-1)/n)

= (9/20)^k

( 1 - (20/9) (k^2 - k)/2 1/n

+ (20/9)^2 (3k^4 - 10k^3 + 9k^2 - 2k)/24 1/n^2

- (20/9)^3 (k^6 - 7k^5 + 17k^4 - 17k^3 + 6k^2)/48 1/n^3

+ O(1/n^4) ) [3]

(11/20)(11/20 - 1/n)(11n/20 - 2/n)...(11n/20 - (99-k)/n)

= (11/20)^{100-k}

( 1 - (20/11) (k^2 - 199k + 9900)/2 1/n

+ (20/11)^2 (3k^4 - 1190k^3 + 177009k^2 - 11701798k

+ 290089800)/24 1/n^2

- (20/11)^3 (k^6 - 593k^5 + 146517k^4 - 19306783k^3

+ 1431014906k^2 - 56567491200k

+ 931683060000)/48 1/n^3

+ O(1/n^4) ) [4]

(1)(1-1/n)(1-2/n)...(1-99/n)

= (1 - 4950 1/n + 12087075 1/n^2 - 19410063750 1/n^3

+ O(1/n^4) ) [5]

Multiplying C(100,k) by [3] x [4] / [5], and summing from k = 51 to
k = 100, we get

.13457621318805238678 - 11.928556575663902106 1/n

- 539.65341919938817150 1/n^2 - 22350.705078834868693 1/n^3

+ O(1/n^4) [6]

Using [6] to estimate the same values of n as in Ray Vickson's
table, we get

n asymptotic actual
1000 .1220856524881102616 .12208522168818469976
1500 .1263773737423866988 .12637728404583215165
2000 .1284742277072857343 .12847419866997887870
2500 .1297170155655898784 .12971700351911424587
3000 .1305392383679171231 .13053923251100078796
3500 .1311234795268475006 .13112347634755554495
4000 .1315599964756695927 .13155999460430289938

Rob Johnson <rob(a)trash.whim.org>
take out the trash before replying
to view any ASCII art, display article in a monospaced font

First | Prev |
Pages: 1 2 3
Prev: (1) + (1+1/4) + (1+1/4+1/9) + ...= gamma(-1) ?
Next: Transform -- the game