From: gearhead on
I'm an engineering undergrad in an intro stats course. We had a
question in the book that's really dumb.

problem as stated:

Your candidate has 55% of the votes in the entire school. But only
100 students will show up to vote. What is the probability that the
underdog (the one with 45% support) will win? To find out, set up a
simulation.
a) Describe how you will simulate a component and its outcomes.
b) Describe how you will simulate a trial.
c) Describe the response variable.

The answer in the back of the book says using a two digit random
number to determine each vote (00-54 for your candidate, 55-99 for the
underdog) you would run a string of trials with 100 votes to each
trial.
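
For concreteness, the answer key's procedure could be sketched in Python roughly as follows. This is only an illustration: the function names, the seed and the number of trials are my own assumptions, not the book's. Each vote is an independent two-digit draw, and a trial counts as an underdog win only if he gets more than half of the 100 votes (a 50-50 tie is not a win).

```python
import random

def simulate_trial(rng, n_votes=100):
    """One trial: draw a two-digit number 00-99 per vote.

    00-54 is a vote for your candidate, 55-99 a vote for the underdog.
    Returns the number of underdog votes in the trial.
    """
    return sum(1 for _ in range(n_votes) if rng.randrange(100) >= 55)

def estimate_underdog_win(n_trials=10_000, n_votes=100, seed=0):
    """Fraction of trials in which the underdog gets a strict majority."""
    rng = random.Random(seed)  # seed is arbitrary, only for repeatability
    wins = sum(
        1 for _ in range(n_trials)
        if simulate_trial(rng, n_votes) > n_votes / 2
    )
    return wins / n_trials

if __name__ == "__main__":
    print(estimate_underdog_win())
```

Note that nothing in this procedure depends on the size of the school.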

Now, this is one misconceived exercise. Let me explain why.

Say the school has 1000 students. If all of them show up, the
underdog has 0% chance of winning. If exactly one voter shows up,
underdog has 45% chance of winning. In an election where 100 voters
show up, underdog's chance of winning the election HAS to lie
somewhere between 0% and 45%. No ifs, ands or buts.
The probability of a win for underdog can never exceed 45%. When the
exercise asks "how often will the underdog win" I interpret that as
meaning what are his chances, i.e., the probability that he will win.
But if you run a simulation, you can get anything, including results
above 45%. I don't think simulating has any validity here, at least
the procedure suggested in the answer key. That is a lot of
simulating to do by hand, 100 per trial, but it is nowhere close to
even starting to answer the actual question. You would first of all
have to know the population of the school and then do some very
demanding simulations that would only be practical on a computer.

But leaving practical considerations aside, that question is
meaningless without knowing something about the magnitude of the
school population.
Consider: if the total population is 108, the underdog cannot win,
because he only has 49 (48.6 rounded up) supporters total. Chance of
winning 0%. Period. "Underdog" has NO CHANCE of winning the
election. But if you run a simulation the way the book suggests, he's
going to win some.
I'm saying the book is wrong.
Back to our school of 1000 students, out of whom 450 would vote for
"underdog." If only 100 students vote, what are his chances of
winning? Simulation will send you on the wrong track here unless
you're ready for some head scratching and a big grind on the computer,
but I'm sure this problem has a pretty simple theoretical solution.
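
For what it's worth, the finite-school version is not much of a grind in code. Here is a rough sketch, assuming the 100 voters are a simple random sample drawn without replacement from the school; the function name, parameter names and defaults are made up for illustration.

```python
import random

def simulate_finite_election(pop_size=1000, underdog_supporters=450,
                             turnout=100, n_trials=20_000, seed=0):
    """Estimate the underdog's chance of a strict majority among the voters.

    Assumes the voters are a uniformly random subset of the school,
    chosen without replacement.
    """
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    # 1 = underdog supporter, 0 = supporter of the other candidate
    population = [1] * underdog_supporters + [0] * (pop_size - underdog_supporters)
    wins = 0
    for _ in range(n_trials):
        voters = rng.sample(population, turnout)  # sampling without replacement
        if sum(voters) > turnout / 2:
            wins += 1
    return wins / n_trials

if __name__ == "__main__":
    print(simulate_finite_election())              # school of 1000
    print(simulate_finite_election(108, 49, 100))  # school of 108
```

In the school of 108 the estimate is exactly 0, because at most 49 underdog supporters can ever appear among the 100 voters.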
From: porky_pig_jr on
On Jul 16, 1:24 pm, gearhead <nos...(a)billburg.com> wrote:
>
> But leaving practical considerations aside, that question is
> meaningless without knowing something about the magnitude of the
> school population.

Yes.

> Back to our school of 1000 students, out of whom 450 would vote for
> "underdog."  If only 100 students vote, what are his chances of
> winning?   Simulation will send you on the wrong track here unless
> you're ready for some head scratching and a big grind on the computer,
> but I'm sure this problem has a pretty simple theoretical solution.

Let those voting for non-underdog be A-students, those voting for
underdog are B-students. You want to choose 100 out of 1000 so that
the number of B-students is at least 51.

Let a "win" be choosing a B-student and a "loss" be choosing an A-student.
p, the probability of a win in a single trial, is 0.45; q, the probability
of a loss in a single trial, is 1 - p = 0.55. The probability of any one
particular sequence with exactly 51 wins in 100 trials is
(0.45^51) * (0.55^49). The number of such sequences is "100 choose 51",
or 100 C 51. So the probability of choosing exactly 51 out of 100 is
(100 C 51) (0.45^51) (0.55^49). This is the binomial distribution; look it
up on the web or in any elementary probability textbook for more details.

Now choosing at least 51 is choosing 51 or 52 or ... or 100. Those are
mutually exclusive events, and so to compute probability of choosing
at least 51, add probability of choosing 51 + probability of choosing
52 + ... + probability of choosing 100.

So, as you can see, it does take a while to compute, even with a
calculator. Of course, you can write a very simple program, no big
deal. In any case, notice that computing (100 C 51), (100 C 52), etc.
involves factorials, so you may get very large numbers and integer
overflow. And, of course, without knowing the total population we
can't determine the value of p, so, yes, the problem, as it's stated,
lacks some key information.
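
A sketch of such a "very simple program" in Python, assuming p = 0.45 as above (the function name and defaults are only illustrative). Python integers have arbitrary precision and math.comb (Python 3.8+) computes the binomial coefficient directly, so overflow is not a concern in this language.

```python
from math import comb

def binomial_tail(n=100, p=0.45, k_min=51):
    """P(at least k_min wins in n independent trials, each won with probability p)."""
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(k_min, n + 1)
    )

if __name__ == "__main__":
    print(binomial_tail())
```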
From: porky_pig_jr on
On Jul 16, 1:53 pm, "porky_pig...(a)my-deja.com" <porky_pig...(a)my-deja.com> wrote:
> On Jul 16, 1:24 pm, gearhead <nos...(a)billburg.com> wrote:
>
>
>
> > But leaving practical considerations aside, that question is
> > meaningless without knowing something about the magnitude of the
> > school population.
>
> Yes.
>

Well, scratch the rest out. I was too quick. And wrong. Notice that my
solution didn't even take into account the total population.

I think, the correct solution goes like this. We have 550 A-students,
450 B-students. Say, we want to select exactly 49 A-students and 51 B-
students. This is hypergeometric distribution: choosing without
replacement. Now we can choose 49 out of 550 A-students in (550 C 49)
different ways, we can choose 51 out of B-students in (450 C 51)
different ways. And we can choose any 100 out of 1000 students in
(1000 C 100) different ways. So choosing exactly 51 B-students (and 49
A-students) out of total of 550 A-students and 450 B-students (and
order does not matter) has a probability

(550 C 49) * (450 C 51)
--------------------------
1000 C 100

And that was just for exactly 51 B-students. Now you have to compute
the same for 52, 53, ... 100 B-students.
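
The whole sum can be done exactly with integer arithmetic. Below is a minimal sketch, assuming the 550/450 split and a sample of 100 as above; the function and parameter names are mine. It uses Fraction so that nothing is rounded until the final conversion to float.

```python
from fractions import Fraction
from math import comb

def underdog_win_probability(n_a=550, n_b=450, sample=100):
    """Exact P(more than half of the sample are B-students), without replacement."""
    total = comb(n_a + n_b, sample)   # ways to pick any `sample` students
    need = sample // 2 + 1            # 51 when sample = 100
    favorable = sum(
        comb(n_b, b) * comb(n_a, sample - b)  # b B-students, the rest A-students
        for b in range(need, min(n_b, sample) + 1)
    )
    return Fraction(favorable, total)

if __name__ == "__main__":
    print(float(underdog_win_probability()))
```

math.comb returns 0 when asked to choose more items than are available, so impossible splits drop out of the sum automatically.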

My mistake: the binomial distribution is associated with trials with
replacement. But these are clearly trials *without* replacement. Every
time we pick a student out of the total population, we don't put
him back. So it is hypergeometric, not binomial. The trials are *not*
"independent identically distributed", like in the binomial.

Sorry about that.

PPJ.
From: Ray Vickson on
On Jul 16, 10:24 am, gearhead <nos...(a)billburg.com> wrote:
> I'm an engineering undergrad in an intro stats course.  We had a
> question in the book that's really dumb.
>
> problem as stated:
>
>         Your candidate has 55% of the votes in the entire school.  But only
> 100 students will show up to vote.  What is the probability that the
> underdog (the one with 45% support) will win?  To find out, set up a
> simulation.
> a)  Describe how you will simulate a component and its outcomes.
> b)  Describe how you will simulate a trial.
> c)  Describe the response variable.
>
> The answer in the back of the book says using a two digit random
> number to determine each vote (00-54 for your candidate, 55-99 for the
> underdog) you would run a string of trials with 100 votes to each
> trial.
>
> Now, this is one misconceived exercise.  Let me explain why.
>
> Say the school has 1000 students.

In this case you are told that exactly 550 students support candidate
A. Now, if 100 of the 1000 show up, AND IF THE SELECTION OF THE 100 IS
RANDOM, then the number (in 100) voting for A has the so-called
*hypergeometric distribution*. In general, in a population of size N
with N1 of type 1 and N2 of type 2 (N1 +N2 = N), for a random sample
of size n the number X of type 1 in the sample is hypergeometric:
Pr{X = k} = C(N1,k)*C(N2,n-k)/C(N,n), where C(a,b) = binomial coefficient
"a choose b" = a!/[b!*(a-b)!]. For N1 = 550, N2 = 450 and n = 100 we
have P(k) = Pr{k support A} = C(550,k)*C(450,100-k)/C(1000,100), and
you want to compute sum[P(k), k=0..49]. The book wants you to
simulate, but direct computation is easier, especially if you use the
binomial approximation to the hypergeometric (which should be OK
because n = 100 is small compared with N = 1000 and the point of
interest (k = 49) is near the middle of the range 0..100). The
binomial would be exact for "sampling with replacement", where we
select 100 students randomly, one-by-one, so the same student can, by
chance, be selected more than once. Since there are 1000 students and
we are just selecting 100 there is not much chance of having a
"duplicate" in the sample, so there is not much difference between the
exact hypergeometric and approximate binomial. If you do use the
binomial you can get the solution using a spreadsheet, or even a
decent scientific hand-held calculator. For the exact hypergeometric
you can use EXCEL's built-in hypergeometric calculator to compute the
P(k) for k = 0..49 then add them up, or you can use an on-line
calculator, such as
http://stattrek.com/Tables/Hypergeometric.aspx . Of course, you can
also simulate, and maybe the (unnamed) book wants you to do that in
order to familiarize yourself with simulation tools.

What if you don't know the school population? If you ASSUME the
population is large, say N = 1000 or more, then the hypergeometric
h(.55N,.45N,100) is almost the same as the binomial Bi(100,.55), so you
can just use the binomial approximation. However, your original
complaint is valid. In particular, if N is not much larger than
n = 100, the results will depend critically on the precise value of N.
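
To see how strongly the answer depends on N, here is a small Python sketch that evaluates the exact hypergeometric sum for a few school sizes and compares it with the binomial. The list of N values is an arbitrary choice for illustration, and the function names are mine.

```python
from math import comb

def hypergeom_cdf(k_max, n1, n2, n):
    """Pr{X <= k_max}, X = number of type 1 in a sample of n taken without
    replacement from n1 of type 1 and n2 of type 2."""
    total = comb(n1 + n2, n)
    favorable = sum(comb(n1, k) * comb(n2, n - k) for k in range(k_max + 1))
    return favorable / total  # exact big integers, rounded once at the end

def binomial_cdf(k_max, n, p):
    """Pr{X <= k_max} for X ~ Bi(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

if __name__ == "__main__":
    for N in (200, 500, 1000, 10_000):  # illustrative school sizes
        n1 = N * 55 // 100              # supporters of candidate A
        print(N, hypergeom_cdf(49, n1, N - n1, 100))
    print("binomial", binomial_cdf(49, 100, 0.55))
```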

R.G. Vickson

From: Robert Israel on
"porky_pig_jr(a)my-deja.com" <porky_pig_jr(a)my-deja.com> writes:

> On Jul 16, 1:53 pm, "porky_pig...(a)my-deja.com" <porky_pig...(a)my-deja.com> wrote:
> > On Jul 16, 1:24 pm, gearhead <nos...(a)billburg.com> wrote:
> >
> >
> >
> > > But leaving practical considerations aside, that question is
> > > meaningless without knowing something about the magnitude of the
> > > school population.
> >
> > Yes.
> >
>
> Well, scratch the rest out. I was too quick. And wrong. Notice that my
> solution didn't even take into account the total population.
>
> I think, the correct solution goes like this. We have 550 A-students,
> 450 B-students. Say, we want to select exactly 49 A-students and 51 B-
> students. This is hypergeometric distribution: choosing without
> replacement. Now we can choose 49 out of 550 A-students in (550 C 49)
> different ways, we can choose 51 out of B-students in (450 C 51)
> different ways. And we can choose any 100 out of 1000 students in
> (1000 C 100) different ways. So choosing exactly 51 B-students (and 49
> A-students) out of total of 550 A-students and 450 B-students (and
> order does not matter) has a probability
>
> (550 C 49) * (450 C 51)
> --------------------------
> 1000 C 100
>
> And that was just for exactly 51 B-students. Now you have to compute
> the same for 52, 53, ... 100 B-students.
>
> My mistake: the binomial distribution is associated with trials with
> replacement. But these are clearly trials *without* replacement. Every
> time we pick a student out of the total population, we don't put
> him back. So it is hypergeometric, not binomial. The trials are *not*
> "independent identically distributed", like in the binomial.

True: the correct distribution is hypergeometric; the binomial
distribution can be used as an approximation to it, but only in the
case where the population is much larger than the sample size.
There actually is a formula for the cumulative distribution function:
if the population size is N of which S are A-students and the other N-S
are B-students, and the sample size is m, then the probability of obtaining
at most t A-students in the sample is (in Maple's notation)

F(t) = 1 - hypergeom([1, -S+t+1, -m+1+t], [t+2, N-S-m+t+2], 1)
       * m! * S! * (N-m)! * (N-S)!
       / ((t+1)! * (S-t-1)! * (m-t-1)! * (N-S-m+t+1)! * N!)

For example, if N = 1000, S = 550 and m = 100,
F(49) is approximately 0.1220852217. If you used the binomial distribution
with p = 0.55, F(49) would be approximately .1345762132.

Another approximation would be to use the normal distribution with continuity
correction and the mean and standard deviation for the hypergeometric
distribution. The hypergeometric distribution for the number of A-students
in the sample has mean mu = m*S/N and standard deviation
sigma = sqrt(m*(S/N)*(1-S/N)*(N-m)/(N-1)); in this example
mu = 55 and sigma = sqrt(825/37). The normal approximation with
continuity correction is Phi((49.5 - mu)/sigma) = Phi(-1.164760348)
= 0.1220580068. So in this case it is a much better approximation than
the binomial.
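
The normal approximation is easy to reproduce without any statistics package. Here is a minimal Python sketch, using the standard identity Phi(z) = (1 + erf(z/sqrt(2)))/2; the function names are only illustrative.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF Phi(z), computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def hypergeom_normal_approx(t, N, S, m):
    """Normal approximation, with continuity correction, to the probability
    of at most t A-students in a sample of m from a population of N that
    contains S A-students."""
    mu = m * S / N
    sigma = sqrt(m * (S / N) * (1 - S / N) * (N - m) / (N - 1))
    return normal_cdf((t + 0.5 - mu) / sigma)

if __name__ == "__main__":
    print(hypergeom_normal_approx(49, 1000, 550, 100))
```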
--
Robert Israel israel(a)math.MyUniversitysInitials.ca
Department of Mathematics http://www.math.ubc.ca/~israel
University of British Columbia Vancouver, BC, Canada