Fisher's exact test appropriate here? [SAS]

From: Ryan on 9 Jan 2010 21:23

On Jan 9, 7:25 pm, stringplaye...(a)YAHOO.COM (Dale McLerran) wrote:
> --- On Sat, 1/9/10, Ryan <ryan.andrew.bl...(a)GMAIL.COM> wrote:
>
> > From: Ryan <ryan.andrew.bl...(a)GMAIL.COM>
> > Subject: Re: Fisher's exact test appropriate here?
> > To: SA...(a)LISTSERV.UGA.EDU
> > Date: Saturday, January 9, 2010, 5:29 AM
> > On Jan 8, 7:51 pm, stringplaye...(a)YAHOO.COM
> > (Dale McLerran) wrote:
>
> > > Robert,
>
> > > Let me dispose of the statement about whether Fisher's exact
> > > test is using an approximation due to the large sample size.
> > > For a 2x2 table with row totals R1 and R2, column totals C1
> > > and C2, and cell frequencies f11, f12, f21, and f22, the
> > > Fisher exact test depends on the computation
>
> > > P = [ ( R1! * R2! * C1! * C2!) / n! ] /
> > > ( f11! * f12! * f21! * f22!)
>
> > > for different arrangements of the cell frequencies fij.
> > > Now, SAS certainly cannot compute all of these factorials
> > > for the size sample which you have. Note, that
>
> > > X! = GAMMA(X+1)
>
> > > where GAMMA(u) is the gamma function and that
>
> > > log(X!) = lGAMMA(X+1)
>
> > > where lGAMMA is the log gamma function. Now, taking
> > > logarithms, we have
>
> > > log(P) = log( [ ( R1! * R2! * C1! * C2!) / n! ] /
> > > ( f11! * f12! * f21! * f22!) )
>
> > > = log(R1!) + log(R2!) + log(C1!) + log(C2!) - log(n!) -
> > > log(f11!) - log(f12!) - log(f21!) - log(f22!)
>
> > > = lgamma(R1+1) + lgamma(R2+1) + lgamma(C1+1) + lgamma(C2+1) -
> > > lgamma(n+1) -
> > > lgamma(f11+1) - lgamma(f12+1) - lgamma(f21+1) -lgamma(f22+1)
>
> > > Now, what is really important for Fisher's exact test is
> > > not the value of P (or log(P)), but the value of P (log(P))
> > > for the observed table compared to other possible tables
> > > which retain the same marginal frequencies. To the extent
> > > that the computation of log(P) using the lgamma function
> > > retains order, then the computation of Fisher's exact test
> > > is not at all affected by the sample size. I would really
> > > expect that log(P) would at least retain order across all
> > > possible tables which have the specified marginal values.
> > > Thus, the value of the Fisher exact test should not be
> > > compromised at all.
>
> > > Now, as to whether the Fisher exact p-value is better than
> > > p-values based on distributional assumptions (normal, Poisson),
> > > I would think that it wouldn't much matter for the sample size
> > > that you have here. Certainly, for the values which you
> > > present in your post, the Fisher exact test, chi-square test,
> > > and Poisson model all produce nonsignificant p-values. There
> > > is some discrepancy in p-values for the three methods.
> > > However, since all of the methods indicate that the p-value
> > > is greater than 0.50, any discrepancy is of trivial
> > > importance.
>
> > > You might have some other variables which you want to test
> > > (or you might not have revealed correct data). As you get
> > > closer to p=0.05, I would place money that the p-values
> > > will become more and more similar. If you are in the
> > > uncomfortable position of having one test where p<0.05
> > > and another test where p>0.05, the interpretation is not
> > > really any different. Using p<0.05 is a rather arbitrary
> > > choice.
>
> > > Dale
>
> > > ---------------------------------------
> > > Dale McLerran
> > > Fred Hutchinson Cancer Research Center
> > > mailto: dmclerra(a)NO_SPAMfhcrc.org
> > > Ph: (206) 667-2926
> > > Fax: (206) 667-5977
> > > ---------------------------------------
>
> > Dale,
>
> > I computed P*log(P) for a 2X2 table with the following
> > cell
> > frequencies:
>
> > f11=8
> > f12=14
> > f21=75
> > f22=32
>
> > and obtained the value
>
> > P*log(P)=-0.015585312549042
>
> > Here are the formulas I used to compute P*log(P) in another
> > stats
> > program:
>
> > ---------------
>
> > log_p = lngamma(22+1) + lngamma(107+1) + lngamma(83+1) +
> > lngamma(46+1) - lngamma(129+1) -
> > lngamma(8+1) - lngamma(14+1) - lngamma(75+1) -
> > lngamma(32+1)
>
> > p_log_p = exp(log_p)*log_p
>
> > --------------
>
> > If you have the time, could you please tell me what I did incorrectly
> > and exactly what p_log_p represents?
>
> > Thanks,
>
> > Ryan
>
> Ryan,
>
> When I wrote "but the value of P (log(P))", you apparently
> interpreted that to mean that we would compute P*log(P).
> Previously in the same sentence, I had written "what is
> really important for Fisher's exact test is not the value
> of P (or log(P))". So, when I wrote "P (log(P))" in the
> same sentence, I meant for that to be interpreted as
> "P (or log(P))". But I was not really clear on that.
>
> Now, this P that we compute according to the formula specified
> above is NOT the Fisher's exact test p-value. Rather, it
> is a probability under multinomial sampling of the
> particular table with observed f11, f12, f21, and f22
> among all tables which have the observed marginal frequencies
> R1, R2, C1, and C2. Given those same fixed marginals, there
> could be other values f11~, f12~, f21~, and f22~ which
> could have been observed.
>
> The Fisher's exact test p-value is obtained by computing P
> for the observed table as well as P~ (where P~ is the value
> of P computed for f11~, f12~, f21~, and f22~) for all possible
> tables having the observed marginal frequencies. We then
> compare P against the distribution of P~.
>
> Note, though, that P and log(P) have a monotonic relationship
> so that we could also compare log(P) against the distribution
> of log(P~). For that matter, P*log(P) is monotonically
> decreasing in P as long as P < 1/e (about 0.37), which holds
> for every table probability here. So, the statistic which you
> have computed above could be employed to construct the Fisher's
> exact test p-value if you compare the observed table value
> of P*log(P) against the distribution of P~*log(P~), keeping in
> mind that the ordering is reversed: a smaller P gives a larger
> (less negative) P*log(P).
>
> Dale
>
> ---------------------------------------
> Dale McLerran
> Fred Hutchinson Cancer Research Center
> mailto: dmclerra(a)NO_SPAMfhcrc.org
> Ph: (206) 667-2926
> Fax: (206) 667-5977
> ---------------------------------------

Dale,

Thank you for the clarification. I found a simple example online
demonstrating how to calculate the one-tailed p-value for Fisher's
exact test using the factorial formula. Out of interest, I decided to
reproduce that one-tailed p-value using the lgamma form of the formula
you presented.

The observed frequencies table is:

-----
7 2
5 6
-----

The two more extreme tables with the same marginal frequencies are:

------
8 1
4 7
-----

and

-----
9 0
3 8
-----

I solved for Fisher's exact one tailed p-value using the following
equations:

-----
log_p_obs = lngamma(9+1) + lngamma(11+1) + lngamma(12+1) + lngamma(8+1) -
            lngamma(20+1) -
            lngamma(7+1) - lngamma(2+1) - lngamma(5+1) - lngamma(6+1)
p_obs = exp(log_p_obs)
-----
log_p_alt1 = lngamma(9+1) + lngamma(11+1) + lngamma(12+1) + lngamma(8+1) -
             lngamma(20+1) -
             lngamma(8+1) - lngamma(1+1) - lngamma(4+1) - lngamma(7+1)
p_alt1 = exp(log_p_alt1)
-----
log_p_alt2 = lngamma(9+1) + lngamma(11+1) + lngamma(12+1) + lngamma(8+1) -
             lngamma(20+1) -
             lngamma(9+1) - lngamma(0+1) - lngamma(3+1) - lngamma(8+1)
p_alt2 = exp(log_p_alt2)
-----
Fisher_onetailed_p = p_obs + p_alt1 + p_alt2
                   = 0.157
-----
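
The three table probabilities above can be cross-checked outside any
particular stats package. The following is only a minimal sketch in
Python (used here purely for illustration; the helper name table_prob is
hypothetical and not part of any library). It evaluates the lgamma form
of the formula for the observed table and the two more extreme tables,
then sums them.

```python
from math import exp, lgamma

def table_prob(f11, f12, f21, f22):
    """Probability of one 2x2 table given its margins, via log-gamma."""
    r1, r2 = f11 + f12, f21 + f22      # row totals
    c1, c2 = f11 + f21, f12 + f22      # column totals
    n = r1 + r2                        # grand total
    log_p = (lgamma(r1 + 1) + lgamma(r2 + 1)
             + lgamma(c1 + 1) + lgamma(c2 + 1)
             - lgamma(n + 1)
             - lgamma(f11 + 1) - lgamma(f12 + 1)
             - lgamma(f21 + 1) - lgamma(f22 + 1))
    return exp(log_p)

# Observed table, then the two tables that are more extreme in the same direction.
tables = [(7, 2, 5, 6), (8, 1, 4, 7), (9, 0, 3, 8)]
probs = [table_prob(*t) for t in tables]
one_tailed_p = sum(probs)
print(probs, round(one_tailed_p, 3))
```

Rounded to three decimals, the sum agrees with the 0.157 reported above.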

I am fairly certain my calculations are correct. This was an
interesting exercise. Thanks again, Dale.

Ryan

From: xlr82sas on 9 Jan 2010 21:27

Hi Dale,

see
http://en.wikipedia.org/wiki/Fisher%27s_exact_test

If I use your formula I get a p-value

0.0026221258207619

If I use proc freq for the two-tailed p-value of Fisher's exact
test I get

0.0026221258207619

I think the Exact P-value for a 2x2 table is just a statistic of the
hypergeometric distribution. Your P is just a result of evaluating the
hypergeometric distribution.

Consider 2x2 table

a b
c d

p = [ (a+b)! (c+d)! (a+c)! (b+d)! ] / ( n! a! b! c! d! )

The lgamma makes it easy to evaluate the factorials. Since a! = gamma(a
+1) we have all the plus ones.
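
The equivalence between the lgamma form and the hypergeometric
probability can be checked numerically. This is a minimal sketch in
Python, used only for illustration; the function names p_lgamma and
p_hypergeom are hypothetical. It evaluates both forms for the table
8 14 / 75 32 discussed in this thread.

```python
from math import comb, exp, lgamma

def p_lgamma(a, b, c, d):
    """Table probability from the factorial formula, using log-gamma."""
    n = a + b + c + d
    log_p = (lgamma(a + b + 1) + lgamma(c + d + 1)
             + lgamma(a + c + 1) + lgamma(b + d + 1)
             - lgamma(n + 1)
             - lgamma(a + 1) - lgamma(b + 1)
             - lgamma(c + 1) - lgamma(d + 1))
    return exp(log_p)

def p_hypergeom(a, b, c, d):
    """Same probability written as a hypergeometric pmf with binomial coefficients."""
    n = a + b + c + d
    # Row 1 takes a of the (a+c) column-1 units and b of the (b+d) column-2 units.
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)

p1 = p_lgamma(8, 14, 75, 32)
p2 = p_hypergeom(8, 14, 75, 32)
print(p1, p2)
```

Both forms give the probability of one particular table. A p-value for
the test is a sum of such probabilities over a set of tables.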

From: Dale McLerran on 11 Jan 2010 01:37

Just to be clear, the value 0.0026... is the probability of
the observed table under the hypergeometric distribution.
However, the (two-tailed) p-value for Fisher's exact test
for the data which were given as

8 14
75 32

is p=0.0060...

The following code uses PROC FREQ to evaluate table probabilities
for all 2x2 tables which have the following structure:

f11   f12 |  22
f21   f22 | 107
 83    46 | 129

If we add the table probabilities for all tables which have
f11<=8, then we get the "Left-sided Pr <= F" value which is
presented for the table with f11=8. Adding up the table
probabilities which are at least as extreme as the observed
table probability, we obtain the value shown in the row
"Two-sided Pr <= P".

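/* Build every 2x2 table with row totals 22/107 and column totals 83/46. */
/* The index i is the value of f11 in each table.                        */
data test;
  do i=0 to 22;
    f11 = i; f12 = 22-i; f21 = 83-i; f22 = 129-f11-f12-f21;
    x=0; y=0; freq=f11; output;
    x=0; y=1; freq=f12; output;
    x=1; y=0; freq=f21; output;
    x=1; y=1; freq=f22; output;
  end;
  keep i x y freq;
run;

/* Fisher's exact test for the observed table (f11=8) */
proc freq data=test(where=(i=8));
  weight freq;
  tables x*y / chisq;
run;

/* Capture the table probability (P_TABLE) for each of the 23 tables */
ods listing close;
ods output FishersExact=Fisher(where=(name1="P_TABLE"));
proc freq data=test;
  by i;
  weight freq;
  tables x*y / chisq;
run;
ods listing;

/* Sum table probabilities to rebuild the left-sided and two-sided p-values */
data _null_;
  set Fisher(where=(i=8) rename=(nvalue1=P_observed));
  do j=0 to 22;
    pointer=j+1;
    set Fisher point=pointer;
    if j<=8 then left+nvalue1;
    if nvalue1<=P_observed then pval_2tailed+nvalue1;
  end;
  put left= pval_2tailed=;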
run;
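
As an independent cross-check on the PROC FREQ approach, the same
enumeration can be sketched without SAS. The Python below is only an
illustration under stated assumptions: the helper name table_prob is
hypothetical, and the small relative tolerance used when comparing table
probabilities is a choice made for this sketch to guard against
floating-point ties, not something taken from PROC FREQ.

```python
from math import exp, lgamma

def table_prob(f11, r1, c1, n):
    """Probability of the 2x2 table with cell f11, given margins r1, c1 and total n."""
    f12, f21 = r1 - f11, c1 - f11
    f22 = n - r1 - c1 + f11
    r2, c2 = n - r1, n - c1
    log_p = (lgamma(r1 + 1) + lgamma(r2 + 1)
             + lgamma(c1 + 1) + lgamma(c2 + 1)
             - lgamma(n + 1)
             - lgamma(f11 + 1) - lgamma(f12 + 1)
             - lgamma(f21 + 1) - lgamma(f22 + 1))
    return exp(log_p)

r1, c1, n = 22, 83, 129        # first row total, first column total, grand total
observed = 8                   # observed f11

# f11 can range over every value that keeps all four cells non-negative.
lo, hi = max(0, r1 + c1 - n), min(r1, c1)
probs = {k: table_prob(k, r1, c1, n) for k in range(lo, hi + 1)}

p_obs = probs[observed]
# Left-sided: tables with f11 no larger than observed.
left = sum(p for k, p in probs.items() if k <= observed)
# Two-sided: tables no more probable than the observed one (tolerance is an assumption).
two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
print(p_obs, left, two_sided)
```

Here p_obs is the single-table probability (0.0026...), while two_sided
adds contributions from both tails and lands near the 0.0060 quoted
above.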

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra(a)NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
