From: Ray Koopman on 4 Oct 2009 17:18

On Oct 4, 12:19 pm, James Waldby <n...(a)no.no> wrote:
> On Sun, 04 Oct 2009 11:18:53 -0700, Ray Koopman wrote:
>> [reply cross-posted to sci.stat.consult]
>>
>> On Oct 4, 9:35 am, James Waldby <n...(a)no.no> wrote:
>>> On Sun, 04 Oct 2009 08:59:32 -0400, federico wrote:
>>>
>>> [Re his earlier post of 04 Oct 2009 04:15:19 -0400 in which he wrote,
>>>> In brief, I have a simulator ... from which I observe 2 correlated
>>>> random variables, called "Events" and "Failures" (the meaning of
>>>> "Events" is "total requests submitted to the system" while "Failures"
>>>> counts the number of requests which experienced a failure (i.e.
>>>> requests which have not been successfully served by the system)...
>>>> Since I have to compare such [data] with the value calculated by
>>>> another tool (to which of course I submit the same system), I'd like
>>>> to say, with statistical evidence, whether the results provided by
>>>> the 2 tools are similar or not.
>>> ]
>>>> Hi hagman, and thank you so much for your reply. Maybe I didn't
>>>> explain the problem well: each run is an execution of the system,
>>>> independent of the others. So, for example, with 10 runs I got the
>>>> following sample:
>>>>
>>>> Fail.  Events  Fail/Events  Reliability
>>>>  224    1956   0,114519427  0,885480573
>>>>  217    1950   0,111282051  0,888717949
>>>>  192    1976   0,097165992  0,902834008
>>>>  196    1966   0,099694812  0,900305188
>>>>  190    1935   0,098191214  0,901808786
>>>>  196    1937   0,101187403  0,898812597
>>>> [...]
>>>>
>>>> The mean reliability over the 10 samples is 0,896434285. The other
>>>> tool I mentioned provided me with a value of 0.902 for reliability:
>>>> I'd like to say, in a non-empiric way, that the two tools provided a
>>>> statistically similar result.
>>>
>>> It probably would be better to randomly create n different system
>>> settings, S1 ... Sn, and for each Si, run the system once in each of
>>> the two simulators, and then use a paired t-test with n-1 degrees of
>>> freedom (see <http://en.wikipedia.org/wiki/Paired_difference_test>
>>> and <http://en.wikipedia.org/wiki/Statistical_hypothesis_testing>).
>>>
>>> Usual t-test assumptions are that the random variables are normally
>>> distributed, but as n grows larger (e.g. n > 30) that provision
>>> becomes less important.
>>
>> Yes, the same simulator settings should be used for both tools, so
>> that a paired-data analysis can be done. However, confidence intervals
>> for the reliabilities and for the differences in reliabilities would
>> be more to the point than would a t-test on the mean difference. And a
>> simple scatterplot of the reliability pairs also would be informative.
>> (Most statisticians would probably do the scatterplot first, as a
>> diagnostic.)
>
> Yes, a plot should be made. However, the OP wrote, "I'd like to say, in
> a non-empiric way, that the two tools provided a statistically similar
> result".

Whatever "non-empirical statistical similarity" might be, it sounds
oxymoronic to me.

> You apparently see this as a question like "Are the confidence
> intervals different?", while I think the more basic question, "Are the
> means different?" should be answered first. If there is a statistically
> significant difference in means, further statistics might be useless
> except as a guide to correcting simulation-model errors.

I suggested CIs for the reliabilities themselves only for completeness.
What he really needs is a CI for the mean difference, which I also
suggested.
As with any CI, it provides more information than a significance test: it tells the user the range of values within which the true mean difference might reasonably be taken to lie. (A hypothesis test would reject any value that is not in the interval, and would not reject any value that is in the interval.) If the CI does not contain zero then the hypothesis of equal reliabilities would be rejected, but the two methods might still be considered to have "similar" reliabilities if the entire interval is (subjectively) close enough to zero. On the other hand, if the CI does contain zero then the hypothesis of equal reliabilities could not be rejected, but the interval might be so wide that the assertion of "similar" reliabilities would not be supportable.
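To make the paired-difference CI concrete, here is a minimal Python sketch. It assumes both tools were run on the same ten settings; the tool-A values are the reliabilities quoted in the thread, while the tool-B values are hypothetical placeholders for illustration only, not data from the thread.

# Minimal sketch of a 95% CI for the mean difference in reliability between
# two tools run on the same simulator settings (paired data).
import numpy as np
from scipy import stats

tool_a = np.array([0.885481, 0.888718, 0.902834, 0.900305, 0.901809,
                   0.898813, 0.898390, 0.899371, 0.895425, 0.893098])
tool_b = np.array([0.893, 0.896, 0.905, 0.903, 0.904,
                   0.901, 0.900, 0.902, 0.898, 0.897])   # hypothetical values

d = tool_a - tool_b                       # per-setting reliability differences
n = d.size
mean_d = d.mean()
se_d = d.std(ddof=1) / np.sqrt(n)         # standard error of the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% critical value
lo, hi = mean_d - t_crit * se_d, mean_d + t_crit * se_d

print(f"mean difference = {mean_d:.5f}")
print(f"95% CI          = ({lo:.5f}, {hi:.5f})")
# scipy.stats.ttest_rel(tool_a, tool_b) gives the corresponding paired t-test,
# but the CI is more informative: calling the tools "similar" requires the
# whole interval to be close to zero, not merely that it covers zero.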
From: Rich Ulrich on 4 Oct 2009 17:47

On Sun, 4 Oct 2009 11:18:53 -0700 (PDT), Ray Koopman <koopman(a)sfu.ca> wrote:

>[reply cross-posted to sci.stat.consult]
>
>On Oct 4, 9:35 am, James Waldby <n...(a)no.no> wrote:
>> On Sun, 04 Oct 2009 08:59:32 -0400, federico wrote:
>>
>> [Re his earlier post of 04 Oct 2009 04:15:19 -0400 in which he wrote,
>>
>>> In brief, I have a simulator ... from which I observe 2 correlated
>>> random variables, called "Events" and "Failures" (the meaning of
>>> "Events" is "total requests submitted to the system" while "Failures"
>>> counts the number of requests which experienced a failure (i.e.
>>> requests which have not been successfully served by the system)...
>>> Since I have to compare such [data] with the value calculated by
>>> another tool (to which of course I submit the same system), I'd like
>>> to say, with statistical evidence, whether the results provided by
>>> the 2 tools are similar or not.
>> ]
>>> Hi hagman, and thank you so much for your reply. Maybe I didn't
>>> explain the problem well: each run is an execution of the system,
>>> independent of the others. So, for example, with 10 runs I got the
>>> following sample:
>>
>>> Fail.  Events  Fail/Events  Reliability
>>>  224    1956   0,114519427  0,885480573
>>>  217    1950   0,111282051  0,888717949
>>>  192    1976   0,097165992  0,902834008
>>>  196    1966   0,099694812  0,900305188
>>>  190    1935   0,098191214  0,901808786
>>>  196    1937   0,101187403  0,898812597
>>>  202    1988   0,101609658  0,898390342
>>>  192    1908   0,100628931  0,899371069
>>>  192    1836   0,104575163  0,895424837
>>>  206    1927   0,10690192   0,89309808
>>
>>> The mean reliability over the 10 samples is 0,896434285. The other
>>> tool I mentioned provided me with a value of 0.902 for reliability:
>>> I'd like to say, in a non-empiric way, that the two tools provided a
>>> statistically similar result.

I spent a few minutes wondering what "reliability" might be ... It might
be easier to notice if you reported a useful 3 digits of accuracy -- the
Reliability is nothing but (1-F) for the rate of Failures.

So, of the 10 Runs reported above, one of them matches the *mean* of the
other sample, and 9 of them are worse. Statistically, they probably
differ in a test of means, too.

However, it is also interesting to note that the rate of Failures is
*more* consistent than one would expect by chance, if these were assumed
to be "failures" under a binomial probability. What this seems to imply
is that the Failures are probably unevenly distributed between some
"types" of events -- whether you have previously given them "types" or
not.

If the Events are the same in both cases, it would be useful to match
them up, and see how many of Fail-1 are also Fail-2, and how many are
not, etc., for the 4 possibilities. If they are matched by event, you
might be able to say whether the methods (say) differ in that Method-x
labels 20 or so additional instances.

>> It probably would be better to randomly create n different system
>> settings, S1 ... Sn, and for each Si, run the system once in each of
>> the two simulators, and then use a paired t-test with n-1 degrees of
>> freedom (see <http://en.wikipedia.org/wiki/Paired_difference_test>
>> and <http://en.wikipedia.org/wiki/Statistical_hypothesis_testing>).

I think this Reply is suggesting that you adjust settings between *runs*
-- so that you have 10 runs that come up with different rates of failure.
That could be useful, too. That would be in the nature of bench-marking
on different sorts of tasks.

But if every potential Event can be matched between methods, that should
be preferable, and more informative.

>> Usual t-test assumptions are that the random variables are normally
>> distributed, but as n grows larger (e.g. n > 30) that provision
>> becomes less important.
>
>Yes, the same simulator settings should be used for both tools, so that
>a paired-data analysis can be done. However, confidence intervals for
>the reliabilities and for the differences in reliabilities would be more
>to the point than would a t-test on the mean difference. And a simple
>scatterplot of the reliability pairs also would be informative. (Most
>statisticians would probably do the scatterplot first, as a diagnostic.)

--
Rich Ulrich
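A quick way to check Ulrich's observation that the failure rate is more consistent than binomial sampling would predict is to compare the run-to-run standard deviation of the observed rates with the binomial standard error implied by the pooled rate. A rough Python sketch, using the ten runs quoted above:

# Rough check of whether the run-to-run spread in failure rate is smaller
# than a simple binomial model would predict.
import numpy as np

fails  = np.array([224, 217, 192, 196, 190, 196, 202, 192, 192, 206])
events = np.array([1956, 1950, 1976, 1966, 1935, 1937, 1988, 1908, 1836, 1927])

rates = fails / events
p_hat = fails.sum() / events.sum()   # pooled failure probability
n_bar = events.mean()                # typical number of events per run

observed_sd = rates.std(ddof=1)
binomial_sd = np.sqrt(p_hat * (1 - p_hat) / n_bar)  # expected SD of the rate
                                                    # if each run were
                                                    # Binomial(n_bar, p_hat)

print(f"observed SD of rates = {observed_sd:.5f}")
print(f"binomial-model SD    = {binomial_sd:.5f}")
# observed_sd below binomial_sd indicates under-dispersion: the failures are
# not behaving like independent Bernoulli trials with one common probability.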
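And if individual events really can be matched between the two methods, the four-way tally Ulrich describes can be cross-tabulated and tested on the discordant cells with McNemar's test (a standard choice here, though not one named in the thread). The per-event outcome arrays below are hypothetical placeholders; real ones would come from matching the same events across both tools.

# Sketch of the matched, per-event comparison: tabulate whether each event
# failed under tool A and/or tool B, then test whether the two tools disagree
# asymmetrically (McNemar's exact test on the discordant cells).
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_events = 2000
fail_a = rng.random(n_events) < 0.104   # placeholder: event failed in tool A?
fail_b = rng.random(n_events) < 0.098   # placeholder: event failed in tool B?

both    = int(np.sum(fail_a & fail_b))
only_a  = int(np.sum(fail_a & ~fail_b))
only_b  = int(np.sum(~fail_a & fail_b))
neither = int(np.sum(~fail_a & ~fail_b))
print(f"both={both}  only_A={only_a}  only_B={only_b}  neither={neither}")

# Under "no systematic difference", a discordant event is equally likely to
# fall in only_A or only_B, so test only_a against Binomial(only_a+only_b, 0.5).
result = binomtest(only_a, only_a + only_b, p=0.5)
print(f"McNemar exact p-value = {result.pvalue:.3f}")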
From: Michael on 4 Oct 2009 14:10

Musatov wrote:

[snip - full re-quote of the Waldby/Koopman exchange above]

IRT-based reliabilities for the five scales range from .83 to .91. Q.E.D.
From: Michael on 4 Oct 2009 14:14

Musatov wrote [http://meami.org]:

[snip - full re-quote of Rich Ulrich's reply above]

Performance Matched Discretionary Accrual Measures more importantly, does
not accommodate potential non-linearity in the relation between the event
conditions simulated in table 3, the performance matched
http://web.mit.edu/kothari/www/attach/KLW 2002.pe

--Musatov