From: naimead on
Hello all,

Is it possible to calculate the 95th percentile of a non-Gaussian distribution without having to define its exact distribution?

Thank you very much,

naimead
From: ImageAnalyst on
Of course. Just use hist() to get the actual distribution, and then
cumsum() on the histogram to figure out when you've passed the 95%
point. No assumption of the actual mathematical form of the histogram
distribution is necessary.
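
A minimal sketch of the hist()/cumsum() approach, assuming the data are
held in a vector x. The exponential test data and the choice of 200 bins
are illustrative assumptions, not something from the original question.

```matlab
% Hypothetical skewed (exponential) sample standing in for the real data.
x = -log(rand(1, 10000));

% Empirical distribution: bin counts and bin centers (200 bins assumed).
[counts, centers] = hist(x, 200);

% Cumulative fraction of the samples up to and including each bin.
cdf = cumsum(counts) / sum(counts);

% First bin at which at least 95% of the samples have been passed.
idx = find(cdf >= 0.95, 1, 'first');
p95 = centers(idx);   % resolution is limited to roughly one bin width
```

The result is only as fine as the bin width, so more bins (or sorting
the data directly) gives a sharper answer.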
From: Peter Perkins on
On 5/5/2010 11:09 AM, naimead wrote:
> Is it possible to calculate the 95th percentile of a non-Gaussian distribution without having to define its exact distribution?

If you have access to the Statistics Toolbox,

>> help prctile
PRCTILE Percentiles of a sample.

>> help quantile
QUANTILE Quantiles of a sample.
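
A short usage sketch, assuming the Statistics Toolbox is installed and
the data are in a vector x (the exponential test data are made up for
illustration). The last two lines are a toolbox-free nearest-rank
approximation added for comparison; it will generally differ slightly
from the interpolated toolbox result.

```matlab
% Hypothetical skewed sample standing in for the real data.
x = -log(rand(1, 10000));

p95 = prctile(x, 95);      % percentile given on a 0-100 scale
q95 = quantile(x, 0.95);   % same quantity, probability on a 0-1 scale

% Toolbox-free approximation: the k-th smallest value, k = ceil(0.95*n).
s   = sort(x);
r95 = s(ceil(0.95 * numel(x)));
```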
From: Walter Roberson on
naimead wrote:

> Is it possible to calculate the 95th percentile of a non-Gaussian distribution without having to define its exact distribution?

Maybe. In theory, the description of a distribution could be piecewise
and exactly describe at least the last 5 percent; in such a case you
could calculate the 95-percentile without knowing the distribution of
the rest.

In situations where the distribution is defined by a set of data rather
than by a formula, as you start at the beginning of the data and proceed
through it, you can narrow down the range of where the 95-percentile
would be. Can you locate the 95-percentile without examining all of the
data? _Sometimes_ you can: if data values can repeat, and the bounds you
are tracking for the 95-percentile fall entirely within a block of
repeated values, then you can terminate early.
As a simple thought experiment on those lines, if your data set had 101
samples and you had examined 96 of them so far and had found they were
all the same, then you do not need to examine the remaining data
samples, as no matter what they are they would not be able to push the
95-percentile away from that repeated value. If, though, you had only
examined 95 of the 101 data samples so far, you cannot know whether the
values of those unexamined samples might happen to all be above the
repeated value: if they do happen to be, then the 95-percentile would be
one of them, whereas if even one of them was less than or equal to the
known repeated value, then the 95-percentile would be the repeated value
in this hypothetical situation.
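
The thought experiment can be checked numerically. This is only a
sketch: it assumes the nearest-rank definition of a percentile (the k-th
smallest of n values, with k = ceil(p/100 * n)), and the repeated value
v and the random fill-in values are made up.

```matlab
n = 101;  p = 95;
k = ceil(p/100 * n);   % nearest-rank position: the 96th smallest value
v = 7;                 % hypothetical repeated value

% Case 1: 96 samples seen, all equal to v; the 5 unseen ones are arbitrary.
allSame = true;
for trial = 1:1000
    unseen  = v + 100*randn(1, n - 96);        % anything at all
    s       = sort([repmat(v, 1, 96), unseen]);
    allSame = allSame && (s(k) == v);          % percentile never moves
end

% Case 2: only 95 samples seen, and all 6 unseen values lie above v.
s         = sort([repmat(v, 1, 95), v + (1:6)]);
movedAway = (s(k) ~= v);   % the percentile is now one of the unseen values

% Case 3: 95 seen, but one of the 6 unseen values is below v.
s        = sort([repmat(v, 1, 95), v - 1, v + (1:5)]);
stayedAt = (s(k) == v);    % the percentile is back at the repeated value
```

With an interpolating percentile definition the cut-off counts could
shift by one, so the exact numbers depend on the definition assumed.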

In situations where the distribution is defined by an unknown formula...
I'm not sure... possibly if you had enough information _about_ the formula.

In situations where the distribution is defined by a known formula that
has one or more parameters whose values are not currently known:
_Sometimes_ you can. The known formula might be manipulable to calculate
the mean and standard deviation in terms of a relationship between the
unknown parameters, and you might know what the value of that
relationship is without knowing the parameters themselves. For example,
the mean and standard deviation might come down to the ratio of two
unknown parameters, and you might know their ratio without knowing their
exact values at the time of the calculation.
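
As a concrete illustration of the "known ratio" case (an example chosen
here, not one from the original question): assume an exponential
distribution whose rate is a/b. Its CDF is 1 - exp(-(a/b)*x), so the
95-percentile is log(20)/(a/b) and depends only on the ratio.

```matlab
r   = 2;              % known ratio a/b; a and b themselves are unknown
x95 = log(20) / r;    % solves 1 - exp(-r*x) = 0.95

% Sanity check: the CDF evaluated at x95 should be 0.95.
cdfAtX95 = 1 - exp(-r * x95);

% Two hypothetical (a, b) pairs with the same ratio give the same answer.
a = [2 10];
b = [1  5];
x95pairs = log(20) ./ (a ./ b);
```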
From: Walter Roberson on
ImageAnalyst wrote:
> Of course. Just use hist() to get the actual distribution, and then
> cumsum() on the histogram to figure out when you've passed the 95%
> point. No assumption of the actual mathematical form of the histogram
> distribution is necessary.

The problem statement did not indicate that there is a set of samples
that _defines_ the distribution: it was a general enough question to
apply to distributions defined by a formula whose parameters are not
completely known.

If there is a set of samples, then one needs to know whether the samples
define the distribution or have instead been sampled from the
distribution. If the data have been sampled from a distribution, then
you cannot use a histogram of the samples to find the distribution's
95-percentile exactly; you can only estimate it probabilistically, in
the manner of "This poll is accurate to within 3%, 19 times out of 20".
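
One way to put a number on that "19 times out of 20" is a bootstrap
interval around the sample percentile. The sketch below is not a method
proposed in this thread; the sample, its size, the number of resamples
B, and the nearest-rank percentile definition are all assumptions.

```matlab
% Hypothetical sample of 500 draws from an exponential distribution.
n = 500;
x = -log(rand(1, n));

k   = ceil(0.95 * n);       % nearest-rank position of the 95-percentile
s   = sort(x);
est = s(k);                 % point estimate from this one sample

% Bootstrap: resample with replacement B times, recompute the percentile.
B    = 2000;
idx  = randi(n, n, B);      % each column indexes one resample
xs   = sort(x(idx), 1);     % sort each resample (column-wise)
boot = xs(k, :);            % 95-percentile of each resample

% Middle 95% of the bootstrap estimates: a rough 19-in-20 interval.
boot = sort(boot);
ci   = [boot(ceil(0.025 * B)), boot(ceil(0.975 * B))];
```

The interval narrows as n grows, which is the sense in which a sample
locates the percentile only approximately.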