From: Samoline1 Linke on
"Wayne King" <wmkingty(a)gmail.com> wrote in message <hbpr3j$p0r$1(a)fred.mathworks.com>...
> "Samoline1 Linke" <fixed-term.jehandad.khan(a)de.bosch.com> wrote in message <hbpo3u$4qb$1(a)fred.mathworks.com>...
> > "Wayne King" <wmkingty(a)gmail.com> wrote in message <hbpk0q$9pj$1(a)fred.mathworks.com>...
> > > "Samoline1 Linke" <fixed-term.jehandad.khan(a)de.bosch.com> wrote in message <hbovqb$5s6$1(a)fred.mathworks.com>...
> > > > "Wayne King" <wmkingty(a)gmail.com> wrote in message <hbnbmv$gft$1(a)fred.mathworks.com>...
> > > > > "Samoline1 Linke" <fixed-term.jehandad.khan(a)de.bosch.com> wrote in message <hbn8e0$31a$1(a)fred.mathworks.com>...
> > > > > > hi,
> > > > > >
> > > > > > as mentioned in example (load count.dat) of topic 'Removing Outliers'
> > > > > >
> > > > > > The script written for the count.dat example removes any value that lies more than 3 standard deviations from the mean of its column.
> > > > > >
> > > > > > But after removing the outliers, if you run boxplot(count) you can still see outliers in the second box.
> > > > > >
> > > > > > Why is that so?
> > > > > Hi, outliers on a boxplot are defined to be greater than 1.5*IQR above the upper quartile, or less than 1.5*IQR below the lower quartile.
> > > > >
> > > > > load count.dat
> > > > > % just focusing on outliers above the mean
> > > > > UpperLimit = mean(count)+3*std(count);
> > > > > % returns 108.1109 170.7588 269.6676
> > > > > % but
> > > > > quantile(count,.75)+1.5*iqr(count) % requires stats toolbox
> > > > > % returns 87.5000 134.2500 205.7500
> > > > >
> > > > > so it's possible to have "outliers" on a boxplot that are not more than 3 standard deviations above (or below) the mean.
> > > > >
> > > > > Hope that helps,
> > > > > wayne
> > > >
> > > > -------------------------
> > > >
> > > > What you are saying is correct but still I am asking something else. I am saying
> > > >
> > > > mean (count)
> > > >
> > > > 32.0000 46.5417 65.5833
> > > >
> > > > std (count)
> > > >
> > > > 25.3703 41.4057 68.0281
> > > >
> > > > So technically speaking, any point in the first column should be removed if it lies outside (32 - 3*25.3703, 32 + 3*25.3703), i.e. outside (-44.1, 108.1).
> > > >
> > > > But after removing the outliers, the boxplot still shows an outlier in the first column, at row 20, with value 114.
> > > >
> > > > My question is: why was this point not removed as an outlier, since it is > 108.1?
> > >
> > > Hi, I'm not sure how you "removed" the outlier, but I don't see that it shows up as an outlier in the boxplot. To take your example,
> > >
> > > load count.dat
> > > countCol1 =count(:,1); % just get 1st column to follow your example
> > > boxplot(countCol1) % you see the "outlier"
> > > indices = find(countCol1>108.1); %108.1 mean plus 3 std
> > > countCol1(indices) = [];
> > > boxplot(countCol1) % "outlier" is gone
> > >
> > > Perhaps you didn't actually remove it?
> > >
> > > Hope that helps,
> > > wayne
> >
> > -----------------------------------------------
> >
> > Thanks for explaining in such detail, but let me tell you that I followed the same method as given in the MATLAB help.
> >
> > Maybe I should type it here:
> >
> > mu = mean (count)
> >
> > sigma = std (count)
> >
> > [n,p] = size(count)
> >
> > Meanmat = repmat (mu , n , 1)
> >
> > Sigmamat = repmat (sigma, n, 1)
> >
> > outliers = abs (count - Meanmat) > 3* Sigmamat
> >
> > nout = sum (outliers) % shows how many outliers each column has
> >
> > count( any( outliers, 2), :) = [] % removes the entire row which is good
> >
> >
> > You can find this by searching for 'Removing Outliers' in the help section.
> >
> > Thanks for taking an interest. I hope we find the root cause.
>
> Hi, the answer lies in my first response, if you execute the code:
>
> mu = mean (count);
> sigma = std (count);
> [n,p] = size(count);
> Meanmat = repmat (mu , n , 1);
> Sigmamat = repmat (sigma, n, 1);
> outliers = abs (count - Meanmat) > 3* Sigmamat ;
> count( any( outliers, 2), :) = [];
> boxplot(count)
>
> The only "outliers" that appear in the boxplot are in the 2nd column of count, BUT these are NOT outliers if the definition of outlier is more than 3 standard deviations above the mean. They ARE outliers if your definition of outlier is greater than upper quartile + 1.5*IQR. Remember, for the 2nd column of data, the mean plus 3 standard deviations is 170.7588.
>
> If you look at the 2nd column of count, there are no values greater than 170.7588, so if you try to remove data values that exceed 3 standard deviations, you remove nothing. However, the upper quartile + 1.5*IQR for column two is 134.2500, so two values exceed that. The boxplot shows these as outliers.
>
> wayne


------------

Thanks Wayne. It seems you are correct. I don't know why I assumed the boxplot marks points beyond mean + 3*std. With your definition, it makes sense.

Another question: which definition of outliers is correct? The MATLAB example removes points beyond mean +/- 3*std, but you say the boxplot uses the interquartile range. I think it's safer to use the first definition, because otherwise you might remove many points as outliers that actually are not. Right?
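
To see how differently the two rules behave, something like the following could be run on the original data. It is only a rough sketch: it assumes count.dat loads as the 24-by-3 matrix from the help example and that the Statistics Toolbox is installed (quantile and iqr need it, as Wayne noted). The names nSigma, nFence, loFence and hiFence are just ones made up for this comparison.

```matlab
% Count, per column, how many values each outlier rule would flag.
load count.dat                       % creates the matrix "count"
n = size(count, 1);

% Rule 1: more than 3 standard deviations away from the column mean
Meanmat  = repmat(mean(count), n, 1);
Sigmamat = repmat(std(count),  n, 1);
nSigma   = sum(abs(count - Meanmat) > 3*Sigmamat)

% Rule 2: boxplot rule, more than 1.5*IQR beyond the quartiles
% (quantile and iqr require the Statistics Toolbox)
w       = 1.5 * iqr(count);
loFence = repmat(quantile(count, 0.25) - w, n, 1);
hiFence = repmat(quantile(count, 0.75) + w, n, 1);
nFence  = sum(count < loFence | count > hiFence)
```

Both results are 1-by-3 row vectors, one count per column, so they can be compared side by side.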