# Is it reasonable to represent mean value without removal of the outliers?

by Sabbir Ahmed   Last Updated October 09, 2019 15:19 PM

First, I want to say that I know the mean is sensitive to outliers.

Problem

I have to describe the lifestyle of students using quantitative data. For example, I have to report how much time a student spends in the classroom (A), how much time a student spends in the library (B), and so on. Some of the data sets (e.g. A) contain outliers. Since I believe that extreme users are representative of the part of the population that behaves in an extreme way, I do not want to remove the outliers, and I want to describe the students' lifestyle using the mean computed without removing them.

What creates the problem?

This paper says the mean is sensitive to outliers, and many other articles say that you should remove outliers when calculating the mean. However, I have never read a research paper in which the researchers used the median when describing lifestyle. For instance, I have never read a paper where researchers said "students spend 1 hour in the library" where 1 hour was the median.

My Question

I will show the median for each type of data (e.g. time spent in the library, time spent in the classroom) along with the mean in a table. Without removing the outliers, would it be reasonable to describe the students' lifestyle everywhere except the table using only the mean?


First off, kudos on recognising that outliers should be retained!

You seem to be most interested in how to characterize the population (students). Means and medians are both commonly used to convey what a 'typical value' of the population (loosely speaking) is. If the distribution is highly skewed and the mean does not capture a 'typical value' well, you may not want to use it (though there are exceptions to this). In this case, it's quite reasonable to use the median instead. Alternatively, you can use the mean along with confidence intervals (which can be asymmetric and convey a skewed distribution). Another common strategy that may be less useful to you in this specific case is to transform (e.g. take the log) the data so that it becomes less skewed; this is frequently a good way to plot data that is highly skewed.
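To make the options above concrete, here is a minimal Python sketch that computes the mean, the median, a percentile-bootstrap confidence interval for the mean, and the back-transformed mean of the logged data (the geometric mean). The `library_hours` values are made up for illustration, and the helper `bootstrap_mean_ci` is a hypothetical name. The percentile bootstrap is just one of several ways to get an interval that is allowed to be asymmetric.

```python
import math
import random
import statistics

# Hypothetical daily library hours for 12 students (made-up, right-skewed).
library_hours = [0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0, 6.0, 9.0]


def bootstrap_mean_ci(values, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


mean_hours = statistics.fmean(library_hours)
median_hours = statistics.median(library_hours)
# Mean on the log scale, transformed back: this is the geometric mean.
geo_mean_hours = math.exp(statistics.fmean(math.log(x) for x in library_hours))
ci_low, ci_high = bootstrap_mean_ci(library_hours)

print(f"mean = {mean_hours:.2f} h, median = {median_hours:.2f} h")
print(f"geometric mean = {geo_mean_hours:.2f} h")
print(f"95% bootstrap CI for the mean: [{ci_low:.2f}, {ci_high:.2f}] h")
```

With right-skewed data like this, the median and the geometric mean sit below the arithmetic mean, which is the gap the paragraph above is describing. Note that the log transform only works for strictly positive values; zero hours would need separate handling.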

Bottom line: it depends on what you want to communicate. In your case, I would use the median, but be sure to make this clear so your audience understands what is being discussed.

mkt
October 09, 2019 14:50 PM

Removal of outliers for computing the mean is certainly not mandatory, and may actually in some cases not make much of a difference (if the outliers are not that far out, and their percentage is small). Surely removal of outliers is more sensible if there are reasons to believe that these observations are erroneous, but you seem to be sure that yours are not, which is a good reason not to remove them.
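One way to see whether the outliers make much of a difference is to compute the mean both ways and compare, purely as a sensitivity check and not as a reason to drop anything. The sketch below uses made-up `classroom_hours` and flags points with the common 1.5 × IQR rule; that rule is my assumption here, not something implied by your data, and the quartiles depend on the method used by `statistics.quantiles`.

```python
import statistics

# Hypothetical daily classroom hours for 12 students (made-up values).
classroom_hours = [4.0, 4.5, 5.0, 5.0, 5.5, 5.5, 6.0, 6.0, 6.5, 7.0, 7.0, 12.0]

# Quartiles using the default ("exclusive") method of statistics.quantiles.
q1, _, q3 = statistics.quantiles(classroom_hours, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are only *flagged*; nothing is deleted from the data.
flagged = [x for x in classroom_hours if x < lower_fence or x > upper_fence]
kept = [x for x in classroom_hours if lower_fence <= x <= upper_fence]

mean_all = statistics.fmean(classroom_hours)
mean_without_flagged = statistics.fmean(kept)

print(f"flagged as outliers: {flagged}")
print(f"mean with all data: {mean_all:.2f} h")
print(f"mean without flagged points: {mean_without_flagged:.2f} h")
print(f"difference: {mean_all - mean_without_flagged:.2f} h")
```

If the difference is small relative to what you want to say, reporting the mean of the full data is unlikely to mislead anyone.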

If in fact the mean is strongly influenced by the outliers, I think it's better to comment using more than one number to get the whole picture. Using mean and median is one option, but maybe in your case it is even more sensible to say something like "the mean is 3 hours, in fact the vast majority is below 3.5 hours, but some 5% spend more than 7 hours per day XXX".
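A sentence like that is easy to produce from the raw data. Here is a rough Python sketch with made-up numbers for 20 students; the thresholds of 3.5 and 7 hours and the function name `describe_hours` are only illustrative choices, and you would pick cut-offs that make sense for your own variables.

```python
import statistics

# Hypothetical daily hours for 20 students (made-up values with one extreme user).
hours = [1.5, 1.5, 2.0, 2.0, 2.0, 2.0, 2.5, 2.5, 2.5, 2.5,
         3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 12.0]


def describe_hours(values, typical_max=3.5, heavy_min=7.0):
    """Summarise a skewed variable with several numbers instead of one."""
    n = len(values)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Share of students below the "typical" cut-off.
        "share_below_typical": sum(v < typical_max for v in values) / n,
        # Share of heavy users above the upper cut-off.
        "share_above_heavy": sum(v > heavy_min for v in values) / n,
    }


summary = describe_hours(hours)
print(
    f"The mean is {summary['mean']:.1f} hours; "
    f"{summary['share_below_typical']:.0%} of students are below 3.5 hours, "
    f"but {summary['share_above_heavy']:.0%} spend more than 7 hours per day."
)
```

This keeps the outliers in the data and in the description, while making it clear that the mean alone does not describe the typical student.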

Lewian
October 09, 2019 15:02 PM