Marketers often undertake large survey projects with little forethought about how they will analyze the data. Compounding the problem is a general lack of interest in cleaning the data they collect.
Data cleaning isn’t really optional. Without it, your quantitative data may be tainted, and your decisions may rest on inaccurate information.
Identifying statistical outliers is a key part of data cleaning, and that’s what we’re going to cover here. We’ll discuss how we identify an outlier in relation to the study’s goals and the kind of data collected, and what to do with an outlier once identified (to omit it or leave it in your results).
Identifying Statistical Outliers in Your Survey Data
Data points that lie outside of the trend set by the majority of other values are typically easy to distinguish when the data is represented visually in a graph.
For example, the day you get 139 trial signups on your marketing site when the daily median is closer to 60 would be an obvious outlier, right?
It’s tough to say without doing a little simple math first. (Notice that the example compares against the median rather than the average. An average can be pulled up or down by an outlier, heavily so when the sample is small.)
How to Calculate the Median
Start by taking your sample and ordering the observations from lowest to highest. As an example, we’ll stick with the trial signup hypothetical. In this case, we have a sample of 13 days and the signups from those days. After being rearranged from smallest to largest (so the day numbers below are positions in the sorted list, not calendar order), they look like this:
Day 1: 32
Day 2: 45
Day 3: 49
Day 4: 52
Day 5: 59
Day 6: 62
Day 7: 63 <-median
Day 8: 67
Day 9: 68
Day 10: 71
Day 11: 72
Day 12: 74
Day 13: 139
The median in this data set is Day 7, with a value of 63 trial signups. If you happen to have an even number of observations, the median is the average of the two values closest to the middle. Now that we have the median for this sample, we’ll label it Q2. It sits between Q1 and Q3, the lower and upper quartiles.
Q2 = 63
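The median steps above can be sketched in a few lines of Python. This is only an illustration: the names `signups` and `sample_median` are made up for this example, and the list simply repeats the 13 values from the table.

```python
from statistics import median

# The 13 daily signup counts from the example, in any order.
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

def sample_median(values):
    """Middle value of the sorted sample, or the average of the two
    middle values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

q2 = sample_median(signups)
print(q2)  # 63

# The standard library's statistics.median follows the same rule.
print(median(signups))  # 63
```

With an even-sized sample such as `[1, 2, 3, 4]`, the same function averages the two middle values and returns 2.5.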
Calculate the Lower Quartile
Similar to the median (Q2), the lower quartile (Q1) is the middle observation of the lower half of the sample, not counting the median itself. With an even number of days (6) below the median, we’ll have to average days 3 and 4 (49 and 52 respectively). That makes our lower quartile (Q1) 50.5.
Q1 = 50.5
Calculate the Upper Quartile
Following the same steps for the upper half of the sample (days 8 through 13), we average days 10 and 11 (71 and 72 respectively). This gives us 71.5 for Q3.
Q3 = 71.5
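As a rough sketch, the quartile steps can be written as a median of each half of the sorted sample, with the overall median left out of both halves when the count is odd. The function names here are hypothetical. Other quartile conventions exist (spreadsheet functions and statistics packages often interpolate differently), so a different tool may return slightly different values for Q1 and Q3.

```python
def median_of(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the 'median of each half' rule.

    When the sample size is odd, the median observation is excluded
    from both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of(lower), median_of(ordered), median_of(upper)

signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]
q1, q2, q3 = quartiles(signups)
print(q1, q2, q3)  # 50.5 63 71.5
```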
Calculate the Interquartile Range
The idea behind the interquartile range (IQR) is that once you know the distance between Q1 and Q3 (71.5 - 50.5 = 21 in this example), you can quickly set boundaries known as ‘fences’ to sift for statistical outliers. Observations that fall outside the inner fence but inside the outer fence are known as minor statistical outliers, while observations that fall outside the outer fence are known as major statistical outliers.
Interquartile range = Q3 - Q1 = 21
There are two sets of fences: the inner fence and the outer fence. To calculate the inner fence, we multiply the interquartile range by 1.5, then subtract the result from Q1 for the lower bound and add it to Q3 for the upper bound. To calculate the outer fence, we follow the same steps, but multiply by 3.
21 x 1.5 = 31.5
Q1-31.5 = 19, Q3+31.5 = 103
Inner fence = 19 to 103
21 x 3 = 63
Q1-63 = -12.5, Q3+63 = 134.5
Outer fence = -12.5 to 134.5
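The fence arithmetic above is the same calculation run twice with a different multiplier, which a small helper makes explicit. This is a minimal sketch; `fences` is a name chosen for this example, and the quartile values are the ones worked out by hand above.

```python
def fences(q1, q3, multiplier):
    """Return (lower, upper) bounds placed `multiplier` interquartile
    ranges below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

q1, q3 = 50.5, 71.5

inner = fences(q1, q3, 1.5)
outer = fences(q1, q3, 3)

print(inner)  # (19.0, 103.0)
print(outer)  # (-12.5, 134.5)
```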
Now that we have our inner and outer fences, we can clearly see that the lowest of our observations, Day 1 with 32 signups, is well within the inner fence and is not considered an outlier. At the high end, however, Day 13 with 139 signups is above the inner fence (103) and also above the outer fence (134.5). This makes Day 13 a major outlier.
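To check every observation rather than only the extremes, the comparison against the two fences can be sketched as below. The fence values are the ones from this example, hard-coded as constants, and `classify` is a hypothetical helper name.

```python
# Fences from the worked example: (lower bound, upper bound).
INNER = (19.0, 103.0)
OUTER = (-12.5, 134.5)

def classify(value, inner=INNER, outer=OUTER):
    """Label a value as a major outlier, a minor outlier, or neither."""
    if value < outer[0] or value > outer[1]:
        return "major outlier"
    if value < inner[0] or value > inner[1]:
        return "minor outlier"
    return "not an outlier"

signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

# Keep only the observations that land outside the inner fence.
flagged = {v: classify(v) for v in signups if classify(v) != "not an outlier"}
print(flagged)  # {139: 'major outlier'}
```

A day with, say, 110 signups would sit between the two upper bounds and come back as a minor outlier.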
You’ve Identified the Statistical Outliers – Now What?
This is where a very objective process begins to take on a more subjective feel. Even though you’ve clearly labeled the observations that are statistical outliers within the data set, whether to omit one isn’t a black-and-white issue, especially since omission may be seen as a form of data tampering.
Things to consider:
- Was the outlier caused by error (human error, process error, calculation error, etc.)? If an inaccuracy is to blame, omission is generally a good idea. If not, the outlier may provide valuable insight, and including it may prove important.
- Will the outlier’s inclusion skew the average? If so, it should probably be removed. If not, removing the outlier may be less crucial to forming an accurate picture.
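One way to answer the second question is to compute the average with and without the flagged observation and compare the shift against the median. The sketch below does this for the signup example; the variable names are illustrative only.

```python
from statistics import mean, median

signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

# Drop the observation flagged as a major outlier (139 signups).
without_outlier = [v for v in signups if v != 139]

mean_with = round(mean(signups), 1)
mean_without = round(mean(without_outlier), 1)

print(mean_with, median(signups))                 # 65.6 63
print(mean_without, median(without_outlier))      # 59.5 62.5
```

In this sample, removing one day moves the average by about six signups while the median moves by half a signup, which is the sensitivity that makes the median the safer reference point.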
There are several methods for determining statistical outliers, such as Chauvenet’s criterion and Grubbs’ test. The approach here is certainly not the only way to identify an outlier, but if you need a simple, fast calculation based on the median and quartiles, the method outlined here will serve you well.