Why it is bad to use z-scores to detect outliers and why you should use median absolute deviation instead

Consider the small set of numbers assigned to the variable `scores` below. Just by eyeballing the data, it looks like there’s one outlier: 1000.

```
scores <- c(-3, 1, 3, 3, 6, 8, 10, 10, 1000)
mean(scores) # mean
```

```
[1] 115.3333
```

```
sd(scores) # sd
```

```
[1] 331.7778
```

```
boxplot(scores) # 1000 looks like an extreme outlier
```

What are the standard or z-scores for each value?

\[z_{i} = \frac{x_{i}-\overline{x}}{\sigma}\] where \(z_{i}\) is the z-score for a particular score, \(x_{i}\) is a particular score, \(\overline{x}\) is the mean of all scores, and \(\sigma\) is the standard deviation of all scores.

The equation above expressed in code:

```
scores_z <- (scores - mean(scores)) / sd(scores)
scores_z # 1000 has a z-score < 3
```

```
[1] -0.3566644 -0.3446082 -0.3385800 -0.3385800 -0.3295378 -0.3235097
[7] -0.3174816 -0.3174816 2.6664433
```

The z-score of the largest value is 2.6664433, which is relatively small: researchers tend to treat scores as outliers only if their z-scores are 3 or larger in absolute value.
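Part of the reason the z-score stays so small is a general property of z-scores computed with the sample standard deviation: in a set of n values, no z-score can exceed (n − 1)/√n. This bound is not part of the original analysis; it is added here as a rough sketch, and `largest_z()` is a hypothetical helper written only for this illustration.

```r
scores <- c(-3, 1, 3, 3, 6, 8, 10, 10, 1000)
n <- length(scores)

# Upper bound on any z-score when sd() divides by n - 1
max_possible_z <- (n - 1) / sqrt(n)
max_possible_z

# Hypothetical helper: z-score of the largest value in x
largest_z <- function(x) max((x - mean(x)) / sd(x))

largest_z(scores)                   # already very close to the bound
largest_z(replace(scores, 9, 1e6))  # a far more extreme value still stays below 3
```

With only nine values, then, a cut-off of 3 can never flag anything, no matter how extreme the largest value is.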

So we seem to have a problem here: By eyeballing the scores, we intuitively know that 1000 should be an outlier, but the z-score outlier detection approach suggests 1000 isn’t an outlier and we shouldn’t remove it. Of course, you could set your exclusion criterion to “scores with z-scores 2.5 (rather than 3.0) or greater will be considered outliers”. But changing the criterion arbitrarily doesn’t address the main problem:

- extremely negative/positive scores bias the mean and standard deviation, affecting the resulting z-score

```
mean(scores) # original mean
```

```
[1] 115.3333
```

```
mean(scores[1:8]) # mean after excluding extreme value
```

```
[1] 4.75
```

```
sd(scores) # original sd
```

```
[1] 331.7778
```

```
sd(scores[1:8]) # sd after excluding extreme value
```

```
[1] 4.590363
```
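To see how much the extreme value masks itself, here is a small sketch that computes the z-score of 1000 twice: once using the mean and standard deviation of all nine values, and once using the mean and standard deviation of the other eight. The variable names (`others`, `z_with_outlier`, `z_without_outlier`) are made up for this illustration.

```r
scores <- c(-3, 1, 3, 3, 6, 8, 10, 10, 1000)
others <- scores[1:8]  # the eight values that are not extreme

# z-score of 1000 when it contributes to the mean and sd (about 2.67)
z_with_outlier <- (1000 - mean(scores)) / sd(scores)

# z-score of 1000 relative to the other eight values (in the hundreds)
z_without_outlier <- (1000 - mean(others)) / sd(others)

c(with_outlier = z_with_outlier, without_outlier = z_without_outlier)
```

The same value looks unremarkable or wildly extreme depending on whether it is allowed to inflate the mean and standard deviation it is judged against.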

To make it easier to compute z-scores, detect outliers, and remove extreme values (based on your cut-off, 1.96, 2.5 or 3 or whatever), I’ve created the function `outliersZ()`, which is available in my `hausekeep` package. When you run the function, it tells you how many outliers were detected and what they’ve been replaced by (default replaces them with `NA`).

```
library(hausekeep)
outliersZ(scores, zCutOff = 3.0) # replace values with z-scores greater than ± 3
```

```
[1] -3 1 3 3 6 8 10 10 1000
```

```
outliersZ(scores, showZValues = TRUE) # show z values (default = FALSE)
```

```
[1] -0.38 -0.37 -0.36 -0.36 -0.35 -0.34 -0.34 -0.34 2.83
```
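Note that these z-scores differ slightly from the ones computed by hand earlier (2.83 rather than 2.67 for the value 1000). The numbers shown are what you get if the standard deviation divides by n instead of n − 1. Whether `outliersZ()` actually does this internally is an assumption on my part, inferred only from the output above; the sketch below simply reproduces the printed values in base R.

```r
scores <- c(-3, 1, 3, 3, 6, 8, 10, 10, 1000)
n <- length(scores)

# Standard deviation with n (not n - 1) in the denominator
sd_n <- sqrt(sum((scores - mean(scores))^2) / n)

# z-scores based on that standard deviation, rounded to 2 decimals
z_n <- round((scores - mean(scores)) / sd_n, 2)
z_n
```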

```
outliersZ(scores, zCutOff = 0.35) # note that default zCutOff is 1.96
```

```
[1] NA NA NA NA 6 8 10 10 NA
```

```
outliersZ(scores, replaceOutliersWith = -9999) # replace outlier values with -9999
```

```
[1] -3 1 3 3 6 8 10 10 -9999
```

For more information, type `?outliersZ` in your console. If you want to see how the `outliersZ()` function is defined, just type `outliersZ` in your console and you’ll see the source code.

Instead of using z-scores to detect outliers (which is problematic for the reasons shown above), we can use a simple and robust alternative that isn’t influenced by extreme outlier values: median absolute deviation (Leys et al. 2019, 2013). See the Wikipedia article on median absolute deviation.

Before we can compute the median absolute deviation, we need the median:

```
scores_median <- median(scores)
scores_median
```

```
[1] 6
```

Next we subtract the median from each value and get the absolute values:

```
scores_median_absolute_deviation <- abs(scores - scores_median)
scores_median_absolute_deviation
```

```
[1] 9 5 3 3 0 2 4 4 994
```

Next we get the median of these absolute deviations:

```
median_median_absolute_deviations <- median(scores_median_absolute_deviation)
median_median_absolute_deviations
```

```
[1] 4
```

Next, we multiply that median by the constant 1.4826 to get the (scaled) median absolute deviation (MAD):

```
scores_mad <- median_median_absolute_deviations * 1.4826
scores_mad
```

```
[1] 5.9304
```

This value plays the same role as the standard deviation, but it measures spread around the median rather than around the mean.

And why 1.4826? It’s a scaling constant tied to the assumption that the data, apart from any outliers, are normally distributed: multiplying by 1.4826 makes the MAD comparable to the standard deviation for normal data (Rousseeuw and Croux 1993).
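As a quick check on where the constant comes from, 1.4826 is approximately 1 divided by the 75th percentile of the standard normal distribution. The sketch below shows that, shows the `constant` argument of base R’s `mad()`, and uses simulated normal data (the seed, sample size and `sd = 2` are arbitrary choices for illustration) to show that the scaled MAD lands close to the standard deviation.

```r
# The constant is roughly 1 / (75th percentile of the standard normal)
1 / qnorm(0.75)

scores <- c(-3, 1, 3, 3, 6, 8, 10, 10, 1000)
mad(scores, constant = 1)  # unscaled median of the absolute deviations
mad(scores)                # scaled by the default constant = 1.4826

# Simulated normal data with a true standard deviation of 2
set.seed(1)
x <- rnorm(1e5, mean = 0, sd = 2)
c(sd = sd(x), mad = mad(x))  # both should be close to 2
```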

You can actually easily compute the median of the absolute deviations by calling the `mad()` function. I computed it step-by-step above to show you how it’s done. Typically, you can simply call the `mad()` function and provide your raw values as input.

```
mad(scores) # same as our manually computed scores_mad
```

```
[1] 5.9304
```

Finally, you can compute how much each value deviated:

```
scores_deviation <- (scores - scores_median) / scores_mad
scores_deviation # note that the value of 1000 has a huge deviation of 167!
```

```
[1] -1.5176042 -0.8431134 -0.5058681 -0.5058681 0.0000000
[6] 0.3372454 0.6744908 0.6744908 167.6109537
```

Again, the idea is conceptually similar to computing z-scores: for each value, subtract the median from it, and divide by the MAD (the scaled median of the absolute deviations).

Using this robust approach, the largest value 1000 in our set of values (-3, 1, 3, 3, 6, 8, 10, 10, 1000) has a huge deviation of 167.6109537. Regardless of which criterion you use (2.0, 2.5, 3.0 are all common cut-offs), 1000 is so large that it’ll have to be excluded (consistent with our intuition).
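The steps above can be wrapped into a few lines of base R. This is only a minimal sketch: `replace_outliers_mad()` and its arguments are hypothetical names made up for this illustration, and it is not the implementation used in any package.

```r
# Hypothetical helper written for illustration only
replace_outliers_mad <- function(x, cutoff = 3, replace_with = NA) {
  center <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)      # scaled MAD (constant = 1.4826 by default)
  deviation <- (x - center) / spread  # robust analogue of a z-score
  # which() skips missing deviations, so existing NAs are left alone
  x[which(abs(deviation) > cutoff)] <- replace_with
  x
}

scores <- c(-3, 1, 3, 3, 6, 8, 10, 10, 1000)
replace_outliers_mad(scores)                # only 1000 is replaced
replace_outliers_mad(scores, cutoff = 0.6)  # a stricter cut-off removes more values
```

One caveat: if more than half of the values are identical, the MAD is 0 and the deviations become infinite or undefined. The sketch does not handle that case.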

To help detect and remove outliers using this robust approach, I’ve created the function `outliersMAD()`, which is available in my `hausekeep` package. When you run the function, it tells you how many outliers were detected and what they’ve been replaced by (default replaces them with `NA`).

```
library(hausekeep)
outliersMAD(scores, MADCutOff = 3.0) # replace values with deviations greater than ± 3
```

```
[1] -3 1 3 3 6 8 10 10 NA
```

```
outliersMAD(scores, showMADValues = TRUE) # show deviation (default = FALSE)
```

```
[1] -1.52 -0.84 -0.51 -0.51 0.00 0.34 0.67 0.67 167.61
```

```
outliersMAD(scores, MADCutOff = 0.6) # note that default cut-off value is 3.0
```

```
[1] NA NA 3 3 6 8 NA NA NA
```

```
outliersMAD(scores, replaceOutliersWith = -9999) # replace outlier values with -9999
```

```
[1] -3 1 3 3 6 8 10 10 -9999
```

For more information, type `?outliersMAD` in your console. If you want to see how the `outliersMAD()` function is defined, just type `outliersMAD` in your console and you’ll see the source code.

Here’s how you would use the function typically:

```
scores_outliers_removed <- outliersMAD(scores)
scores_outliers_removed
```

```
[1] -3 1 3 3 6 8 10 10 NA
```

```
boxplot(scores_outliers_removed)
```

Here’s another example that compares the two methods on a different set of values:

```
scores <- c(-5, -2, 4, 8, 55, 100)
```

Use z-score method with 1.96 as cut-off:

```
scores_removeoutliers_zscore <- outliersZ(scores) # uses 1.96 as default cutoff
scores_removeoutliers_zscore
```

```
[1] -5 -2 4 8 55 100
```

Use median absolute deviation method with 3.00 as cut-off:

```
scores_removeoutliers_mad <- outliersMAD(scores) # uses 3.00 as default cutoff
scores_removeoutliers_mad
```

```
[1] -5 -2 4 8 NA NA
```
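To see why the two methods disagree on this second set of values, it can help to compute both kinds of deviation by hand in base R. This is a sketch with made-up variable names (`z`, `robust_dev`); it uses `sd()` and `mad()` directly rather than the package functions.

```r
scores <- c(-5, -2, 4, 8, 55, 100)

# z-scores: 55 and 100 inflate the mean and sd they are compared against
z <- (scores - mean(scores)) / sd(scores)
round(z, 2)
any(abs(z) > 1.96)  # FALSE: nothing crosses the 1.96 cut-off

# Robust deviations: the median and MAD are barely affected by 55 and 100
robust_dev <- (scores - median(scores)) / mad(scores)
round(robust_dev, 2)
which(abs(robust_dev) > 3)  # positions 5 and 6, i.e. the values 55 and 100
```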

```
par(mfrow = c(1, 3)) # set figure to plot 1 row, 3 columns
boxplot(scores, main = "raw values")
boxplot(scores_removeoutliers_zscore, main = "after z-score outlier removal")
boxplot(scores_removeoutliers_mad, main = "after mad outlier removal")
```


Leys, Christophe, Marie Delacre, Youri L. Mora, Daniël Lakens, and Christophe Ley. 2019. “How to Classify, Detect, and Manage Univariate and Multivariate Outliers, with Emphasis on Pre-Registration.” *International Review of Social Psychology* 32 (1). https://www.rips-irsp.com/articles/10.5334/irsp.289/.

Leys, Christophe, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. 2013. “Detecting Outliers: Do Not Use Standard Deviation Around the Mean, Use Absolute Deviation Around the Median.” *Journal of Experimental Social Psychology* 49 (4): 764–66. https://www.sciencedirect.com/science/article/pii/S0022103113000668.

Rousseeuw, Peter J., and Christophe Croux. 1993. “Alternatives to the Median Absolute Deviation.” *Journal of the American Statistical Association* 88 (424): 1273–83.


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hauselin/rtutorialsite, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

Lin (2019, Sept. 7). Data science: Use median absolute deviation instead of z-score to detect outliers. Retrieved from https://hausetutorials.netlify.com/posts/2019-10-07-outlier-detection-with-median-absolute-deviation/

BibTeX citation

@misc{lin2019use,
  author = {Lin, Hause},
  title = {Data science: Use median absolute deviation instead of z-score to detect outliers},
  url = {https://hausetutorials.netlify.com/posts/2019-10-07-outlier-detection-with-median-absolute-deviation/},
  year = {2019}
}