(Non-)linearity and correlation (Pearson v. Spearman and Kendall)

by Bram Vanroy   Last Updated May 02, 2018 08:19 AM

I am a bit confused about how to interpret correlation coefficient results. I am aware that there are numerous questions about the differences between Pearson, Spearman, and Kendall, but I am more interested in their respective relationship to linearity.

Let's assume that Pearson's r is 0.578 and p is 0.000012. There is a correlation between the two variables that is most probably not caused by chance. However, Pearson assumes linearity (and homoscedasticity) and is prone to errors when the data contain outliers.

Let's also assume that we draw a scatter plot and find that, indeed, there are outliers in our data. To minimise the effect of outliers, we run a Kendall test. Here we also find a small p and a positive tau. Kendall (and Spearman) do not assume linearity, hence their effectiveness when dealing with outliers. But what are the consequences for trying to fit the data on a line?
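To make the outlier scenario concrete, here is a minimal sketch in plain Python. The data are hypothetical (a perfectly linear series plus one invented extreme point), and `pearson_r` and `kendall_tau` are illustrative helpers written for this example, not library functions. In practice `scipy.stats.pearsonr` and `scipy.stats.kendalltau` would normally be used instead. The tau here is the simple tau-a variant, which assumes no ties.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes no ties; tied pairs would count as neither.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: ten points lying exactly on a line ...
x_clean = list(range(1, 11))
y_clean = [2 * v + 1 for v in x_clean]

# ... plus one invented extreme outlier.
x_out = x_clean + [11]
y_out = y_clean + [-40]

r_clean, tau_clean = pearson_r(x_clean, y_clean), kendall_tau(x_clean, y_clean)
r_out, tau_out = pearson_r(x_out, y_out), kendall_tau(x_out, y_out)

print(f"clean:        r = {r_clean:.3f}, tau = {tau_clean:.3f}")
print(f"with outlier: r = {r_out:.3f}, tau = {tau_out:.3f}")
```

With this particular outlier, Pearson's r changes sign while Kendall's tau stays clearly positive, because tau only counts how many pairs are ordered the same way in both variables and ignores how far apart the values are.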

If we have normally distributed, linear data (the ideal case for Pearson), we can fit all data points on a linear curve (cf. for instance regplot() of the Python package seaborn). But if we have outliers, and Pearson is not a viable option, is there still any assumption of linearity with Kendall or Spearman? Does it still make sense to try to fit the data on a linear curve, or any curve for that matter? Or does the relationship as defined by Kendall or Spearman say nothing about how the data can be fitted, meaning that it does not make sense to try to plot the data on a curve?
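One way to probe this question is to compare the coefficients on data that are monotonic but clearly not linear. The sketch below uses hypothetical data (y = exp(x)), and the helpers `pearson_r`, `ranks` and `spearman_rho` are illustrative names written for this example. Spearman's rho is computed as Pearson's r on the ranks, assuming no ties; `scipy.stats.spearmanr` would be the usual choice in practice.

```python
from math import exp, sqrt


def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: strictly increasing, but exponential rather than linear.
x = list(range(1, 11))
y_curve = [exp(v) for v in x]

rho = spearman_rho(x, y_curve)
r = pearson_r(x, y_curve)

print(f"Spearman rho = {rho:.3f}")
print(f"Pearson r    = {r:.3f}")
```

Here rho reaches its maximum because the ordering of x and y agrees perfectly, while r is noticeably lower because the points do not lie on a straight line. Under these assumptions, a high rank correlation indicates a monotonic relationship but does not by itself indicate which curve, if any, describes the data.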
