I am a bit confused about how to interpret correlation coefficient results. I am aware that there are numerous questions about the differences between Pearson, Spearman, and Kendall, but I am more interested in their respective relationship to linearity.
Let's assume that Pearson's test gives p = 0.000012. There is a correlation between the two variables that is most probably not due to chance. However, this assumes linearity (and homoscedasticity) and is prone to errors when the data contain outliers.
Let's also assume that we draw a scatter plot and find that, indeed, there are outliers in our data. To minimise the effect of outliers, we run a Kendall test. Here we also find a small p and a positive tau. Kendall (and Spearman) do not assume linearity, hence their effectiveness when dealing with outliers. But what are the consequences for trying to fit a line to the data?
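To make the "does not assume linearity" point concrete, here is a minimal sketch in plain Python. The data (an exponential curve) and the helper names `pearson_r` and `kendall_tau` are invented for illustration; the tau here is the simple tau-a variant without tie correction, which may differ from what a statistics package reports when ties are present.

```python
from itertools import combinations
from math import exp, sqrt


def pearson_r(x, y):
    """Pearson's r: measures the strength of a *linear* relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(zip(x, y), 2))
    score = 0
    for (x1, y1), (x2, y2) in pairs:
        prod = (x1 - x2) * (y1 - y2)
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
        score += (prod > 0) - (prod < 0)
    return score / len(pairs)


# Hypothetical data: strictly increasing, but clearly not linear.
x = list(range(1, 11))
y = [exp(v) for v in x]

r = pearson_r(x, y)
tau = kendall_tau(x, y)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```

On such data every pair of points is concordant, so tau reaches its maximum even though the points do not lie on a straight line, while r stays below 1. In other words, tau only uses the ordering of the values.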
If we have normally distributed, linear data (the ideal case for Pearson), we can fit all data points on a linear curve (cf. for instance regplot() of the Python package seaborn). But if we have outliers, and Pearson is not a viable option, is there still any assumption of linearity with Kendall or Spearman? Does it still make sense to try to fit a line, or any curve for that matter, to the data? Or does the relationship as defined by Kendall or Spearman say nothing about the fit, meaning that it does not make sense to try to plot the data on a curve?
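The situation I have in mind can be sketched as follows. This is only an illustration under made-up assumptions: the data are synthetic (an exact line y = 2x + 1, then the same data with one point replaced by an outlier), and `least_squares_line` is a hypothetical helper implementing the ordinary least-squares fit by hand rather than whatever seaborn does internally.

```python
def least_squares_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical clean data lying exactly on y = 2x + 1.
x = list(range(1, 11))
y_clean = [2 * v + 1 for v in x]

# Same data with the last point replaced by a single large outlier.
y_out = y_clean[:-1] + [100]

slope_clean, intercept_clean = least_squares_line(x, y_clean)
slope_out, intercept_out = least_squares_line(x, y_out)

print(f"clean:   slope = {slope_clean:.2f}, intercept = {intercept_clean:.2f}")
print(f"outlier: slope = {slope_out:.2f}, intercept = {intercept_out:.2f}")

# The ordering of the y values is the same in both cases, so a
# rank-based statistic such as Kendall's tau would not change here.
print("y still increasing:", y_out == sorted(y_out))
```

A single outlier changes the fitted line considerably, yet the ranks of the points are untouched, which is exactly why I am unsure what a rank correlation tells me about fitting a line.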