Normalization in correlation matrices

by Emilie   Last Updated July 12, 2019 08:19 AM

I am investigating the correlations among 13 categorical variables (1401 observations). To build a correlation matrix, I coded each variable as 1 or 0 depending on its level. I then assembled the data matrix and calculated the correlation coefficients.

The problem is that some of my variables (events) occur very rarely compared to others, and I think this gives me misleading results. I do not know how to account for how often each variable occurs. Should I remove the rarest variables? Should I normalize the calculated correlation coefficients against the occurrence of each variable?

Thanks a lot, Emilie



Answers 1


So if I am reading your question correctly, you are getting a lot of 0 observations for some variables? That isn't a matter of misinformation or possible bias. If a variable elicits mostly false responses, say 95%, then, assuming you have enough samples, this most likely means that the population P(0) ≈ 0.95. It isn't missing data, just a false (zero) binary response.
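As a rough illustration of the "sufficient samples" point, here is a minimal sketch that estimates P(0) for one indicator variable and attaches a standard error to it. The counts (70 events out of 1401 observations) and the helper name `zero_rate` are made up for the example; they are not taken from the question's data.

```python
import math

def zero_rate(values):
    """Return the share of zeros and its binomial standard error."""
    n = len(values)
    p0 = sum(1 for v in values if v == 0) / n
    se = math.sqrt(p0 * (1 - p0) / n)
    return p0, se

# Hypothetical variable: 1401 observations, 70 of which are events (1s).
responses = [0] * 1331 + [1] * 70
p0, se = zero_rate(responses)
print(f"estimated P(0) = {p0:.3f}, standard error = {se:.4f}")
```

With this many observations the standard error is well under one percentage point, so a zero rate near 0.95 is a fairly stable estimate rather than a sign of missing data.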

You shouldn't have a problem with your correlation matrix. Lots of "zeroes" shared between categories just means that those variables have a low probability of occurrence (assuming you collected the data correctly and set up the matrix accordingly), and that they are correlated in the sense that both are usually absent together. Say, for instance, you have a dataset of binary occurrences of tsunamis, earthquakes (say, above magnitude 7), and tornadoes. This isn't a meteorology or seismology forum, but bear with me. Many tornadoes will happen without a single occurrence of the other two. Obviously, when an earthquake happens, a tsunami is very likely, depending on the location. Despite having lots of "zeroes" for tsunamis and quakes, we have all the info we need: corr(ts, eq) will likely be close to 1, while corr(ts, tor) and corr(eq, tor) will be virtually zero, barring any data taken after the Second Coming of Christ, when all hell breaks loose.
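The tsunami/earthquake/tornado example can be sketched with simulated data. Everything below is an assumption made for illustration: the event probabilities (5% for quakes, 90% for a tsunami given a quake, 30% for tornadoes), the function name `simulate_events`, and the seed are invented, and only the sample size of 1401 comes from the question.

```python
import numpy as np

def simulate_events(n=1401, seed=0):
    """Simulate 0/1 indicators for tsunami, earthquake and tornado."""
    rng = np.random.default_rng(seed)
    quake = rng.random(n) < 0.05      # rare event
    # A tsunami follows most quakes and almost never occurs otherwise.
    tsunami = np.where(quake, rng.random(n) < 0.9, rng.random(n) < 0.002)
    tornado = rng.random(n) < 0.30    # more common, independent of the others
    return np.column_stack([tsunami, quake, tornado]).astype(int)

X = simulate_events()
# Pearson correlation between the 0/1 columns (columns are variables).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

Under these assumptions the tsunami/earthquake entry comes out high even though both columns are mostly zeros, while the entries involving tornadoes stay near zero. One caveat: a column that is entirely zeros has no variance, so its correlations are undefined and show up as NaN.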

Tanner
July 12, 2019 07:52 AM
