Calculating and visualising correlation coefficients with
inspectdf (and why correlations matrices make life hard)
In a previous post, we explored categorical data using the
In this post, we tackle a different exploratory problem of calculating
and visualising correlation coefficients. To install
CRAN, you’ll first need to run:
We’ll begin the tutorial by loading the
packages, the latter we’ll need for some dataframe manipulation.
For this walk-through, we’ll explore the
storms dataset which comes
dplyr package and has many numeric columns. The data includes
the positions and attributes of 198 tropical storms, measured every six
hours during the lifetime of a storm.
# check out the storms dataset ?storms
What’s wrong with
Most R users will be familiar with the built-in
cor() which can be used to produce a matrix of correlation
coefficients of pairs of numeric variables. So why not just use this?
Here’s a short list of pain points that occur when using this function:
cor() requires numeric inputs only
Correlations are only defined for numeric pairs of variables, so perhaps
this shouldn’t be a surprise. But it means we can’t simply pass a
dataframe with mixed types to
cor() and expect that it will be smart
enough to return correlations for just the numeric columns. Consequently
## Error in cor(storms): 'x' must be numeric
2. Correlation matrices are hard to read
It isn’t hard get what we want from
cor() by first selecting the
numeric columns using a bit of
cor(storms %>% select_if(is.numeric))
## year month day hour ## year 1.000000000 -0.011488006 0.0183703369 0.0015741629 ## month -0.011488006 1.000000000 -0.1830702018 -0.0051201358 ## day 0.018370337 -0.183070202 1.0000000000 0.0007164624 ## hour 0.001574163 -0.005120136 0.0007164624 1.0000000000 ## lat -0.121252667 -0.065922836 -0.0508598742 0.0026823666 ## long 0.060387523 0.048382680 0.0406477301 -0.0091876627 ## wind 0.048966015 0.126682358 -0.0064971154 0.0018333102 ## pressure -0.072615741 -0.134238300 -0.0010113895 0.0016030589 ## ts_diameter NA NA NA NA ## hu_diameter NA NA NA NA ## lat long wind pressure ## year -0.121252667 0.060387523 0.048966015 -0.072615741 ## month -0.065922836 0.048382680 0.126682358 -0.134238300 ## day -0.050859874 0.040647730 -0.006497115 -0.001011389 ## hour 0.002682367 -0.009187663 0.001833310 0.001603059 ## lat 1.000000000 -0.104014683 0.076141764 -0.103772744 ## long -0.104014683 1.000000000 0.004737422 0.058467333 ## wind 0.076141764 0.004737422 1.000000000 -0.942249266 ## pressure -0.103772744 0.058467333 -0.942249266 1.000000000 ## ts_diameter NA NA NA NA ## hu_diameter NA NA NA NA ## ts_diameter hu_diameter ## year NA NA ## month NA NA ## day NA NA ## hour NA NA ## lat NA NA ## long NA NA ## wind NA NA ## pressure NA NA ## ts_diameter 1 NA ## hu_diameter NA 1
The result is a matrix of pairwise correlations. There are several problems with this:
- Matrices are great for linear algebra but terrible for visual inspection. This particular matrix is wide and has been truncated and spread over multiple lines.
- It isn’t easy to tell which variables are most or least correlated by eye-balling this matrix, it’s a jumble of numbers and the row and column indices aren’t easy to follow.
- Nearly half of the output is totally unnecessary: correlation matrices are always symmetric, which means that you only need about half of what is printed.
- It’s tricky to do any further analysis of the coefficients in this format - a dataframe would be handy!
cor() doesn’t produce confidence intervals
If possible, we should try to interpret point estimates in the context of their sampling distribution, for example by considering a confidence interval.
Confidence intervals aren’t available using
cor(), although can be
cor.test(). A big draw back here is that intervals and
perform hypothesis tests can only be performed one at a time - we may
want this for many (or all) correlation coefficients.
cor.test() don’t provide visualisation methods out of the box
Tables are all very well, but it’s much easier to use graphics to visually interrogate correlations. There are many other packages that do help with this, but in general they use a matrix or grid plot with coloured cells to display correlations which are typically messy and difficult to read.
inspect_cor() to calculate correlations
inspect_cor() attempts to address some of the issues above. To
calculate correlations for the
storms data, simply run
storms %>% inspect_cor()
## # A tibble: 45 x 6 ## col_1 col_2 corr p_value lower upper ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 pressure wind -0.942 0. -0.945 -0.940 ## 2 hu_diameter pressure -0.842 0. -0.853 -0.831 ## 3 hu_diameter wind 0.774 0. 0.758 0.788 ## 4 hu_diameter ts_diameter 0.684 0. 0.663 0.704 ## 5 ts_diameter pressure -0.683 0. -0.703 -0.663 ## 6 ts_diameter wind 0.640 0. 0.617 0.662 ## 7 ts_diameter lat 0.301 1.25e-73 0.266 0.335 ## 8 day month -0.183 3.59e-76 -0.205 -0.161 ## 9 hu_diameter lat 0.164 1.59e-22 0.127 0.201 ## 10 ts_diameter month 0.139 1.67e-16 0.102 0.176 ## # … with 35 more rows
The result is tabular rather than a matrix. Together, the first two
columns contain the names of every unique pair of numeric columns, while
corr column contains the correlation coefficients. For example,
the first row says that the correlation between
about -0.942. The rows are arranged in descending order of the
absolute correlation - making it easy to see which pairs are most
p_value column contains p-values associated with the null
hypothesis that the true correlation coefficient is 0. The
upper columns contain the lower and upper reaches of a 95% confidence
interval. In this case, the confidence interval for the correlation
(-0.945, -0.940). The interval type
can be changing the
alpha argument in
inspect_cor(), for example 90%
confidence intervals can be generated using
show_plot() to visualise correlation coefficients
The dataframe of coefficients above is already a bit easier to handle
cor()’s matrix output. We can go further and visualise these
storms %>% inspect_cor() %>% show_plot()
Some key points
- Each row in the plot corresponds to a unique pair of numeric columns, the correlation coefficient is show as a black vertical line.
- The gray and pink bars around the coefficients are the confidence intervals.
- The gray bars are confidence intervals that straddle 0 (also shown by the long vertical dashed line) indicating that the true coefficient is not significantly different to 0.
A side note that is not specific to
inspect_cor() is that we should be
careful when interpreting the significance of individual coefficients
when there are many correlation coefficients overall. For example, if
alpha = 0.05 and all of the true coefficients are 0, we’d still expect
to see 1 in 20 significant coefficients just by chance.
show_plot() to visualise the correlation with a single feature
Another common exploratory step is to assess the linear association between possible predictor variables and a target variable, often as a precursor to regression analysis or building a predictive model.
As an example, suppose we’d like to see which features of a storm are
most strongly correlated with
wind, the maximum sustained wind speed
of the storm. We don’t need to calculate all correlation coefficients
for this (for big data sets this is time consuming), only the ones that
inspect_cor() this is also straightforward, by simply adding the
with_col = "wind":
storms %>% inspect_cor(with_col = "wind") %>% show_plot()
The strongest association here is with
pressure, the air pressure at
the storm’s center. I have very little meteorological experience but it
seems sensible that those should be strongly associated.