Illustrative data: starwars and storms

The examples below use the starwars and storms datasets from the dplyr package.

# some example data
data(starwars, package = "dplyr")
data(storms, package = "dplyr")

To illustrate comparisons of dataframes, use the starwars data to produce two new dataframes, star_1 and star_2, each a random sample of 50 rows of the original; star_2 also drops the first two columns.

library(dplyr)
library(inspectdf)
# two random samples of 50 rows; star_2 also drops the first two columns
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
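Because sample_n() draws rows at random, star_1 and star_2 will differ from run to run. A minimal sketch of making the samples reproducible follows; the seed value is an arbitrary choice made for this illustration and is not part of the original example.

```r
library(dplyr)
data(starwars, package = "dplyr")

set.seed(2024)  # arbitrary seed, chosen only for this illustration
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)

# star_2 has the same number of rows as star_1 but two fewer columns
dim(star_1)
dim(star_2)
```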

inspect_cor() for a single dataframe

inspect_cor() returns a tibble containing Pearson’s correlation coefficient, confidence intervals and \(p\)-values for pairs of numeric columns. The function combines the functionality of cor() and cor.test() in a more convenient wrapper.

inspect_cor(storms)
## # A tibble: 45 × 7
##    col_1                        col_2       corr   p_value  lower  upper pcnt_…¹
##    <chr>                        <chr>      <dbl>     <dbl>  <dbl>  <dbl>   <dbl>
##  1 pressure                     wind      -0.944 0         -0.946 -0.942   100  
##  2 hurricane_force_diameter     pressure  -0.821 0         -0.830 -0.812    45.1
##  3 hurricane_force_diameter     wind       0.754 0          0.742  0.765    45.1
##  4 hurricane_force_diameter     tropical…  0.681 0          0.666  0.695    45.1
##  5 tropicalstorm_force_diameter pressure  -0.678 0         -0.692 -0.663    45.1
##  6 tropicalstorm_force_diameter wind       0.625 0          0.608  0.641    45.1
##  7 tropicalstorm_force_diameter lat        0.318 1.66e-119  0.294  0.342    45.1
##  8 hurricane_force_diameter     lat        0.190 8.83e- 44  0.164  0.215    45.1
##  9 day                          month     -0.174 5.59e- 80 -0.191 -0.156   100  
## 10 tropicalstorm_force_diameter month      0.171 8.89e- 36  0.145  0.197    45.1
## # … with 35 more rows, and abbreviated variable name ¹​pcnt_nna
## # ℹ Use `print(n = ...)` to see more rows

A plot showing the point estimates and confidence intervals is produced by passing the result to show_plot(). Intervals that straddle the null value of 0 are shown in gray.
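The plotting call itself does not appear in this section. A minimal sketch, assuming dplyr and inspectdf are installed and following the same piping pattern used for the two-dataframe comparison further down, might look like this (cor_tbl is a name introduced here for illustration):

```r
library(dplyr)
library(inspectdf)
data(storms, package = "dplyr")

# compute pairwise correlations for the numeric columns of storms
cor_tbl <- inspect_cor(storms)

# draw point estimates with their confidence intervals
cor_tbl %>% show_plot()
```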

Notes:

  • The tibble is sorted in descending order of the absolute coefficient \(|\rho|\).
  • inspect_cor() drops missing values prior to calculation of each correlation coefficient; the pcnt_nna column gives the percentage of rows that remain for each pair.
  • The p_value is associated with the null hypothesis \(H_0: \rho = 0\).
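To make the notes above concrete, the following base-R sketch computes the same kinds of quantities for a single pair of columns using cor.test(). It uses the airquality dataset that ships with R (an assumption made so the example is self-contained) and is only an illustration of the reported quantities, not a description of how inspect_cor() is implemented internally.

```r
# Base-R sketch of the quantities reported for one pair of columns.
# airquality ships with R; its Ozone column contains missing values.
x <- airquality$Ozone
y <- airquality$Temp

# keep only rows where both values are present (pairwise complete)
ok <- complete.cases(x, y)
pcnt_nna <- 100 * mean(ok)

# cor.test() returns the estimate, the p-value for H0: rho = 0,
# and a 95% confidence interval
ct <- cor.test(x[ok], y[ok], method = "pearson")

c(corr     = unname(ct$estimate),
  p_value  = ct$p.value,
  lower    = ct$conf.int[1],
  upper    = ct$conf.int[2],
  pcnt_nna = pcnt_nna)
```

Repeating this for every pair of numeric columns and sorting by the absolute value of corr would give a table with the same shape as the one shown earlier.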

inspect_cor() for two dataframes

When a second dataframe is provided, inspect_cor() returns a tibble that compares the correlation coefficients of the first dataframe to those of the second. The p_value column measures the evidence against the two correlation coefficients being equal.

inspect_cor(storms, storms[-c(1:200), ])
## # A tibble: 45 × 5
##    col_1                        col_2                      corr_1 corr_2 p_value
##    <chr>                        <chr>                       <dbl>  <dbl>   <dbl>
##  1 pressure                     wind                       -0.944 -0.944   0.914
##  2 hurricane_force_diameter     pressure                   -0.821 -0.821   1    
##  3 hurricane_force_diameter     wind                        0.754  0.754   1    
##  4 hurricane_force_diameter     tropicalstorm_force_diame…  0.681  0.681   1    
##  5 tropicalstorm_force_diameter pressure                   -0.678 -0.678   1    
##  6 tropicalstorm_force_diameter wind                        0.625  0.625   1    
##  7 tropicalstorm_force_diameter lat                         0.318  0.318   1    
##  8 hurricane_force_diameter     lat                         0.190  0.190   1    
##  9 day                          month                      -0.174 -0.170   0.743
## 10 tropicalstorm_force_diameter month                       0.171  0.171   1    
## # … with 35 more rows
## # ℹ Use `print(n = ...)` to see more rows
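One standard way to test whether two correlation coefficients are equal is Fisher's z-transformation. The sketch below is offered only as an illustration of that general technique: fisher_z_test() is a hypothetical helper written for this example, it assumes the two samples are independent, and nothing here states that inspect_cor() computes its p_value in exactly this way.

```r
# Hypothetical helper (not part of inspectdf): two-sided test of
# H0: rho_1 = rho_2 for correlations from two independent samples.
fisher_z_test <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                          # Fisher z-transform of r1
  z2 <- atanh(r2)                          # Fisher z-transform of r2
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
  z  <- (z1 - z2) / se
  2 * pnorm(-abs(z))                       # two-sided p-value
}

p_same <- fisher_z_test(0.5, 100, 0.5, 120)  # identical coefficients
p_diff <- fisher_z_test(0.8, 100, 0.2, 100)  # clearly different coefficients
```

With identical coefficients the test statistic is zero and the p-value is 1, which matches the pattern of p_value = 1 for rows in the table above where corr_1 and corr_2 agree.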

To plot the comparison of the top 20 correlation coefficients:

inspect_cor(storms, storms[-c(1:200), ]) %>% 
  slice(1:20) %>%
  show_plot()

Notes:

  • A smaller p_value indicates stronger evidence against the null hypothesis \(H_0: \rho_1 = \rho_2\), suggesting that the true correlation coefficients differ.
  • The visualisation illustrates the significance of the difference using a coloured bar underlay. Coloured bars indicate evidence that the correlations differ, while gray bars indicate no significant difference at the chosen significance level.
  • For a pair of features, if either coefficient is NA, the comparison is omitted from the visualisation.
  • The significance level can be specified using the alpha argument to inspect_cor(). The default is alpha = 0.05.
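As a sketch of the last note, the comparison above can be repeated with a stricter significance level by passing alpha to inspect_cor(). The value 0.01 and the name cmp are arbitrary choices made for this illustration.

```r
library(dplyr)
library(inspectdf)
data(storms, package = "dplyr")

# same comparison as above, with a stricter significance level
cmp <- inspect_cor(storms, storms[-c(1:200), ], alpha = 0.01)

# plot the top 20 pairs
cmp %>%
  slice(1:20) %>%
  show_plot()
```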