Illustrative data: starwars and storms

The examples below use the starwars and storms data from the dplyr package.

# some example data
data(starwars, package = "dplyr")
data(storms, package = "dplyr")

To illustrate comparisons of dataframes, the starwars data are used to produce two new dataframes, star_1 and star_2. Each is a random sample of 50 rows of the original, and star_2 additionally drops the first two columns.

library(dplyr)
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
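Because sample_n() draws rows at random, star_1 and star_2 differ from run to run. The sketch below is one way to make the sampling reproducible and to confirm which columns were dropped. The seed value 42 and the names dropped and again are arbitrary choices made for this illustration, not part of the original example.

```r
library(dplyr)
data(starwars, package = "dplyr")

# Fix the random number generator state so the samples can be reproduced
# (the seed value is an arbitrary choice for this sketch)
set.seed(42)
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)

# Columns present in star_1 but missing from star_2
dropped <- setdiff(names(star_1), names(star_2))
dropped

# Resetting the seed and sampling again gives the same rows as star_1
set.seed(42)
again <- starwars %>% sample_n(50)
identical(star_1$name, again$name)
```

In recent dplyr releases sample_n() is superseded by slice_sample(n = 50), which could be substituted here.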

inspect_num() for a single dataframe

inspect_num() combines some of the functionality of summary() and hist() by returning summaries of numeric columns. It returns standard numerical summaries (min, q1, median, mean, q3, max, sd), as well as the percentage of missing entries (pcnt_na) and a simple histogram (hist).

library(inspectdf)
inspect_num(storms, breaks = 10)
## # A tibble: 10 × 10
##    col_name      min     q1 median    mean     q3    max     sd pcnt_na hist    
##    <chr>       <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <named >
##  1 year       1975   1992   2002   2001.   2011   2020    12.0      0   <tibble>
##  2 month         1      8      9      8.78    9     12     1.26     0   <tibble>
##  3 day           1      8     16     15.8    24     31     8.94     0   <tibble>
##  4 hour          0      6     12      9.12   18     23     6.73     0   <tibble>
##  5 lat           7.2   17.5   24.6   24.8    31.3   51.9   8.51     0   <tibble>
##  6 long       -109.   -80.7  -64.4  -64.1   -48.4   -6    19.7      0   <tibble>
##  7 wind         10     35     45     53.6    65    160    26.2      0   <tibble>
##  8 pressure    882    985    999    992.   1006   1022    19.5      0   <tibble>
##  9 tropicals…    0     60    120    145.    210    870   126.      54.9 <tibble>
## 10 hurricane…    0      0      0     18.1    25    300    35.5     54.9 <tibble>
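To make the meaning of each summary column concrete, the following base R sketch computes the same quantities for a single numeric vector. The helper num_summary() is hypothetical and is not part of inspectdf. It assumes missing values are excluded from every statistic except pcnt_na and uses R's default quantile method, which may not match the package's internals exactly.

```r
# Hypothetical helper (not part of inspectdf): summarise one numeric vector
num_summary <- function(x) {
  # Quartiles with R's default quantile method, ignoring missing values
  qs <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(
    min     = min(x, na.rm = TRUE),
    q1      = qs[1],
    median  = qs[2],
    mean    = mean(x, na.rm = TRUE),
    q3      = qs[3],
    max     = max(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    pcnt_na = 100 * mean(is.na(x))  # percentage of missing entries
  )
}

# airquality ships with base R; its Ozone column contains missing values
ozone_summary <- num_summary(airquality$Ozone)
round(ozone_summary, 2)
```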

The hist column is a list whose elements are tibbles, each containing the relative frequencies of the bins for one feature. These tibbles are used to generate the histograms when the result is passed to show_plot(). For example, the histogram for storms$pressure is

inspect_num(storms)$hist$pressure
## # A tibble: 29 × 2
##    value           prop
##    <chr>          <dbl>
##  1 [880, 885) 0.0000843
##  2 [885, 890) 0.000169 
##  3 [890, 895) 0.000422 
##  4 [895, 900) 0.000253 
##  5 [900, 905) 0.000506 
##  6 [905, 910) 0.000928 
##  7 [910, 915) 0.00110  
##  8 [915, 920) 0.00253  
##  9 [920, 925) 0.00312  
## 10 [925, 930) 0.00371  
## # … with 19 more rows
## # ℹ Use `print(n = ...)` to see more rows
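A table of this shape can be approximated in base R, which may help when reading the value and prop columns. This is only a sketch of the idea of binned relative frequencies, not inspectdf's own code. The break points and the name hist_tbl are assumptions chosen for the illustration, and cut() formats its interval labels slightly differently from the output above.

```r
# Numeric vector from a dataset that ships with base R
x <- airquality$Temp

# Left-closed bins of width 5, analogous to [880, 885), [885, 890), ...
breaks <- seq(55, 100, by = 5)
bins <- cut(x, breaks = breaks, right = FALSE)

# Relative frequency of each bin among the non-missing values
hist_tbl <- data.frame(
  value = levels(bins),
  prop  = as.numeric(table(bins)) / sum(!is.na(bins))
)
hist_tbl
```

The prop column sums to 1 because every non-missing value falls in exactly one bin.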

A histogram is generated for each numeric feature by passing the result to the show_plot() function:

inspect_num(storms, breaks = 10) %>%
  show_plot()

inspect_num() for two dataframes

When a pair of dataframes is compared using inspect_num(), histograms of the numeric features common to both are calculated using identical bins. The list columns hist_1 and hist_2 contain the histograms of the features in the first and second dataframes. Each pair of histograms is compared formally using Fisher’s exact test, and the resulting p-value is reported in the column pval. The jsd column in the output below appears to report the Jensen-Shannon divergence between the two histograms, with values near 0 indicating close agreement.

When the comparison is passed to show_plot(), heat plot comparisons are returned for each numeric column in each dataframe. Where a column is present in only one of the dataframes, grey cells are shown in the comparison. The significance of Fisher’s test is illustrated by coloured vertical bands around each plot: grey if no p-value could be calculated, blue if the histograms are not found to be significantly different, and red otherwise.

inspect_num(storms, storms[-c(1:10), -1])
## # A tibble: 10 × 5
##    col_name                     hist_1            hist_2               jsd  pval
##    <chr>                        <named list>      <named list>       <dbl> <dbl>
##  1 year                         <tibble [23 × 2]> <tibble>         1.13e-5  1   
##  2 month                        <tibble [22 × 2]> <tibble>         3.99e-6  1.00
##  3 day                          <tibble [16 × 2]> <tibble>         1.21e-6  1   
##  4 hour                         <tibble [23 × 2]> <tibble>         9.25e-9  1   
##  5 lat                          <tibble [23 × 2]> <tibble>         3.70e-7  1   
##  6 long                         <tibble [21 × 2]> <tibble>         1.21e-6  1   
##  7 wind                         <tibble [15 × 2]> <tibble>         2.73e-7  1.00
##  8 pressure                     <tibble [29 × 2]> <tibble>         1.33e-6  1   
##  9 tropicalstorm_force_diameter <tibble [18 × 2]> <tibble>         0        1   
## 10 hurricane_force_diameter     <tibble [15 × 2]> <tibble>         0        1

inspect_num(storms, storms[-c(1:10), -1]) %>%
  show_plot()
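The idea of comparing two samples on identical bins can be sketched in base R. The code below is an illustration only and does not reproduce inspectdf's implementation. It assumes the jsd column denotes the Jensen-Shannon divergence (computed here with base-2 logarithms so that it lies between 0 and 1), and the helpers rel_freq() and kl() are hypothetical names introduced for this sketch.

```r
# Two samples to compare, taken from a dataset that ships with base R
x1 <- airquality$Temp[airquality$Month <= 6]
x2 <- airquality$Temp[airquality$Month >= 7]

# Shared breaks computed from the pooled range so both histograms align
breaks <- pretty(range(c(x1, x2), na.rm = TRUE), n = 10)

# Hypothetical helper: relative frequency of each bin
rel_freq <- function(x, breaks) {
  counts <- table(cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE))
  as.numeric(counts) / sum(counts)
}
p <- rel_freq(x1, breaks)
q <- rel_freq(x2, breaks)

# Hypothetical helper: Kullback-Leibler divergence, treating 0 * log(0) as 0
kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))

# Jensen-Shannon divergence: 0 for identical histograms, at most 1
m <- (p + q) / 2
jsd <- 0.5 * kl(p, m) + 0.5 * kl(q, m)
jsd
```

Comparing a sample with itself in this way gives a divergence of exactly 0, which is consistent with the very small jsd values in the output above, where the second dataframe differs from the first by only ten rows.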