For a single dataframe, summarise the levels of each categorical column. If two dataframes are supplied, compare the levels of categorical features that appear in both dataframes. For grouped dataframes, summarise the levels of categorical features separately for each group.

inspect_cat(df1, df2 = NULL, include_int = FALSE)

Arguments

df1

A dataframe.

df2

An optional second dataframe for comparing categorical levels. Defaults to NULL.

include_int

Logical flag indicating whether to treat integer columns as categories. Defaults to FALSE.

Value

A tibble summarising or comparing the categorical features in one or a pair of dataframes.

Details

For a single dataframe, the tibble returned contains the columns:

  • col_name, a character vector containing the names of the categorical columns in df1.

  • cnt, an integer column containing the count of unique levels found in each column, including NA.

  • common, a character column containing the name of the most common level.

  • common_pcnt, the percentage of each column occupied by the most common level shown in common.

  • levels, a named list containing relative frequency tibbles for each feature.
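As a rough illustration of how cnt, common and common_pcnt relate to the underlying data, the following base R sketch computes the same kind of summary for one small categorical vector. The vector x is made up for the example, and this is not the package's own implementation, only a sketch assuming that NA counts as a level and stays in the denominator, as the starwars output in the Examples suggests.

```r
# Hypothetical categorical column with one missing value
x <- c("brown", "blue", "brown", NA, "yellow", "brown")

# Frequency table that keeps NA as its own level
tab <- table(x, useNA = "ifany")

cnt <- length(tab)                         # number of unique levels, NA included
common <- names(tab)[which.max(tab)]       # most common level
common_pcnt <- 100 * max(tab) / length(x)  # its share of all rows, in percent

cnt          # 4
common       # "brown"
common_pcnt  # 50
```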

For a pair of dataframes, the tibble returned contains the columns:

  • col_name, character vector containing names of columns appearing in both df1 and df2.

  • jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in relative frequencies of levels in a pair of categorical features. Values near 0 indicate that the two distributions agree, while values near 1 indicate that they differ.

  • pval, the p-value corresponding to a null hypothesis test that the true frequencies of the categories are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.

  • lvls_1, lvls_2, the relative frequency of levels in each of df1 and df2.
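To make the 0 to 1 scale of jsd concrete, here is a minimal base R sketch of the standard Jensen-Shannon divergence using base-2 logarithms. The helper name jsd and the input vectors are assumptions made for illustration; the exact computation inside inspect_cat may differ in detail.

```r
# Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1].
# p and q are relative frequency vectors over the same set of levels.
jsd <- function(p, q) {
  m <- (p + q) / 2
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

jsd(c(0.5, 0.5, 0), c(0.5, 0.5, 0))  # identical distributions: 0
jsd(c(1, 0), c(0, 1))                # no shared levels: 1
```

A level that is present in one dataframe but absent from the other simply enters with relative frequency 0 on the side where it is missing.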

For a grouped dataframe, the tibble returned is as for a single dataframe, but where the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.

Author

Alastair Rushworth

Examples

# Load dplyr for starwars data & pipe
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

# Single dataframe summary
inspect_cat(starwars)
#> # A tibble: 8 × 5
#>   col_name     cnt common    common_pcnt levels           
#>   <chr>      <int> <chr>           <dbl> <named list>     
#> 1 eye_color     15 brown           24.1  <tibble [15 × 3]>
#> 2 gender         3 masculine       75.9  <tibble [3 × 3]> 
#> 3 hair_color    13 none            42.5  <tibble [13 × 3]>
#> 4 homeworld     49 Naboo           12.6  <tibble [49 × 3]>
#> 5 name          87 Ackbar           1.15 <tibble [87 × 3]>
#> 6 sex            5 male            69.0  <tibble [5 × 3]> 
#> 7 skin_color    31 fair            19.5  <tibble [31 × 3]>
#> 8 species       38 Human           40.2  <tibble [38 × 3]>

# Paired dataframe comparison
inspect_cat(starwars, starwars[1:20, ])
#> # A tibble: 8 × 5
#>   col_name      jsd     pval lvls_1            lvls_2           
#>   <chr>       <dbl>    <dbl> <named list>      <named list>     
#> 1 eye_color  0.0936 7.08e- 1 <tibble [15 × 3]> <tibble [8 × 3]> 
#> 2 gender     0.0387 3.38e- 1 <tibble [3 × 3]>  <tibble [2 × 3]> 
#> 3 hair_color 0.261  9.04e- 4 <tibble [13 × 3]> <tibble [10 × 3]>
#> 4 homeworld  0.394  2.21e- 2 <tibble [49 × 3]> <tibble [11 × 3]>
#> 5 name       0.573  9.35e-11 <tibble [87 × 3]> <tibble [20 × 3]>
#> 6 sex        0.0526 5.19e- 1 <tibble [5 × 3]>  <tibble [4 × 3]> 
#> 7 skin_color 0.288  1.58e- 1 <tibble [31 × 3]> <tibble [10 × 3]>
#> 8 species    0.300  7.86e- 2 <tibble [38 × 3]> <tibble [6 × 3]> 

# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_cat()
#> # A tibble: 21 × 6
#> # Groups:   gender [3]
#>    gender    col_name     cnt common     common_pcnt levels           
#>    <chr>     <chr>      <int> <chr>            <dbl> <named list>     
#>  1 feminine  eye_color      6 blue             35.3  <tibble [6 × 3]> 
#>  2 feminine  hair_color     6 brown            35.3  <tibble [6 × 3]> 
#>  3 feminine  homeworld     11 Naboo            17.6  <tibble [11 × 3]>
#>  4 feminine  name          17 Adi Gallia        5.88 <tibble [17 × 3]>
#>  5 feminine  sex            2 female           94.1  <tibble [2 × 3]> 
#>  6 feminine  skin_color     9 light            35.3  <tibble [9 × 3]> 
#>  7 feminine  species        8 Human            52.9  <tibble [8 × 3]> 
#>  8 masculine eye_color     13 brown            22.7  <tibble [13 × 3]>
#>  9 masculine hair_color    10 none             47.0  <tibble [10 × 3]>
#> 10 masculine homeworld     44 Tatooine         12.1  <tibble [44 × 3]>
#> # … with 11 more rows
#> # ℹ Use `print(n = ...)` to see more rows