Exploring categorical data with inspectdf
inspectdf and what’s it for?
I often find myself viewing and reviewing dataframes throughout the
course of an analysis, and a substantial amount of time can be spent
rewriting the same code to do this.
inspectdf is an R package designed
to make common exploratory tools a bit more useful and easy to use.
In particular, it’s very powerful be able to quickly see the contents of
categorical features. In this article, we’ll summarise how to use the
inspect_cat() function from
inspectdf for summarising and
visualising categorical columns.
First of all, you’ll need to have the
inspectdf package installed. You
can get it from github using
Then load the package in. We’ll also load
dplyr for the
data and for the pipe
library(inspectdf) library(dplyr) # check out the starwars help file ?starwars
Tabular summaries using
starwars data that comes bundled with
dplyr has 7 columns that
have character class, and is therefore a nice candidate for illustrating
the use of
inspect_cat. We can see this quickly using the
inspect_types() function from
starwars %>% inspect_types()
## # A tibble: 4 x 4 ## type cnt pcnt col_name ## <chr> <int> <dbl> <list> ## 1 character 7 53.8 <chr > ## 2 list 3 23.1 <chr > ## 3 numeric 2 15.4 <chr > ## 4 integer 1 7.69 <chr >
inspect_cat() is very straightforward:
star_cat <- starwars %>% inspect_cat() star_cat
## # A tibble: 7 x 5 ## col_name cnt common common_pcnt levels ## <chr> <int> <chr> <dbl> <list> ## 1 eye_color 15 brown 24.1 <tibble [15 × 3]> ## 2 gender 5 male 71.3 <tibble [5 × 3]> ## 3 hair_color 13 none 42.5 <tibble [13 × 3]> ## 4 homeworld 49 Naboo 12.6 <tibble [49 × 3]> ## 5 name 87 Ackbar 1.15 <tibble [87 × 3]> ## 6 skin_color 31 fair 19.5 <tibble [31 × 3]> ## 7 species 38 Human 40.2 <tibble [38 × 3]>
So what does this tell us? Each row in the tibble returned from
inspect_cat() corresponds to each categorical column (
character) in the
cntcolumn tells you how many unique levels there are for each column. For example, there are 15 unique entries in the
commoncolumn prints the most commonly occurring entry. For example, the most common
brown. The percentage occurrence is 24.1% which is shown under
- A full list of levels and occurrence frequency is provided in the
A table of relative frequencies of
eye_color can be retrieved by
## # A tibble: 15 x 3 ## value prop cnt ## <chr> <dbl> <int> ## 1 brown 0.241 21 ## 2 blue 0.218 19 ## 3 yellow 0.126 11 ## 4 black 0.115 10 ## 5 orange 0.0920 8 ## 6 red 0.0575 5 ## 7 hazel 0.0345 3 ## 8 unknown 0.0345 3 ## 9 blue-gray 0.0115 1 ## 10 dark 0.0115 1 ## 11 gold 0.0115 1 ## 12 green, yellow 0.0115 1 ## 13 pink 0.0115 1 ## 14 red, blue 0.0115 1 ## 15 white 0.0115 1
There isn’t anything here that can’t be obtained by using the base
table() function with some post-processing.
some of that functionality and wraps it into a single, convenient
Visualising categorical columns with
An important feature of
inspectdf is the ability to visualise
dataframe summaries. Visualising categories can be challenging, because
categorical columns can be very rich and contain many unique levels. A
simple stacked barplot can be produced using
star_cat %>% show_plot()
star_cat tibble returned by
inspect_cat(), each row of the
plot is a single column, split by the relative frequency of occurrence
of each unique entry.
- Some of the bars are labelled, but in cases where the bars are
small, the labels are not shown. If you encounter categorical
columns with really long strings, labels can be suppressed
show_plot(text_labels = FALSE).
- Missing values or
NAs are shown as gray bars. In this case, there are quite a few
homeworldis not unknown or missing.
Combining rare entries with
Some of the categorical columns like
name seems to have a lot of
unique entries. We should expect this - names often are unique (or
almost) in a small dataset. If we scaled this analysis up to a dataset
with millions of rows, there would be so many names with very small
relative frequencies that the name bars would be very difficult to see.
show_plot() can help with this too!
star_cat %>% show_plot(high_cardinality = 1)
By setting the argument
high_cardinality = 1 all entries that occur
only once are combined into a single group labelled high
cardinality. This makes it easier to see when some entries occur only
once (or extremely rarely).
- In the above, it’s now obvious that no two people in the
starwarsdata share the same name, and that many come from a unique
- By setting
high_cardinality = 2or even greater, it’s possible to group the ‘long-tail’ of rare categories even further. With larger datasets, this becomes increasingly important for visualisation.
- A practical reason to combine rare entries, is plotting speed - it
can take a long time to render a plot with tens of thousands (or
more) unique bars! Using the
high_cardinalityargument can reduce this dramatically.
Playing with color options in
It’s been pointed out that the default
ggplot color theme isn’t
particularly friendly to color-blind audiences. A more color-blind
friendly theme is
available by specifying
col_palette = 1:
star_cat %>% show_plot(col_palette = 1)
I’m also quite fond of the 80s theme by choosing
col_palette = 2:
star_cat %>% show_plot(col_palette = 2)
There are 5 palettes at the moment, so have a play around. Note that the
color palettes have not yet hit the CRAN version of
inspectdf - that
will come soon in an update, but for now you can get them from the
github version of the package using the code at the start of the