Understanding Data Science Tools Usage With Kaggle And Treemapify

4 minute read

Understanding data science tool usage with treemapify and kaggle

This post will explore the Kaggle 2017 Data Science survey a bit and will show off the treemapify package and in particular the treemap geom which is a neat way of presenting percentage or composition data.

The Kaggle ML and Data science survey

You will need to be a member of Kaggle to access the data to reproduce this post. Once you’ve signed up, you can download the survey data from the Kaggle website.

library(tidyverse)
library(treemapify)
library(colorspace)

# load in the kaggle survey results
survey <- read_csv("multipleChoiceResponses.csv", guess_max = 20000)
# have a wee peek at the data
head(survey[, 3:6])
## # A tibble: 6 x 4
##     Age                                     EmploymentStatus StudentStatus
##   <int>                                                <chr>         <chr>
## 1    NA                                   Employed full-time          <NA>
## 2    30                   Not employed, but looking for work          <NA>
## 3    28                   Not employed, but looking for work          <NA>
## 4    56 Independent contractor, freelancer, or self-employed          <NA>
## 5    38                                   Employed full-time          <NA>
## 6    46                                   Employed full-time          <NA>
## # ... with 1 more variables: LearningDataScience <chr>

Kaggle managed to get a big response to their survey with 16716 respondents, and 228 features associated with each of those. There’re a lot of different ways you could turn the data to find different insights about the state of data science: to get started take a look at Kaggle’s own report or have a look at some of the contributed analyses from other Kagglers.

In this post, I’m going to focus on understanding the frequency with which various software tools are used by respondents that identify as data scientists. Frequency of tools usage is contained in columns whose names are prepended with WorkToolsFrequency.

List of data science tools in the survey

There are 50 columns each corresponding to a different analytical tool. The full list is provided below:

# select the tools usage columns
# return the names of the columns
# remove the prepended string `WorkToolsFrequency`
survey %>% 
  select(contains("WorkToolsFrequency")) %>% 
  colnames() %>% 
  gsub("WorkToolsFrequency", "", .)
##  [1] "AmazonML"             "AWS"                  "Angoss"              
##  [4] "C"                    "Cloudera"             "DataRobot"           
##  [7] "Flume"                "GCP"                  "Hadoop"              
## [10] "IBMCognos"            "IBMSPSSModeler"       "IBMSPSSStatistics"   
## [13] "IBMWatson"            "Impala"               "Java"                
## [16] "Julia"                "Jupyter"              "KNIMECommercial"     
## [19] "KNIMEFree"            "Mathematica"          "MATLAB"              
## [22] "Azure"                "Excel"                "MicrosoftRServer"    
## [25] "MicrosoftSQL"         "Minitab"              "NoSQL"               
## [28] "Oracle"               "Orange"               "Perl"                
## [31] "Python"               "Qlik"                 "R"                   
## [34] "RapidMinerCommercial" "RapidMinerFree"       "Salfrod"             
## [37] "SAPBusinessObjects"   "SASBase"              "SASEnterprise"       
## [40] "SASJMP"               "Spark"                "SQL"                 
## [43] "Stan"                 "Statistica"           "Tableau"             
## [46] "TensorFlow"           "TIBCO"                "Unix"                
## [49] "Select1"              "Select2"

Each of the names above references a column which contains the response from each respondent describing how often they used a particular tool. For example, the response for Python:

# distribution of responses about python use
survey %>% 
  select(contains("WorkToolsFrequencyPython")) %>% 
  table(useNA = "ifany")
## .
## Most of the time            Often           Rarely        Sometimes 
##             3330             1294              346             1020 
##             <NA> 
##            10726

Note that many respondents did not respond to this question, and a large number of rows are populated enirely by NA entries as a result:

# how many respondents didn't respond at all to this quession?
survey %>% 
  select(contains("WorkToolsFrequency")) %>% 
  apply(., 1, function(v) sum(!is.na(v)) > 0) %>%
  table()
## .
## FALSE  TRUE 
##  8846  7870

In fact, over 8800 don’t seem to have responded, and perhaps they skipped this question altogether. It might make sense to remove these rows, for example if we wanted to report the overall percentage of respondants that used Python Most of the time. In the next section, we’ll simply tally the number of responses by column, and can ignore this missingness.

Generate treemap using ggplot2

For treemap, we’ll binarise the 50 columns with 1 representing Most of the time and 0 representing any other response type (inluding non-response). We’ll treat each non-zero element as a ‘vote’ for a tool, keeping in mind that each respondant can vote for more than one tool. It would be interesting to see a plot showing the proportion of votes cast for a particular piece of software as a proportion of the whole. The code below performs a pre-processing step:

survey_filter <- survey %>% 
  filter(DataScienceIdentitySelect == "Yes") %>% 
  select(contains("WorkToolsFrequency")) %>% 
  lapply(function(v) sum(v %in% c("Most of the time"), na.rm = T)) %>% 
  as_tibble() %>% 
  gather() %>%
  transmute(Tools = gsub("WorkToolsFrequency", "", key), Frequency = value)

The statement above takes the following steps:

  • filter rows to only those respondents that self-identify as data scientists
  • select columns prepended with the string WorkToolsFrequency
  • Use lapply to find the total number of times respondents responded with Most of the time usage, for each column
  • the result of lapply is a list - convert back into a tibble using as_tibble
  • use gather to reshape the data, so that the column name becomes a categorical column
  • Use transmute to renames frequency column, and strip out the string WorkToolsFrequency so that we’re left with a simpler tool name

From the top 5, it’s pretty clear that Python is the stand-out tool, followed by R, SQL and Jupyter.

survey_filter %>% 
  arrange(desc(Frequency)) %>% 
  slice(1:10)
## # A tibble: 10 x 2
##         Tools Frequency
##         <chr>     <int>
##  1     Python      1033
##  2          R       646
##  3        SQL       521
##  4    Jupyter       371
##  5       Unix       264
##  6 TensorFlow       169
##  7        AWS       132
##  8     MATLAB       131
##  9          C       123
## 10    Select1       103

Finally, we’re ready to generate our treemap showing the composition of votes:

survey_filter %>% 
  ggplot(aes(area = Frequency, fill = 1:nrow(.), label = Tools)) + 
  geom_treemap(show.legend = F, fill = sample(rainbow_hcl(50),50), col = 1)  + 
  geom_treemap_text(min.size = 2, colour = "white", grow = T)

Updated: