Extended Tutorial: How to Predict Employee Turnover

periscope

 

When employees walk out the door, they take substantial value with them.

Not surprisingly then, the broader application of predictive modeling across the enterprise along with the emergence of HR Analytics is leading organizations to ask how HR can start using data to predict and ultimately reduce employee turnover.

The goal of this primer is to provide the basic guidelines and starter code you need for two essential yet powerful techniques for predicting employee turnover. The code is in R, the premier open-source tool for data manipulation, analysis, and visualization.

By applying the foundational techniques we are sharing here to your own organization, you can quickly accelerate the impact of HR Analytics  and place yourself at the forefront of that discussion.

Table of Contents

What You Will Learn

  1. How to identify candidate variables using basic descriptive analytic techniques
  2. How to build predicitive models using Logistic Regression and Decision Trees
  3. How to evaluate your models using accuracy and the “ROC” method

Goal and Overview

The specific goal here is to predict whether an employee will stay or voluntary leave within the next year. In the present data, this means predicting the variable “vol_leave” (0 = stay, 1 = leave) using the other columns of data. You can think of this data as historical data which tells us who did and who did not leave within the last year.

Our initial step is to describe and visualize our data. Then, we will develop two different kinds of predictive models. The first of these is a logistic regression model. Logistic regression models predict the likelihood of a categorical outcome, here staying or leaving.

The second kind of model is known as a decision tree (or a classification tree). A decision tree is essentially a set of rules for splitting the data into buckets to help us predict whether the employees in those buckets will end up in one group (staying) or another group (leaving).

In both cases, we are classifying into just two possible groups. This is known as “binary classification”.

After we lay the foundation for model development, we then explain how to evaluate model quality using overall model accuracy and the Receiver Operator Curve (ROC). The ROC tells us what the best “threshold” or “cutoff” value to use when determining whether someone will leave or not.

While there are certainly more complex techniques for predicting turnover, logistic regression and decision trees both work extraordinarily well. Moreover, they are comparatively easy to implement and, most importantly, easier to interpret and explain. This is critical when translating modeling insights into action.

Data and R Libraries

For the purposes of this primer, we have created a simulated but nonetheless realistic dataset for you to download here . This dataset has 8 different variables like those that you might have in your own HR data. To find out more about considerations for selecting variables that help predict turnover, see our variables checklist (along with the related post) as well as these posts on employee demographics, and  team-level factors.

After downloading the data, run the following code to set up the packages you will need. Remember that if you do not already have these packages installed, you can easily install them using either the Packages –> Install option in the RStudio console (preferred) or the install.packages() command. Like everything with open-source R, these packages are totally free.

library(plyr)
library(dplyr)
library(ggplot2)
library(caTools) # for sample splitting
library(ROCR) # for the ROC/ AUC measure
library(pROC) # for the ROC/ AUC measure
library(rattle) # Visualization of Decision Trees
library(rpart.plot)
library(RColorBrewer)
library(psych) # for point biserial correlation

# Load data ---------------------------------------------------------------

# d <- read.csv("your_file_location/Final_Turnover_Data_CSV.csv")

d <- read.csv("./Final_Turnover_Data_CSV.csv")

Data Exploration and Visualization

As I have discussed in a *** previous post ***, there are essentially three levels of analytics: Descriptive, Predictive, and Prescriptive.

Our goal is to predict voluntary turnover but remember that everything, and I really do mean EVERYTHING, starts with with describing and visualizing our data.

Let’s start with the dimensions and the summary command.

dim(d) # dataset size
## [1] 11111     8
summary(d) # basic summary stats
##        role            perf               area          sex      
##  CEO     :    1   Min.   :1.000   Accounting:1609   Female:6068  
##  Director:  100   1st Qu.:2.000   Finance   :1677   Male  :5043  
##  Ind     :10000   Median :2.000   Marketing :2258                
##  Manager : 1000   Mean   :2.198   Other     :2198                
##  VP      :   10   3rd Qu.:3.000   Sales     :3369                
##                   Max.   :3.000                                  
##        id             age            salary          vol_leave     
##  Min.   :    1   Min.   :22.02   Min.   :  42168   Min.   :0.0000  
##  1st Qu.: 2778   1st Qu.:24.07   1st Qu.:  57081   1st Qu.:0.0000  
##  Median : 5556   Median :25.70   Median :  60798   Median :0.0000  
##  Mean   : 5556   Mean   :27.79   Mean   :  65358   Mean   :0.3812  
##  3rd Qu.: 8334   3rd Qu.:28.49   3rd Qu.:  64945   3rd Qu.:1.0000  
##  Max.   :11111   Max.   :62.00   Max.   :1000000   Max.   :1.0000

With over 11,000 cases, we have a good foundation to work with (although you don’t need this many employees to apply these methods to your own company). The id variable is just an employee number so we won’t include that in our model or our breakdown. We would still need this information though to map our results to our workforce.

The summary data tells us that we have 5 basic roles: CEO, VPs, Directors, Managers, and Individual Contributors. My working assumption is that CEOs and VPs experience a very different labor market than Directors, Managers, and Individuals. Plus, we don’t have very many CEOs and VPs anyway so including them in our modeling effort doesn’t make sense.

Let’s filter them out.

d <- filter(d, role == "Ind" | role == "Manager" | role == "Director") 
d$role <- factor(d$role) # resetting role so the CEO and VP are listed anymore 
summary(d)
##        role            perf               area          sex      
##  Director:  100   Min.   :1.000   Accounting:1607   Female:6064  
##  Ind     :10000   1st Qu.:2.000   Finance   :1676   Male  :5036  
##  Manager : 1000   Median :2.000   Marketing :2255                
##                   Mean   :2.198   Other     :2197                
##                   3rd Qu.:3.000   Sales     :3365                
##                   Max.   :3.000                                  
##        id             age            salary         vol_leave     
##  Min.   :   12   Min.   :22.02   Min.   : 42168   Min.   :0.0000  
##  1st Qu.: 2787   1st Qu.:24.07   1st Qu.: 57080   1st Qu.:0.0000  
##  Median : 5562   Median :25.70   Median : 60788   Median :0.0000  
##  Mean   : 5562   Mean   :27.77   Mean   : 64860   Mean   :0.3815  
##  3rd Qu.: 8336   3rd Qu.:28.48   3rd Qu.: 64928   3rd Qu.:1.0000  
##  Max.   :11111   Max.   :61.67   Max.   :311131   Max.   :1.0000

That’s much better.

Note the average of our 0/1 “leaving” variable. Because we have coded the voluntary termination variable as a 0 (stay) or 1 (leave), the mean tells us that the overall, a whopping 38% percent of our employees left in the last year (at this fictional company anyway)

Now let’s do some basic investigation of the remaining variables and get a feel for the their relationship to turnover.

Performance

We’ll start with performance, here graded on a simple 1 (low) to 3 (high) scale.

# A Nicer color palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

table(d$perf)
## 
##    1    2    3 
## 1117 6667 3316
hist(d$perf, col = cbPalette[4], main = "Distribution of Performance Ratings", 
     xlab = "Performance Rating", ylab = "# Employees" )

The table and histograms give us a look at the distribution. With only three possible values, it doesn’t make sense to talk about it as being normal or skewed, but we can see that there are at least a reasonable number of individuals in each group.

When we use the aggregate function to break down the probabilities by performance group, we can see turnover is much higher for the high performance group. That’s a big problem and definitely something we’ll keep for the model.

agg_perf <- aggregate(vol_leave ~ perf, data = d, mean)
print(agg_perf)
##   perf vol_leave
## 1    1 0.3375112
## 2    2 0.3383831
## 3    3 0.4831122
ggplot(agg_perf, aes(x = perf, y = vol_leave)) + geom_bar(stat = "identity", fill = cbPalette[4]) + 
    ggtitle("Voluntary Termination Rate by Performance Rating") + 
    labs (y = "Proportion Leaving", x = "Performance Rating")

Business Area and Males/Females

Now let’s do the same breakdown for for business areas and male v. females.

agg_sex <- aggregate(vol_leave ~ sex, data = d, mean)
print(agg_sex)
##      sex vol_leave
## 1 Female 0.4673483
## 2   Male 0.2781970
ggplot(agg_sex, aes(x = sex, y = vol_leave)) + geom_bar(stat = "identity", fill = cbPalette[4]) + 
    ggtitle("Voluntary Termination Rate by Sex") + 
    labs (y = "Proportion Leaving", x = "Performance Rating")

Definitely major issues for our female employees v. the males. If this was your company, alarm bells should be going off and the investigative needs would start with more descriptive analyses and some additional, focused modeling.

Now let’s check out what happens with business areas.

agg_area <- aggregate(vol_leave ~ area, data = d, mean)
print(agg_area)
##         area vol_leave
## 1 Accounting 0.3055383
## 2    Finance 0.3126492
## 3  Marketing 0.2815965
## 4      Other 0.3040510
## 5      Sales 0.5696880
ggplot(agg_area, aes(x = area, y = vol_leave)) + geom_bar(stat = "identity", fill = cbPalette[4]) + 
    ggtitle("Voluntary Termination Rate by Business Area") + 
    labs (y = "Proportion Leaving", x = "Business Area")

Those in Sales are much more likely to leave. Again, we’ll definitely want to keep this for our models too.

Are there any clear differences for males v. females within the different areas?

agg_as <- aggregate(vol_leave ~ area + sex, data = d, mean)
print(agg_as)
##          area    sex vol_leave
## 1  Accounting Female 0.3923337
## 2     Finance Female 0.3923497
## 3   Marketing Female 0.3691550
## 4       Other Female 0.3828383
## 5       Sales Female 0.6624795
## 6  Accounting   Male 0.1986111
## 7     Finance   Male 0.2168200
## 8   Marketing   Male 0.1785714
## 9       Other   Male 0.2071066
## 10      Sales   Male 0.4589309
ind <- order(agg_as$area, agg_as$sex) #reordering so make comparisons easier

print(agg_as[ind,])
##          area    sex vol_leave
## 1  Accounting Female 0.3923337
## 6  Accounting   Male 0.1986111
## 2     Finance Female 0.3923497
## 7     Finance   Male 0.2168200
## 3   Marketing Female 0.3691550
## 8   Marketing   Male 0.1785714
## 4       Other Female 0.3828383
## 9       Other   Male 0.2071066
## 5       Sales Female 0.6624795
## 10      Sales   Male 0.4589309
ggplot(agg_as, aes(x = area, y = vol_leave, fill = sex)) + 
    geom_bar(aes(fill = sex), stat = "identity", position=position_dodge()) + 
    scale_fill_manual(values=cbPalette[3:4]) +
    ggtitle("Voluntary Termination Rate by Business Area") + 
    labs (y = "Proportion Leaving", x = "Business Area")

Looking at both the table and the plot, we can see the difference is generally 20% across all areas. There might a slightly elevated difference for females in the sales group, but nothing huge.

Any significant interaction should be fleshed out in our later modeling anyway so we can move on.

Age

Let’s see about the age of our employees

hist(d$age, breaks = 100, col = cbPalette[4], main = "Age Distribution", border = F, xlab = "Age")

quantile(d$age)
##       0%      25%      50%      75%     100% 
## 22.02289 24.07050 25.69533 28.48035 61.67132

There is a strong skew here, with 50% of our workforce between 22 and 26 years of age. Let’s remember though that we have three different levels: individuals, managers, and directors.

It will be more informative to see how those ages breakdown when we take that into account. Let’s use some boxplots. As a reminder, the black lines within each box represent the median.

d$role <- factor(d$role, levels = c("Ind", "Manager", "Director")) # reording levels of role

boxplot(age ~ role, data = d, col= cbPalette[4])

Our boxplot shows strong differences in age by the role with no meaningful overlap.

This means there is a strong relationship between Age and Role, although the difference would not likely be so pronounced with data from a real company.

Regardless, if we only retained one of these variables in the model, we might mistakenly attribute turnover differences to age when they are instead more related to role, or vice versa.

This is a reminder that it’s important to take a good look at your data BEFORE you start your modeling journey. If you don’t you risk drawing the wrong conclusions.

Still, given that age is extremely skewed, we should be concerned about including it in our logistic regression model as is.

We’ll address that by creating a new variable that takes the log of age. This is known as a “data transformation” is often done to make the distributions of a continuous variable more normally distributed.

This will give us a less skewed value and reduce any problems for inclusion in our later model.

d$log_age <- log(d$age)

That said, let’s break down the turnover rates by role and by our non-transformed age measure separately for now.

agg_role <- aggregate(vol_leave ~ role, data = d, mean)
print(agg_role)
##       role vol_leave
## 1      Ind    0.3749
## 2  Manager    0.4580
## 3 Director    0.2800
ggplot(agg_role, aes(x = role, y = vol_leave)) + 
    geom_bar(stat = "identity", fill = cbPalette[4], width = .7) + 
    ggtitle("Voluntary Termination Rate by Role") + 
    labs (y = "Proportion Leaving", x = "Role") 

Here we can see that our Managers are particularly likely to leave.

To break make our Age breakdown more manageable, we’ll break up age into 10 pieces using the cut function.

We’ll use a slightly different formulation of the aggregate function here as a result.

agg_age <- aggregate(x = d$vol_leave, by = list(cut(d$age, 10)), mean)
print(agg_age)
##        Group.1         x
## 1      (22,26] 0.3866177
## 2      (26,30] 0.3645902
## 3    (30,33.9] 0.3374536
## 4  (33.9,37.9] 0.3992806
## 5  (37.9,41.8] 0.4155405
## 6  (41.8,45.8] 0.4640288
## 7  (45.8,49.8] 0.4333333
## 8  (49.8,53.7] 0.4260870
## 9  (53.7,57.7] 0.4666667
## 10 (57.7,61.7] 0.2727273
names(agg_age) <- c("Age", "Probability")
    
ggplot(agg_age, aes(x = Age, y = Probability)) + 
    geom_bar(stat = "identity", fill = cbPalette[4], width = .7) + 
    ggtitle("Voluntary Termination Rate by Role and Age") + 
    labs(y = "Proportion Leaving", x = "Age Cut") 

This plot suggests a steady increase in turnover rate between age 34 and 46, a generally elevated rate until 58, and followed by a steep dropoff. It’s important to also remember that we arbitrarily chose to slice age into 10 pieces, but the broad pattern is still informative.

Salary

Finally, let’s take a look at salary.

summary(d$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42170   57080   60790   64860   64930  311100
hist(d$salary, breaks = 50, col = cbPalette[4], main = "Salary Distribution", 
     xlab = "Salary")

quantile(d$salary, probs = seq(0,1,.2))
##        0%       20%       40%       60%       80%      100% 
##  42168.22  56189.17  59385.03  62307.14  66151.43 311130.51
aggregate(salary ~ role, data = d, mean)
##       role    salary
## 1      Ind  60069.31
## 2  Manager  99355.64
## 3 Director 198950.83
boxplot(salary ~ role, data = d, col = cbPalette[4], main = "Salary by Role")

We can see three distinct pieces, corresponding tightly to the role. Given the huge differences in salary, we should find another way to represent salary that is more meaningful across the roles.

One way is think about salary relative to one’s peers. For example, an individual contributor would be delighted to make $80,000 considering the median for that group. A manager, on the other hand, might not given the median manager salary of roughly $100,000.

This suggests that looking at salary relative to the median of an employee’s role makes more sense.

To do this, we’ll first calculate the median salary for each group and then create a new salary measure that measures a salary relative to that employee’s role median.

In this case, we’ll simply get the difference between the employee’s salary and that of the median. By this new metric, salaries lower than the median of the relevant role will be negative, those higher than the median will be positive.

This will put the salaries on a relative footing.

Observe that in data science parlance, this step of creating a new predictor variable from the data available is part of “feature engineering”. This is a huge topic by itself but the lesson for you is this: you DO NOT need to just stick with the original raw variables; those are just the starting point.

median_role_salary <- aggregate(salary ~ role, data = d, median) # get the median
print(median_role_salary)
##       role    salary
## 1      Ind  60102.17
## 2  Manager  99545.18
## 3 Director 195598.67
names(median_role_salary)[2] <- "role_median_salary" #rename the median variable

# using merge to add the median salary measure for each person according to role

d <- merge(d, median_role_salary, by = "role")

d$sal_med_diff <- d$salary - d$role_median_salary # creating a difference metric

Let’s see how this new salary metric look and examine its relationship to turnover. We’ll use a “point biserial” correlation for calculating the relationship between the continuous variable (difference of salary from the median) and the categorical variable (0 = stay, 1 = leave).

hist(d$sal_med_diff, breaks = 50, 
     main = "Distribution of Salary Differences \n from Role Median Salary", 
     col = cbPalette[4], xlab = "Difference from Median")

aggregate(sal_med_diff ~ vol_leave, data = d, mean)
##   vol_leave sal_med_diff
## 1         0     827.8958
## 2         1   -1385.2237
biserial(d$sal_med_diff, y = d$vol_leave)
##            [,1]
## [1,] -0.1969457

The histogram shows a generally normal shape, much nicer for modeling purposes. Looking at the mean salary difference split by leave/ stay status also suggests some relationship.

The point biserial suggests a negative relationship, indicating that as one’s salary increases relative to the role median, the likelihood of leaving decreases. This makes sense since those earning more relative to their true peers in the same role are less likely to be looking elseswhere for a better deal.

We’ll this new metric for our models instead of raw salary.

Data Splitting

Before we start creating models, we need to split our data into a training set and a test set.

Why?

Remember that the goal of predictive modeling to predict future outcomes The best way to test our models is to see how well they do on data they have not yet seen.

If we don’t test our models on new data that was not part of the model development process, we will overestimate the quality of our model. Indeed, some modeling methods are so powerful that they can account for 100% of the the outcomes in data used for training the model.

Such models are generally just following the noise and would not generalize well to new data. There are more sophisticated methods for splitting your data into training and test sets, but simple splitting is the best way to start for now. You can always dig deeper into the issues and methods after you get oriented to the concept.

Bottom line? You can’t know how well your model works until you test it on data that it has not yet seen.

In the present case, we will use two-thirds of the data for training and model development and the remaining third of the data for testing the models.

The splitting method here is random with the constraint that both sets will have the same proportion of leavers and stayers. We set the random seed to a specific (but arbitrary) number so we can always replicate our results.

set.seed(42) # setting the random seed for replication
spl <- sample.split(d$vol_leave, 2/3)
train <- d[spl,]
test <- d[!spl,]

# Checking proporion leaving is same for both sets
mean(test$vol_leave)
## [1] 0.3816216
mean(train$vol_leave)
## [1] 0.3814865

Our Models: A Brief Introduction

We are covering two modeling techniques: Logistic Regression and Decision Trees.

I have chosen these because they are easier to implement and much easier to interpret. There are of course more powerful modeling approaches but logistic regression and decision trees can get you an 80% solution with about 20% of the work. Indeed, when it comes to HR analytics, the fastest way to improve your model is generally through good variable selection and feature engineering, not fancier models.

To get a handle on these techniques, just remember the goal: predict a categorical response variable (staying or leaving) from the available predictor variables.

Note: in statistics and the social sciences they often refer to the “dependent variable” (response) and the “independent variables” (predictors).

How do these two techniques work? There are ample resources on these topics (see Resources), but in a nutshell, logistic regression constructs an equation that in effect predicts the likelihood of a two-category outcome (here, staying or leaving) using the selected predictors. Each of the predictors are associated with a “signficance”” indicator that tells you if the predictor is useful or not.

By contrast, decision trees work by using the predictors to split the data into buckets using a set of decision rules. The aim is to create a set of buckets where the outcomes are more uniform with respect to the outcome.

In our current data for example, the overall probability of leaving is 38% and we are working to create different buckets in which some buckets of employees are much more likely to leave (say, 70%) and other buckets have employees are much more likely to stay (say, 80%).

You’ll see how this works in the examples below so let’s get started.

The Logistic Regression Model: Super Basics

With our data now split, we’ll use the glm function (general linear model) to create our logistic regression model. The formulation below shows that we are predicting the vol_leave variable (left side of formula) with the selected set of predictors (right side of formula).

In addition to including each of the variables we discussed in the descriptive analyses above, we are also including the “sex*area” interaction term.

This provides a test for whether the impact of the male/female variable on turnover differs according to the business area (or equivalently, whether the impact of business area on turnover differs for males v. females).

The “family” argument of the function is set to “binomial”, telling the model that we have a 0/1 response outcome.

Interpreting the Model Output

m1 <- glm(vol_leave ~ perf + role + log_age + sex + area + sal_med_diff + sex*area, data = train, family = 'binomial')
summary(m1) # getting model results
## 
## Call:
## glm(formula = vol_leave ~ perf + role + log_age + sex + area + 
##     sal_med_diff + sex * area, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4564  -0.9119  -0.6072   1.0910   3.0738  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            8.596e-01  8.136e-01   1.057  0.29071    
## perf                   4.602e-01  4.392e-02  10.477  < 2e-16 ***
## roleManager            6.892e-01  1.523e-01   4.526 6.01e-06 ***
## roleDirector          -2.874e-01  4.277e-01  -0.672  0.50168    
## log_age               -7.119e-01  2.474e-01  -2.877  0.00401 ** 
## sexMale               -1.154e+00  1.474e-01  -7.828 4.96e-15 ***
## areaFinance           -9.511e-02  1.224e-01  -0.777  0.43702    
## areaMarketing         -6.155e-02  1.136e-01  -0.542  0.58809    
## areaOther             -1.942e-01  1.150e-01  -1.689  0.09124 .  
## areaSales              1.167e+00  1.073e-01  10.880  < 2e-16 ***
## sal_med_diff          -6.418e-05  4.561e-06 -14.073  < 2e-16 ***
## sexMale:areaFinance    2.896e-01  2.030e-01   1.426  0.15375    
## sexMale:areaMarketing  1.537e-01  1.908e-01   0.805  0.42054    
## sexMale:areaOther      3.521e-01  1.927e-01   1.828  0.06760 .  
## sexMale:areaSales      2.685e-01  1.727e-01   1.555  0.12001    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9838.8  on 7399  degrees of freedom
## Residual deviance: 8675.5  on 7385  degrees of freedom
## AIC: 8705.5
## 
## Number of Fisher Scoring iterations: 4

Whoa! That’s a lot of output. What does it mean? Let’s focus on the basics.

The variable names are listed on the far left under “Coefficients”.

In our variables, we don’t see “Individual” listed. This is because there are only three roles and everyone is in one of those three roles.

R recognizes this and uses the “individuals” as a base line. It then includes the impact estimate for Managers and Directors relative to this baseline in the list output.The same is true for area (does not list “Accounting”) and sex (does not list “Female”“).

This DOES NOT mean the model ignored those values. It just means that it used those values as the baseline. These baselines can specified explicitly but that is beyond the scope of this primer.

The signifance values are provided in the list on the far right under (Pr>|z|). This tells us the probability of observing the estimated parameter value by chance. By convention, the magic number here is .05 (indicated by one or more *s).

Any parameter estimate with a Pr value of less than .05 is unlikely to be obtained by pure chance. We’ll definitely want to keep those significant variables in the model.

What about those non-significant parameter values? For the the purposes of this exercise, we will drop those predictors that are not significant. The primary benefit of dropping non-significant predictors here is just to keep the model reasonably simple. In addition, we are not presenting our model as a specific theoretical assertion or test of the importance/ non-importance of certain predictors. The reader should be aware, though, that this is rather complex issue and there are circumstances in which keeping non-significant predictors is a good idea (see also Resources below for a related link).

The Performance Variable

Not surprisingly, we see that those rated more highly on performance are significantly more likely to leave, consistent with our descriptive analyses.

The Role Variable

In the case of Managers, we see that the estimate is positive, meaning that Managers are significantly more likely to leave. Surprisingly though, the designation of Director did not significantly predict turnover.

Given the marked dropoff in turnover we saw for Directors in our descriptive analyses, how can this be? The short answer is that when we include other variables such as age and performance, the impact of Director per se disappears.

One might therfore consider combining roles to create a Managers v. Ind/ Directors. Sometimes this makes sense, but clearly merging those roles into just two categories is unnatural; we know directors are different in multiple ways.

Alternatively, and what I would actually consider in the real world before testing these models, is simply developing a separate model for Directors and dropping them from the individual and manager model; the factors that impact turnover for Directors are probably different people lower down on the ladder.

There are pros and cons of those approaches but we will ignore this issue for present purposes. Just be aware that these issues are quite common with the discretion left to the modeler.

The Age Variable

Our transformed age variable is significant and negative. This tells us that older employees are less likey to leave.

The Male/ Female Variable

We see a significant (p < .05) and negative parameter value males. This means males are less likely to leave than females.

The Area Variable

The interpretation for Area is a bit more complex. The bottom line is that those in the Sales are significantly more likely to leave (positive and statistically significant parameter value) while there are no significant differences associated with the other areas.

This is consistent with our descriptive analysis above. Again, to keep this primer focused, we will just leave the Area values in as they are. In principle, though, one might wish to simply code area as just two categories (Sales v. Non-Sales) to reduce the model complexity.

The Area X Sex Interaction Variable

Finally, let’s look at our interaction terms. Each term tells us whether being male impacts the contribution of area to retention. Across the board, we see that none of these are significant.

We will therefore drop this interaction term from our final model because it is not helping us predict which employees will leave and which will stay.

Final Logistic Regression Model

m2 <- glm(vol_leave ~ perf + role + log_age + sex + area + sal_med_diff, data = train, family = 'binomial')
summary(m2) # getting model results
## 
## Call:
## glm(formula = vol_leave ~ perf + role + log_age + sex + area + 
##     sal_med_diff, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4649  -0.9102  -0.6136   1.0909   3.0698  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    7.801e-01  8.121e-01   0.961  0.33677    
## perf           4.592e-01  4.390e-02  10.461  < 2e-16 ***
## roleManager    6.903e-01  1.523e-01   4.533 5.81e-06 ***
## roleDirector  -2.865e-01  4.280e-01  -0.669  0.50325    
## log_age       -7.120e-01  2.474e-01  -2.878  0.00401 ** 
## sexMale       -9.256e-01  5.345e-02 -17.318  < 2e-16 ***
## areaFinance    9.734e-03  9.719e-02   0.100  0.92023    
## areaMarketing -7.683e-03  9.062e-02  -0.085  0.93243    
## areaOther     -6.746e-02  9.186e-02  -0.734  0.46276    
## areaSales      1.269e+00  8.326e-02  15.237  < 2e-16 ***
## sal_med_diff  -6.427e-05  4.561e-06 -14.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9838.8  on 7399  degrees of freedom
## Residual deviance: 8679.7  on 7389  degrees of freedom
## AIC: 8701.7
## 
## Number of Fisher Scoring iterations: 4

Interpreting the Model Predictions

With our final model specified, it’s time to actually see the predicted output from the fitted values in the model output. The fitted values provide the predicted probability of leaving for each individual.

This individual-level data is really useful in the real world because we can then calculate probabilities of leaving for any set of arbitrarily defined groups after creating our model. For example, we can group our prediction to predict “Females in Sales”, “Managers in Accounting”, or even “low performing males under 30 in Sales”.

Individual-level predictors allow one to drill down and slice the data anyway you wish while still leveraging all of the data available across the workforce.

Let’s see the distribution.

hist(m2$fitted.values, main = "Distribution of Predicted Probabilities", 
     xlab = "Probability of Leaving", col = cbPalette[4], border = F, breaks = 50)
abline(v = .5, col = "red", lwd = 3)

prop.table(table(m2$fitted.values >= .5))
## 
##     FALSE      TRUE 
## 0.7417568 0.2582432
prop.table(table(m2$fitted.values >= .3))
## 
##     FALSE      TRUE 
## 0.3904054 0.6095946

Our histogram (and calculations) for the training data show us that we have roughly 25% of our employees with a probability of leaving at 50% or higher and 60% of our employees with a probability of leaving greater than 30%.

Now what? These data give us a feel for the output on the training data (and yes, you should ALWAYS plot them) but of course the real test is seeing how the model did with the novel test data.

Predicting with New Data

Measuring Accuracy

To figure out how acccurate our model is at predicting turnover, we need to first take our model and use it to predict the outcomes for our test dataset. We’ll use the “predict” function and feed it our final m2 model that we built above. We will also set the newdate to our “test” dataset from our split, and set type = “response” to get the predicted probabilities for each person.

m2_test <- predict(m2, newdata = test, type = "response")

hist(m2_test, main = "Distribution of Test Set \nPredicted Probabilities", 
     xlab = "Probability of Leaving", col = cbPalette[4], border = F, breaks = 50)
abline(v = .5, col = "red", lwd = 3)

prop.table(table(m2_test >= .5))
## 
##     FALSE      TRUE 
## 0.7316216 0.2683784
prop.table(table(m2_test>= .3))
## 
##     FALSE      TRUE 
## 0.3702703 0.6297297

Similar to our training set predictions, we see that 27% have a predicted probability greater than .5 and 63% with a predicted probability of greater than .3. Note again though that the model has provided the individual probabilities, not a 0/1 response.

The Cutoff

To get the 0/1 value we need to measure model accuracy, we’ll need to first select some cutoff value for predicting whether someone will leave or not. A good starting point is .5. After all, those with a probability value greater than .5 are more likely to leave than those with less than a .5 predicted probability.

Our reference point is the base rate which for the current data is 61.8% staying and 38.2% leaving. Thus, if we had no information available, we could simply predict that everyone will stay and be correct 61.8% of the time. You could think about this as a “majority class model” in which our best guess is picking the largest class.  Let’s hope our model does better.

The Confusion Matrix

To check accuracy, we’ll build a “confusion matrix”. This allows is to compare the observed results from our test set with our predicted results. We will set the cutoff as .5 so that anyone with a value over .5 will categorized as leaving.

prop.table(table(test$vol_leave))
## 
##         0         1 
## 0.6183784 0.3816216
accuracy <- table(m2_test > .5, test$vol_leave) # confusion matrix

addmargins(table(m2_test > .5, test$vol_leave))
##        
##            0    1  Sum
##   FALSE 1919  788 2707
##   TRUE   369  624  993
##   Sum   2288 1412 3700

The values on the right represent the predicted outcome (where False = 0 and True = 1). The values for the columns represent the real, observed outcomes.

To get the model accuracy, we will sum the correctly classified observations (values on the diagonal) and divide by the total number of observations (3700). This will give us the proportion correct.

sum(diag(accuracy))/sum(accuracy)
## [1] 0.6872973

The result is 68.7% correct v. the “no model” case of 61.8%. It’s not spectacular mind you, but definitely better than nothing.

Remember too that we are dealing with far fewer variables than you would normally have in your HR data. Given just a handful of predictors, the model is actually doing quite well.

Choosing a Cutoff

Let’s take a minute to think a bit more about that .5 cutoff.

addmargins(accuracy)
##        
##            0    1  Sum
##   FALSE 1919  788 2707
##   TRUE   369  624  993
##   Sum   2288 1412 3700

When we choose this cutoff, our model catches 624 of the 1412 turnover (leave) events. This gives us a True Positive Rate of 44% (624/1412). That is, we are correctly labeling 624 of the 1412 “leave” cases. Correspondingly, that same .5 cutoff catches 1919 of the 2288 stay events, yielding a True Negative Rate of 84% percent.

What happens if we lower our threshold to .3 instead?

acc2 <- table(m2_test > .3, test$vol_leave) # confusion matrix

addmargins(acc2)
##        
##            0    1  Sum
##   FALSE 1114  256 1370
##   TRUE  1174 1156 2330
##   Sum   2288 1412 3700

Here, we end up catching many more of the leave events (1156), giving us a much higher True Positive Rate of 82% (1156/1412). That’s a lot better, but loosening the cutoff criteria also dramtically lowers the True Negative Rate to 48% (1112/2288).

By lowering the criterion, we ended up labeling too many of the stayers as leavers. We also end up reducing our overall accuracy to 61.3% (2270/3700). The world is full of tradeoffs.

To make this more concrete, think instead of a test for a disease. In that case, we might be willing to lower the cutoff to make sure we are catching those with the disease, even at the expense of more False Positive results.

We will deal with this more in the ROC section below. The take home message for now is that changing that cutoff can have a big impact on your interpretation of the model; the actions you wish to take given your results should guide your decision.

The Decision Tree Model

With logistic regression, we needed to test to see which predictors to retain for our model and that can get complicated.

Decision Trees (DT) make everything easier by telling us which predictors are useful. Indeed, it is often helpful to include some DT modeling as part of your data exploration and descriptive analytics process because they can provide some hints as to what variables to consider.

More technically, DTs are robust to issues like predictors with skewed distributions because the underlying math does not depend on estimating parameter values and having nice, normally distributed predictors. In practice, you won’t need to worry about transforming your variable like we did with Age in our logistic regression model.

We also DO NOT need to specify or test any interaction terms when using decision trees. The tree building process will find any interactions the matter. I like that and so should you.

Basic Decision Tree

Let’s get started with a simple model with only a few predictors. As before, we will create our model using the training data and then use the resulting model to make predictions with our test data.

# Simple Model

set.seed(42)
fit <- rpart(vol_leave ~ role + age + sex + area, data=train,
             method="class")
fit # basic model results
## n= 7400 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 7400 2823 0 (0.6185135 0.3814865)  
##   2) area=Accounting,Finance,Marketing,Other 5188 1544 0 (0.7023901 0.2976099) *
##   3) area=Sales 2212  933 1 (0.4217902 0.5782098)  
##     6) sex=Male 1015  479 0 (0.5280788 0.4719212) *
##     7) sex=Female 1197  397 1 (0.3316625 0.6683375) *
# summary(fit) # more detail if you want it

# Great plotter from the Rattle package

par(mar = c(5,4,1,2)) # setting the margins
fancyRpartPlot(fit, sub = NULL, main = "Basic Decision Tree")

One thing is clear: trees are nice to look and, as you will see, much easier to interpret and communicate.

Interpreting the Model Output

Let’s start at the top. The first piece is referred to as the root. The “0” refers to the dominate (mode) case. Here, 62% of those in our training data have 0 (stayers) for the response variable and 38% have a 1 (leavers).

The root conveniently shows these overall base rates. In addition, it shows that this initial partition (based solely in the no-information base rate) accounts for 100% of the training data.

Below that, we see our first decision node. If our employees are in the Accounting, Finance, Marketing, or Other areas, then we say “yes” and take the left branch. If the answer is no (i.e. they are in sales), then we take the right branch.

Following the left branch, we see that it terminates into a single node. Think of this node like a bucket for all of those who are not in Sales. For all of these people, the most common (mode) response is “0”, with 70% people as stayers and only 30% in this bucket leaving. The “70%” reported in the bottom of the node tells us that this single bucket accounts for 70% of the total sample we are modeling.

Now, let’s follow the top right branch instead. We see here that the most common (mode) response is “1” for the leavers. We know this because there is a “1” at the top of the node and it is colored blue instead of green. In addition, the node is also telling us 42% of those in this “bucket” are stayers while 58% are leavers.

Finally, the node tells us that 30% of the total training sample we are modeling falls into this group. The 30% here plus the 70% from the terminating left node gives us 100% of this training sample.

aggregate(vol_leave ~ area, data = train, mean)
##         area vol_leave
## 1 Accounting 0.3012048
## 2    Finance 0.3057554
## 3  Marketing 0.2970684
## 4      Other 0.2893297
## 5      Sales 0.5782098

Continuing with the right node, we now get another decision point. This time, if the employee is a male, we say “yes” and go to the left. If female, we go to the right.

For the males, we end up in a terminating (light green) node with a dominant 0 response (stay), 53% to 47% for those in this bucket. This bucket accounts for 14% of the total overall sample in our training set, seen at the bottom of the node.

For females, we end up in a terminating node that has a dominant response of 1 (33% stayers v. 67% leavers). This terminating node accounts for 16% of the total population.

Notice that adding up all of the final node percentages gives us 100%. Each terminating node then tells us at a glance how many people are in what bucket.

More importantly, observe that the nodes are seeking to make the overall splits more uniform with respect to the outcomes within each group (darker colors indicate more similar 0/1 responses for those in a bucket).

You can see this by comparing our root node at the top (62% / 38% ) with the far left node (70%/30%) and the far right node (33%/ 67%). Yes, the middle node is a bit more mixed, but that reduced confidence in the predicted outcomes applies now to only 14% of our population.

Again, clean, neat, visually pleasing, easy to explain…and no confusing parameters to interpret.

Now that you understand the basics, let’s give it the full model.

Final Decision Tree and Output

To facilitate comparisons between the models in this tutorial, I am using the variables from the final model specified from the logistic regression. However, I will use the raw age variable (not the transformed one) because we don’t need to. Using the raw variable will make interpretation of Age easier for our decision rules.

Please note that I am only using the predictors from the earlier “final” logistic regression model to make it easier for us to compare and discuss the models in this tutorial. In your own decision tree modeling, you should use all of the reasonable predictors available; you DO NOT need to use other modeling methods beforehand to help you decide what to include.

set.seed(42)
fit <- rpart(vol_leave ~ perf + role + age + sex + area + sal_med_diff, data=train,
             method="class")
fit # basic model results
## n= 7400 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 7400 2823 0 (0.6185135 0.3814865)  
##    2) area=Accounting,Finance,Marketing,Other 5188 1544 0 (0.7023901 0.2976099) *
##    3) area=Sales 2212  933 1 (0.4217902 0.5782098)  
##      6) sex=Male 1015  479 0 (0.5280788 0.4719212)  
##       12) sal_med_diff>=-1022.925 590  234 0 (0.6033898 0.3966102) *
##       13) sal_med_diff< -1022.925 425  180 1 (0.4235294 0.5764706) *
##      7) sex=Female 1197  397 1 (0.3316625 0.6683375) *
# summary(fit) # more detail if you want it

# Great plotter from the Rattle package
par(mar = c(5,4,1,2)) #setting the margins
fancyRpartPlot(fit, sub = NULL, main = "Final Decision Tree")

There a few things to notice about the larger model. First, although we included several more predictors only the difference from median salary ended up being retained in the final model produced.

In addition, note that this predictor only emerges AFTER the “yes, male” decision node. You can think of this as being a form of interaction in which the impact of salary relative to the role median only mattered for the males in this simulated data.

In this model, if that difference is greater than or equal to -$1023 dollars, we answer “yes” and branch off to the left and end up in a 0 dominated noded. If the salary difference is more than $1023 below the role median, we branch right and end up in a mode dominated by leavers (42% stay, 58% leave). Those earning substantially less than the median for their role are seeking greener pastures.

Also observe the greater overall response uniformity in the final buckets. Again, the two middle buckets are a bit more mixed but they represent much smaller pieces of the total training set data.

In addition, remember that while this final model did not include the “Role” variable (Ind, Manager, Director) per se, we used Role when we constructed our salary difference measure. Score one for feature engineering…and thinking about the process and the people first.

Predicting with New Data

As before, we’ll use the predict function, taking our final decision tree model and apply it to the test data. Then we’ll take a quick look at the data structure to see what is going on.

fit_test <- predict(fit, newdata= test, type = "prob")

head(fit_test)
##            0         1
## 1  0.7023901 0.2976099
## 3  0.6033898 0.3966102
## 5  0.7023901 0.2976099
## 9  0.7023901 0.2976099
## 14 0.7023901 0.2976099
## 16 0.7023901 0.2976099

Notice here that now our output has two columns for probability, one for each of the possible outcomes. We’ll need to use the second column when creating our confusion matrix.

prop.table(table(test$vol_leave))
## 
##         0         1 
## 0.6183784 0.3816216
accuracy <- table(fit_test[,2] > .5, test$vol_leave) # confusion matrix

addmargins(table(fit_test[,2] > .5, test$vol_leave))
##        
##            0    1  Sum
##   FALSE 1965  877 2842
##   TRUE   323  535  858
##   Sum   2288 1412 3700
sum(diag(accuracy))/ sum(accuracy) #total on diagonals/ total obs
## [1] 0.6756757

Our accuracy is about 68%, in line with that from the logistic regression model. There are advantages and disadvantages to each modeling method, but decision trees are usually a good starting point and often there is limited loss of predictive power.

ROC Curves: What are They and Who Cares?

We’ve already discussed the impact of changing cutoffs on accuracy, but let’s dig into an alternative way of assessing our logistic regression and decision tree models: the ROC curve.

ROC stands for “Receiver Operating Characteristic” and was first used in WWII to aid radar detection processes. It is still used today in human cognition research, radiology, epidemiology, and predictive models of all sorts.

ROC curves help us evaluate binary classifiers by showing us how different cutoff (threshold) values impact the quality of our model. Said differently, it helps us figure out the best tradeoff between True Positive and False Positive rates.

Understanding ROC Curves: Developing Intuitions

To develop our intuitions, let’s go back to our logistic model and select five possible cutoffs: 0, .3, .5, . 7, and 1. These cutoffs are just illustrative.

Let’s plot our False Positive and True Positive values at each of these sample cutoffs.

# collector for the data that we will plot
rates <- data.frame(fp = numeric(3), tp = numeric(3), cutoff = as.numeric(3)) 
 
for (i in 1:3){ # cycling through the middle cutoff values to test

temp_cutoffs <- c(.3, .5, .7)    
# accuracy <- table(m2_test > i, test$vol_leave) # confusion matrix

temp_res <- addmargins(table(m2_test > temp_cutoffs[i], test$vol_leave))

rates$fp[i] <- temp_res[2,1]/temp_res[3,1]
rates$tp[i] <- temp_res[2,2]/temp_res[3,2]
rates$cutoff[i] <- temp_cutoffs[i]

}

# plotting the calculated rates

plot(rates$fp[1:3], rates$tp[1:3], pch = 19, col = "red", cex = 1.5, 
     xlim = c(0,1), ylim = c(0,1), xlab = "False Positive Rate", 
     ylab = "True Positive Rate",
     main = "False Positive and True Positive Rates \nFor Different Cutoffs")

# Lines connecting the dots
lines(c(0, rates$fp[3:1], 1), c(0,rates$tp[3:1],1), 
     xlim = c(0,1), ylim = c(0,1))

# Additional notes for the figure
abline(coef = c(0,1), col = "black", lwd = 2)

points(x = c(0,1), y = c(0,1),pch = 19, col = "red", cex = 1.5) 
points(x = 0, y = 1, pch = 19, col = "dark green", cex = 1.5)
text(x = rates$fp, y = (rates$tp + .05), labels = rates$cutoff)
text(x = .07, y = .97, labels = "Ideal Model")
text(x = 0, y = .05, labels = 1)
text(x = 1, y = .95, labels = 0)
abline(v = 0, h = 1, lty = 2, lwd = 2)
text(.5, .46, srt = 30, labels = "No Predictive Power")

At one extreme, a cutoff value of 1 (bottom left corner) means we treat everyone as a stayer. This would give us 0 False Positives (because no one is predicted to leave) but also 0 True Positives.

At the other extreme, a cutoff of 0 (upper right corner) means we call everyone a leaver. By categorizing everyone as a leaver, we would have a 100% True Positive rate but also a 100% False Positive rate (because every stayer is predicted to leave regardless of the model output).

The the middle values will give us a feel for the tradeoffs with changing cutoffs.

As we progress through the different cutoff values, we see that we move along a curve. Note that at each point of the curve we have a different False Positive and True Positive rate.

For example, with a cutoff of .5, we would have a False Positive Rate of roughly 18% (x axis) and a True Positive Rate of 42% (y axis). By contrast, a cutoff of .3 gives us a roughly 50% False Positive rate and an 80% True Positive rate.

Ideally, we would like to see our curve shoot up sharply with small changes in the cutoff value. That is, we would like to see a huge increase in catching the True Positives while limiting the number False Positives.

The absolute ideal (if unacheivable) reference point is the green dot which represents perfect prediction: all True Positives and no False Positives.

We want to get our ROC curves close to that dotted line shape when evaluating and developing our models. We also want our chosen cutoff value along that ROC curve to be as close to that green dot as possible.

The diagonal line stretching across the figure represents a model with no predictive power.

The better your model, the closer it gets to that ideal step-like dotted line. The worse your model, the closer it gets to the diagonal.

Logistic Regression ROC Curve

Now that you understand the principles of thresholds and their impact on False Positive and True Positive rate, let’s build complete ROC curves using our models.

The only difference between these complete curves and the simplified example above is that we will test every possible threshold we can using the probability values produced by our model. Those probabilities produced by our models represent the possible cutoff values.

Fortunately, some smart people came up with some functions to make the calculations and plots easy.

I have created the function that uses the basic ROCR set up but added a few other details for this primer.

We’ll start with our m2 model from the logistic regression along with our test dataset and the specific outcomes.

# Function to get the best cutoff point
# from https://www.r-bloggers.com/a-small-introduction-to-the-rocr-package/

opt.cut <- function(perf, pred){
    cut.ind <- mapply(FUN=function(x, y, p){
        d <- (x - 0)^2 + (y-1)^2
        ind <- which(d == min(d))
        c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
            cutoff = p[[ind]])
    }, perf@x.values, perf@y.values, pred@cutoffs)
}

# Function for the ROC Curve: http://rocr.bioinf.mpi-sb.mpg.de/

visROC <- function(model, data, outcome){
    
    # Core three steps from ROCR
    temp_pred <- predict(model, newdata= data, type = "response")
    roc_pred <- prediction(temp_pred, data[,outcome]) #create an s4 ROC object
    roc_perf <- performance(roc_pred, "tpr", "fpr")
    
    # make the plot
    plot(roc_perf, colorize = TRUE, print.cutoffs.at = seq(0,1,.1), 
         main = "ROC Curve", lwd = 2)
    abline(coef = c(0,1), col = "black", lwd = 2)
    
    # get the optimum cutoff
    opt <- opt.cut(roc_perf, roc_pred)
    points(x = 1-opt[2], y = opt[1], pch = 19, col = "red", cex = 1.5)
    text(x = 1-opt[2],  y = opt[1] + .05, labels = "Optimum Cutoff")

    
    # Area Under the Curve
    text(x = .6, y = .3, label = paste("Area Under the Curve:\n", 
                                       round(as.numeric(performance(roc_pred, "auc")@y.values), 2)))
    
    text(x = .6, y = .15, label = paste("Optimum Cutoff:\n",  round(opt[3],3)))
}

# Running the functions
visROC(m2, test, "vol_leave") 

The output should be familiar: False Positive on the x axis, True Positive on the y axis, and a curve capturing the shifts in False Positive and True Positive for differing cutoff values. Note that .7, .3, and .5 values we tested above are in the exact same place, but now we can see values for the whole continuum of predicted probabilities.

The color-coded line tells you what the cutoff values are at each point along the curve.

Interpreting an ROC Curve

There are three basic things to learn from this plot.

  1. The model is definitely better than the no information reference point because it curves substantially above the diagonal line
  2. The best possible cutoff value is .364. That value is the point along the curve that is closest to the ideal (0,1) location in the upper left.

To get the best tradeoff between False Positives and True Positives, we would categorize everyone below .364 as a “0” (stayer) and everyone at or above it as 1 (leaver).

  1. The Area Under the Curve (AUC). The perfect model has an AUC of 1 and a non-predictive model hugging the diagonal line would have an AUC of .5 (because it cuts our space in half).

I am giving the AUC a very limited treatment here but AUC, not accuracy, is generally the favored metric for model evaluation.

Opinions vary on what a “good” AUC is but the following from http://gim.unmc.edu/dxtests/roc3.htm can serve as a broad guideline:

  • 90-1 = excellent
  • 80-.90 = good
  • 70-.80 = fair
  • 60-.70 = poor
  • 50-.60 = fail

Our logistic regression model falls into the fair (read: “Not that bad”) category.

Decision Tree ROC Curve

Let’s see what the ROC curve looks like for our decision tree model. The DT model output is structured slightly differently so I’ve modified the function just a bit and renamed it.

visTreeROC <- function(model, data, outcome){

    # Core three steps from ROCR
    temp_pred <- predict(model, newdata= data, type = "prob")
    roc_pred <- prediction(temp_pred[,2], data[,outcome]) #create an s4 ROC object
    roc_perf <- performance(roc_pred, "tpr", "fpr")
    
    # make the plot
    plot(roc_perf, colorize = TRUE, print.cutoffs.at = seq(0,1,.1), 
         main = "ROC Curve", lwd = 2)
    abline(coef = c(0,1), col = "black", lwd = 2)
    
    # get the optimum cutoff
    opt <- opt.cut(roc_perf, roc_pred)
    points(x = 1-opt[2], y = opt[1], pch = 19, col = "red", cex = 1.5)
    
    # Area Under the Curve
    text(x = .6, y = .3, label = paste("Area Under the Curve:\n", 
                                       round(as.numeric(performance(roc_pred, "auc")@y.values), 2)))

    text(x = .6,  y = .15, labels = paste("Optimum Cutoff\n", round(opt[3],3)))
}

# Run the function
visTreeROC(fit, test, "vol_leave") 

An AUC of .63!?! How can our decision tree be so horrible?

The clue is in the figure itself. The ROC curve for the decision tree is not remotely smooth like the logistic regression ROC curve. Instead it has just has a few sharp cuts and that’s it.

Why is it not a smooth curve?

Remember that we are plotting False Positive v. True Positive for each of the possible thresholds produced from our model predictions. The key difference here is that the logistic regression produced a broad continuum of predicted values that we could use as a threshold.

Our tree model, on the other hand, only produced 4 probability values, one for each of the possible buckets in our final model.

If we look at our pretty tree plot again, we see employees can only go into four possible buckets; their individual probabilities of leaving are set by the probability of leaving for their specific bucket. There are only 4 buckets so we only have 4 possible cutoffs to choose from.

# Our pretty tree model once more
par(mar = c(5,4,1,2)) #setting the margins
fancyRpartPlot(fit, sub = NULL, main = "Final Decision Tree")

# The unique probability values from our test set.
table(round(predict(fit, test, type = "prob")[,2], 3))
## 
## 0.298 0.397 0.576 0.668 
##  2547   295   224   634

Accordingly, looking at our predicted outcomes for our test set, there are only four unique predicted probability values. These values match those for the buckets in the figure.

# The unique probability values from our test set.
table(round(predict(fit, test, type = "prob")[,2], 3))
## 
## 0.298 0.397 0.576 0.668 
##  2547   295   224   634

This turns out to be reflected in our colorful ROC plot too, with only 4 distinct threshold colors represented.

Putting it all togehter see that while overall accuracy of our two models were similar, the ROC curve analysis revealed a clear and strong difference. Our logistic regression model is better.

This will not always be the case of course, but the example highlights the challenges present in evaluating even fairly basic models.

Final Thoughts

Predicting employee turnover is a hot topic and will likely be for some time. The intent of this primer was to outline the critical phases of developing employee turnover model and give you the R code you need to get started.

We’ve covered a ton here but this just the first step. Copy this code, play with it, expand on it, get a feel for your own data, and above all else, keep learning.

Google the things you don’t know, wiki everything, read as much as you can, talk to the data science types at your company (and of course leave comments or just email me with questions: john@hranalytics101.com).

In addition, be thoughtful throughout the entire process, think about your company, your co-workers, and the processes of individual and collective decision making.

Finally, recognize that modeling is not magic. It’s always hard work but the payoff for you and your organization can be realized sooner than you think.

Resources

You can’t swing your arms without hitting a predictive modeling post…but here are few resources that I think you will find particularly helpful.

Like this post?

Get our FREE Turnover Mini eCourse!

You’ll get 5 insight-rich daily lessons delivered right to your inbox.

In this series you’ll discover:

  • How to calculate this critical HR metric
  • How turnover can actually be a GOOD thing for your organization
  • How to develop your own LEADING INDICATORS
  • Other insightful workforce metrics to use today

There’s a bunch more too. All free. All digestible. Right to your inbox.

Yes! Sign Me Up!



Comments or Questions?

Add your comments OR just send me an email: john@hranalytics101.com

I would be happy to answer them!

Leave a Reply

Your email address will not be published.