Tutorial: How to Assess Model Accuracy

You have just completed a model that predicts with 70% accuracy whether someone will or will not apply for a job internally in the next 6 months. Is that any good?

Figuring out whether such two-class predictive models are “any good” is a key question. Unfortunately, the answer is not always clear. This can be particularly frustrating for those just getting started with predictive analytics in HR.

In today’s tutorial, we’ll go step by step to take some of the mystery (and pain) out of interpreting model accuracy for two-class prediction models.

 

What You Will Learn

  • How to create and use a confusion matrix
  • How to use the No Information Rate as a reference point
  • How to create and use an ROC Curve
  • How to create and use a Lift Chart

Setting the Stage

To keep things simple, we’ll focus on the two-class prediction problem (e.g. apply v. not apply). For many HR analytics questions, this is all you’ll need.

Note that large parts of this discussion are derived from a previous extended tutorial on predictive modeling for employee turnover but have been pulled out and simplified just a bit to focus specifically on the issue of model accuracy. In addition, this post introduces you to lift charts, something that was not covered in the extended tutorial. If you are interested in learning about model development itself, you will want to check out that extended tutorial.

As with just about everything on HRanalytics101, we’ll use R throughout. Not an R user? No worries! You will still be able to pick up the key concepts and successfully apply them with your software of choice.

The Data

Let’s suppose we are trying to predict whether someone will or will not apply for another internal job within the next 6 months.

Knowing which employees are likely to apply for other internal roles can help us spot potential career development needs/frustrations, anticipate future external hiring needs, help with internal sourcing, and predict employee movement dynamics.

We’ll start with some simulated model output and outcomes so that everyone can play along at home by running the scripts. You can also just click here to download a csv version: model_accuracy_dataset.

Note on the libraries: For those just getting started with R, there are several packages that you may need to install here. To do that, you can either use the RStudio installer button and start typing the name of the package you need, or you can run an individual line of code for each package in the following format: install.packages("package_name"). I have included those individual lines in the code below. Remove the ‘#’ to run a line if you have never installed that package before. Once the packages are installed, you can use the “library” statements to load them and access their functions.

# Note: Only need to install packages one time
# install.packages("caret")
# install.packages("e1071")
# install.packages("pROC")
# install.packages("lift")
library(caret) # for confusion matrix function
library(e1071)
library(pROC)
library(lift)
## Loading required package: lattice
## Loading required package: ggplot2

set.seed(42)

app <- sample(c("no","yes"), size = 1000, replace = T, prob = c(.7, .3))

#setting up empty vector for simulated model 1 results
m1 <-  numeric(1000)

# Filling in model 1 results
for (i in seq_along(app)){
    if(app[i] == "yes"){
        m1[i] <- rnorm(n = 1, mean = .6, sd = .15)
    }else{
        m1[i] <- rnorm(n = 1, mean = .45, sd = .15)
    }
}

m1 <- ifelse(m1 > 1, .99, ifelse(m1 < 0 , .01, m1))

#setting up empty vector for simulated model 2 results
m2 <- numeric(1000)

# Filling in model 2 results
for (i in seq_along(app)){
    if(app[i] == "yes"){
        m2[i] <- rnorm(n = 1, mean = .6, sd = .1)
    }else{
        m2[i] <- rnorm(n = 1, mean = .4, sd = .1)
    }
}

app <- factor(app, levels = c("yes", "no")) #set the level order

m2 <- ifelse(m2 > 1, .99, ifelse(m2 < 0 , .01, m2))

d <- data.frame(emp = 1:1000, m1_pred = m1, m2_pred = m2, app)

table(d$app) # quick check of the data
## 
## yes  no 
## 293 707

Our dataset has four variables: an employee number (emp), predictions from our two “models” (m1, m2), and whether or not each employee applied for an internal role (app). We won’t need the employee number in this example today, but it’s a good habit to include such unique identifiers in HR analytics work.

You should interpret the model outputs from m1 and m2 as the predicted probability of applying for an internal role. This kind of output is consistent with that typically generated from all kinds of categorical predictive models such as logistic regression, decision trees, or neural networks.

Remember that in the real world, the methods shown here should mostly be applied to your “test” set data rather than the data used to develop the models themselves. This helps us avoid overstating the quality of our models.

We’re focusing just on interpreting model output and accuracy so we’ll skip over the topic of data splitting today, but those new to modeling and interpretation should be aware of this critical distinction and the many ways this issue is handled.
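Although we’re skipping data splitting here, a minimal sketch in base R may help those new to the idea. This is an illustration only (not part of the original analysis), and `df` is a made-up stand-in for your modeling dataset:

```r
# Illustrative only: a simple 80/20 train/test split in base R.
# 'df' is a hypothetical stand-in for your modeling dataset.
set.seed(42)
df <- data.frame(y = sample(c("yes", "no"), 100, replace = TRUE),
                 x = rnorm(100))
train_idx <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[train_idx, ]   # used to fit the model
test  <- df[-train_idx, ]  # held out for accuracy assessment
```

For class outcomes, a stratified split (e.g. via caret’s createDataPartition function) is often preferable because it preserves the class proportions in both sets.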

Confusion Matrix and the No Information Rate

The confusion matrix is perhaps the most fundamental tool in assessing a two-category prediction model. The purpose of the confusion matrix is to compare the predictions from your model with the known outcomes.

To create a confusion matrix, you first need to select a probability threshold: values exceeding the threshold are counted as an event (here, applying for an internal role) and the rest as a non-event (not applying for a role).

In this first example, we’ll set the cutoff to .5. This means that anyone with a predicted probability above .5 will be predicted to apply for an internal role.

Let’s create our first confusion matrix using a simple table and some margin totals.

Please note that I always recommend doing a very basic confusion matrix by hand first to be sure you understand your data. This includes the order of your factor levels. Carrying out some of these basic operations by hand will save you time and trouble later because you will be able to identify and correct any issues arising from the automated functions provided by specialized R packages.

#set the threshold
thresh <- 0.5

# predictions using that threshold
# note explicit ordering of levels
out <- factor(ifelse(d$m1_pred > thresh, "yes", "no"), levels = c("yes", "no"))

#basic confusion matrix
addmargins(table(out, d$app)) 
##      
## out    yes   no  Sum
##   yes  221  255  476
##   no    72  452  524
##   Sum  293  707 1000
prop.table(table(d$app)) # no information rate
## 
##   yes    no 
## 0.293 0.707

The rows represent the model predictions using the .5 threshold and the columns represent the real known outcomes.

Let’s first measure our model’s accuracy. To do this, we’ll sum up the two values on the diagonal. These represent the places where the model and the result are the same (“yes” with “yes” and “no” with “no”).

Then we’ll divide the result by the total number of observations (here, 1000). Doing this, we get 673/1000 or an accuracy of 67.3%. At first glance, this might not seem bad but we are overlooking something critical: the No Information Rate.
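That arithmetic is easy to verify directly; the counts below are hard-coded from the table above:

```r
# Accuracy from the confusion matrix: diagonal sum over the total.
# Counts copied from the m1 table above (columns are the actual outcomes).
cm <- matrix(c(221, 72, 255, 452), nrow = 2,
             dimnames = list(pred = c("yes", "no"), actual = c("yes", "no")))
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
## [1] 0.673
```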

The No Information Rate

The No Information Rate is your best guess given no information beyond the overall distribution of the classes you are trying to predict. In this case, we know from our table that most of our employees (70.7%) did not apply for an internal role. So our best guess with no other information is to pick the majority class.

If we just pick the majority class, we will be correct about 70% of the time. With the No Information Rate in mind, we now see that our model’s accuracy of 67.3% is downright lousy. Our predictions are actually mildly worse using this model (with a .5 threshold anyway…more on this later) than just sticking with the majority class as our best guess.

As an aside, remember the No Information Rate when hearing about predictive analytics products (internal or external) that cite high accuracy as a selling point. A model with a 90% predictive accuracy sounds great but at the very least, you need to know the No Information Rate to know whether the touted model is actually doing anything useful for the particular outcome it claims to predict. If the majority class represents 90% of the group, that model with 90% accuracy does nothing for you.
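In code, the No Information Rate is just the largest class proportion:

```r
# No Information Rate: the proportion of the majority class.
# Counts taken from table(d$app) above.
app_counts <- c(yes = 293, no = 707)
nir <- max(app_counts) / sum(app_counts)
nir
## [1] 0.707
```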

The Confusion Matrix Function

Now that you know how to create a confusion matrix by hand and understand the No Information Rate, let’s create a confusion matrix using the confusionMatrix function from the caret package. We’ll stick with the .5 threshold for now.

thresh <- 0.5
out <- factor(ifelse(d$m1_pred > thresh, "yes", "no"), levels = c("yes", "no"))
confusionMatrix(out, d$app)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes 221 255
##        no   72 452
##                                          
##                Accuracy : 0.673          
##                  95% CI : (0.6429, 0.702)
##     No Information Rate : 0.707          
##     P-Value [Acc > NIR] : 0.9912         
##                                          
##                   Kappa : 0.3327         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7543         
##             Specificity : 0.6393         
##          Pos Pred Value : 0.4643         
##          Neg Pred Value : 0.8626         
##              Prevalence : 0.2930         
##          Detection Rate : 0.2210         
##    Detection Prevalence : 0.4760         
##       Balanced Accuracy : 0.6968         
##                                          
##        'Positive' Class : yes            
## 

Let’s just focus on a few aspects of this rich output. First, notice that the table here is essentially what we created when we did things by hand. Second, the Accuracy rate for our model is precisely what we calculated earlier. Finally, the No Information Rate matches what we calculated earlier.

Be sure to check these values first to be sure the inputs and outputs are appropriately structured. Too many people skip these basic quality assurance sorts of steps but skipping QA is always a bad idea.

Now let’s use this same function to see how well our second model does. We’ll stick with the .5 threshold for now.

thresh <- 0.5
out <- factor(ifelse(d$m2_pred > thresh, "yes", "no"), levels = c("yes", "no"))
confusionMatrix(out, d$app)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes 252 113
##        no   41 594
##                                           
##                Accuracy : 0.846           
##                  95% CI : (0.8221, 0.8678)
##     No Information Rate : 0.707           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6532          
##  Mcnemar's Test P-Value : 1.057e-08       
##                                           
##             Sensitivity : 0.8601          
##             Specificity : 0.8402          
##          Pos Pred Value : 0.6904          
##          Neg Pred Value : 0.9354          
##              Prevalence : 0.2930          
##          Detection Rate : 0.2520          
##    Detection Prevalence : 0.3650          
##       Balanced Accuracy : 0.8501          
##                                           
##        'Positive' Class : yes             
## 

Summing our table diagonals (252 + 594) and dividing by 1000 gives us an accuracy of .846 (84.6%), consistent with the output. This definitely beats the old model and, most importantly, the No Information Rate of 70.7%. If we stopped our analyses right here, we would at least have a somewhat useful model.

True Positives, False Positives, and the ROC curve

At this point, you are probably wondering about that threshold value. Starting with .5 makes intuitive sense, but can’t we do better than simple intuition? That answer is a resounding “Yes!” and to do this we’ll move to the ROC curve.

The key idea underlying the ROC is the tradeoff between True Positives and False Positives.

The True Positive rate is the percentage of True events (here, applying for an internal role) that our model successfully predicts. Let’s take a look at our results for m2 again.

thresh <- 0.5
out <- factor(ifelse(d$m2_pred > thresh, "yes", "no"), levels = c("yes", "no"))
confusionMatrix(out, d$app)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes 252 113
##        no   41 594
##                                           
##                Accuracy : 0.846           
##                  95% CI : (0.8221, 0.8678)
##     No Information Rate : 0.707           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6532          
##  Mcnemar's Test P-Value : 1.057e-08       
##                                           
##             Sensitivity : 0.8601          
##             Specificity : 0.8402          
##          Pos Pred Value : 0.6904          
##          Neg Pred Value : 0.9354          
##              Prevalence : 0.2930          
##          Detection Rate : 0.2520          
##    Detection Prevalence : 0.3650          
##       Balanced Accuracy : 0.8501          
##                                           
##        'Positive' Class : yes             
## 

Looking down that first column, we have a total of 293 (252 + 41) observed application events. Our model successfully predicted 252 of these 293 events, giving us a True Positive rate of 252/293 or .86. In statistical jargon, the True Positive rate is also referred to as “Sensitivity”. This Sensitivity measure is provided in the output from the confusionMatrix function.

It’s great to catch those events, but we also predicted that some people would apply for an internal role when in fact they did not.

We therefore need to also look at the False Positive rate when assessing the quality of our model. As the name suggests, the False Positive rate is the proportion of “no” individuals who were incorrectly predicted to be a “yes”.

To get this information, we’ll look at the second column and observe that 707 people (113 + 594) did not apply for an internal role. Of these, 113 were incorrectly labeled as “yes”. Thus, we have a False Positive rate of 113/707 or 0.16.

The False Positive Rate (FPR) is captured in our function output by a measure called “Specificity”. Specificity is just 1-FPR. Thus, our False Positive Rate of .16 gives us a Specificity of 1-.16, or .84.
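Those rates follow directly from the column totals; again with counts copied from the m2 table above:

```r
# True Positive and False Positive rates from the m2 confusion matrix.
tp <- 252; fn <- 41    # actual "yes" column
fp <- 113; tn <- 594   # actual "no" column
tpr  <- tp / (tp + fn) # Sensitivity: 252/293
fpr  <- fp / (fp + tn) # 113/707
spec <- 1 - fpr        # Specificity
round(c(tpr = tpr, fpr = fpr, spec = spec), 2)
##  tpr  fpr spec 
## 0.86 0.16 0.84
```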

Lessons from Extreme Thresholds

To develop our intuitions on the impact of threshold and its relationship to True Positives and False Positives, let’s fold in the extreme cases where the threshold is 0 or 1 along with our .5 value. We would never use a 0 or 1 for our threshold in practice but it’s an instructive exercise.

We’ll start with a threshold of 0. With a threshold of 0, we are predicting “yes” for everyone.

thresh <- 0
out <- factor(ifelse(d$m2_pred > thresh, "yes", "no"), levels = c("yes", "no"))
confusionMatrix(out, d$app)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes 293 707
##        no    0   0
##                                           
##                Accuracy : 0.293           
##                  95% CI : (0.2649, 0.3223)
##     No Information Rate : 0.707           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.000           
##             Specificity : 0.000           
##          Pos Pred Value : 0.293           
##          Neg Pred Value :   NaN           
##              Prevalence : 0.293           
##          Detection Rate : 0.293           
##    Detection Prevalence : 1.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : yes             
## 

This gives us a True Positive rate of 1. Of course, we are getting a False Positive rate of 1 too (and a Specificity of 0) because we also labeled all non-appliers with a “yes”. In essence, we have chosen a “minority class model” (versus the no-information “majority class model”); everyone is predicted to be a “yes” even though the “yes” group is not the majority class in this instance.

In the middle of these extreme thresholds, we have our .5 threshold. As we saw earlier, this gave us a True Positive (Sensitivity) rate of .86 (86%) and a False Positive Rate of .16 (16%). Here, we see some decrease in the True Positive rate (from 1 to .86) but our False Positive rate plummets from 1 to .16.

Finally, we’ll go the other direction now and see what happens with a threshold of 1. This corresponds to a “majority class” model because we are predicting that no one will apply for an internal role.

thresh <- 1
out <- factor(ifelse(d$m2_pred > thresh, "yes", "no"), levels = c("yes", "no"))
confusionMatrix(out, d$app)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes   0   0
##        no  293 707
##                                           
##                Accuracy : 0.707           
##                  95% CI : (0.6777, 0.7351)
##     No Information Rate : 0.707           
##     P-Value [Acc > NIR] : 0.5158          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000           
##             Specificity : 1.000           
##          Pos Pred Value :   NaN           
##          Neg Pred Value : 0.707           
##              Prevalence : 0.293           
##          Detection Rate : 0.000           
##    Detection Prevalence : 0.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : yes             
## 

Predicting “no” for everyone yields a False Positive rate of 0 (and a Specificity of 1). It also gives us a True Positive rate of 0 (because we are predicting “no” for everyone). This is all bad.

Visualizing the Thresholds

We can summarize these shifts with a very basic plot for the three selected thresholds (1, .5, and 0) and their corresponding False Positive and True Positive values.

fp <- c(0, .16, 1)
tp <- c(0,.86, 1)

plot(fp, tp, pch = 19, col = "red", xlab = "False Positive Rate", 
     ylab = "True Positive Rate")
lines(fp,tp, col = "red")
xadj <- c(.02, 0, -.04)
yadj <- c(0.02,.03, -.05)
text(x = fp + xadj, y = tp + yadj, 
     labels = c("Threshold\n1", "Threshold\n.5", "Threshold\n0"))
abline(v = 0, lty = 2)
abline(h = 1, lty = 2)
text(.06, .97, labels = "Ideal Model")
points(0,1, pch = "O", cex = 1.5)

Starting in the lower left corner with a threshold of 1, we see we have 0 False Positives but also 0 True Positives. With a threshold of .5, we see that slight increase in False Positives but a dramatic increase in True Positives. This is exactly the kind of tradeoff one might hope to find when modeling: a big predictive gain with a limited loss on the other end.

Finally, with our threshold at 0, we see a mild increase in True Positives but a huge increase in False Positives.

The ideal (if unattainable) fit is in the upper left corner. As curves for a given model approach that upper left corner, they gain in True Positive detection while minimizing False Positives.

What have we learned? As we shift our threshold, we see tradeoffs between the True Positives and False Positives. Increasing one typically means increasing the other to some degree.

The trick then is to find the threshold value that best balances these outcomes. The figure that we built above to illustrate these tradeoffs turns out to be a crude ROC curve. This is essentially what we will use to find our best threshold and thus move beyond simple intuition when choosing it.

The ROC Curve

As I have noted in the extended primer on building predictive models for turnover, ROC stands for “Receiver Operating Characteristic” and was first used in WWII to aid radar detection processes. It is still used today in human cognition research, radiology, epidemiology, and predictive models of all sorts.

ROC curves help us evaluate binary classifiers by showing us how different cutoff (threshold) values impact the quality of our model.

In the previous figure, we used just three threshold values to create a very rough ROC curve. The only difference between this crude version and the full-blown ROCs below is the number of possible cutoffs tested.

Using the roc function from the pROC package, we’ll build a full ROC curve by testing the True Positive and False Positive values at each possible threshold generated by our model output probabilities.

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_curve   <- roc(response = d$app, predictor = d$m2_pred)
plot(roc_curve, print.thres = "best")
## 
## Call:
## roc.default(response = d$app, predictor = d$m2_pred)
## 
## Data: d$m2_pred in 293 controls (d$app yes) > 707 cases (d$app no).
## Area under the curve: 0.9268
abline(v = 1, lty = 2)
abline(h = 1, lty = 2)
text(.90, .97, labels = "Ideal Model")
points(1,1, pch = "O", cex = 1.5)

Specificity (1-FPR) is on the X-axis and the True Positive rate (Sensitivity) is on the y axis. The pROC result shows us that the point along the curve that gets us closest to the ideal model in the upper left corner has a threshold of .5. Therefore, we should choose .5 as our preferred threshold.

The respective Specificity and Sensitivity values are also conveniently provided by the pROC function, along with a diagonal reference line representing a model that has no predictive power.

Please note that the match between the actual best threshold here and our simple starting point of .5 earlier is just a coincidence; other models and prediction patterns will generate different ideal thresholds.

AUC: Area Under the Curve

Finally, it is important to note that the pROC function also generates a measure known as the Area Under the Curve (AUC). The maximum AUC possible is 1 (i.e. a model whose ROC curve hugs the dotted lines into that ideal corner). The result of .9268 from our current “model” is therefore extremely good.
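As a sanity check on what AUC means (an illustration on freshly simulated data, not part of the original analysis): AUC equals the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, which we can compute directly in base R.

```r
# AUC as a concordance probability: the chance a random "yes"
# outscores a random "no" (ties count half). Simulated data only.
set.seed(7)
y <- sample(c("yes", "no"), 1000, replace = TRUE, prob = c(.3, .7))
p <- ifelse(y == "yes", rnorm(1000, .6, .1), rnorm(1000, .4, .1))
pos <- p[y == "yes"]; neg <- p[y == "no"]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # roughly .92 for data simulated like m2
```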

Now let’s take a look at our other model, m1.

roc_curve   <- roc(response = d$app, predictor = d$m1_pred)
plot(roc_curve, print.thres = "best")
## 
## Call:
## roc.default(response = d$app, predictor = d$m1_pred)
## 
## Data: d$m1_pred in 293 controls (d$app yes) > 707 cases (d$app no).
## Area under the curve: 0.7726
abline(v = 1, lty = 2)
abline(h = 1, lty = 2)
text(.90, .97, labels = "Ideal Model")
points(1,1, pch = "O", cex = 1.5)

There are a couple of things to notice here. First, in comparison to the m2 ROC result, this curve is further away from that upper left ideal although it is still healthily above the diagonal line.

Second, we see a corresponding drop in the AUC from .93 to .77. In the real world, this result might not be too shabby but it’s a far cry from our nearly (and improbably) perfect m2 result. Finally, note the ideal threshold of .55.

Summary Thoughts on the ROC

In sum, when evaluating models for two-class (binary) categorical prediction, we want models that deviate from the diagonal line, approach the ideal dotted line, and have a correspondingly improved AUC. Remember though, that simply choosing the model with the highest overall AUC might not always be the right move. For example, one might wish to concentrate on just certain portions of the curve where performance improvement is most rapid.

In other cases, we may use the probabilities themselves to determine whether we need further testing/investigation (as in medical testing or fraud detection), with the understanding that hits, misses, correct rejections, and false alarms are all associated with different costs and benefits. ROC curves can be very helpful, but appropriate application and interpretation is situation dependent.

The Lift Chart

The final consideration for assessing our two-class model is the Lift Chart. The “lift” of a model is the number of events that our model detects in a sample above that from a completely random selection of samples (p. 265, Kuhn and Johnson, 2013).

Lift: The Basics

Let’s calculate a small lift chart by hand to make that definition a tad less cryptic. We’ll focus first on actual counts instead of percentages in the example.

  1. To simplify this example, we’ll start with a random sample of 40 rows from our data. Looking at just 40 data points makes it easier to see what is going on. Normally, we would use all of the predictions for the full version of a lift measure, but the smaller dataset is more manageable here.

Using the table function, we see that we have a total of 16 possible “yes” in our sample of 40.

set.seed(42) # setting random seed for replication
sam <- d[sample(nrow(d), 40), c("m1_pred", "app")]
table(sam$app)
## 
## yes  no 
##  16  24
  2. Next, we sort the data by the predicted probability from one of our models, in decreasing order. Let’s go with the m1 predictions in this example.
ind <- order(sam$m1_pred, decreasing = T) #index for sorting
sam <- sam[ind,]
print(sam)
##       m1_pred app
## 134 0.7758364 yes
## 652 0.7594468 yes
## 712 0.7357596 yes
## 812 0.7148952 yes
## 467 0.7058970  no
## 640 0.6886482 yes
## 733 0.6786252 yes
## 699 0.6506757 yes
## 435 0.6269487 yes
## 200 0.6250761 yes
## 4   0.6204014 yes
## 828 0.6145043  no
## 926 0.6140090 yes
## 882 0.5945202 yes
## 116 0.5758004  no
## 716 0.5634722 yes
## 915 0.5600952 yes
## 456 0.5293092  no
## 502 0.5286183  no
## 804 0.5263179  no
## 550 0.5098102  no
## 937 0.4667987  no
## 588 0.4590034 yes
## 81  0.4580357  no
## 663 0.4562959  no
## 963 0.4540828  no
## 376 0.4539909 yes
## 786 0.4456876  no
## 924 0.4412594  no
## 968 0.4113695  no
## 253 0.3988950  no
## 286 0.3617853  no
## 136 0.3507013  no
## 886 0.3448995  no
## 381 0.3149862  no
## 925 0.3054867  no
## 8   0.3049566  no
## 454 0.2771569  no
## 517 0.2658731  no
## 873 0.2099095  no

Take a peek at our sorted data for a second. Notice that most of our “yes” instances occur at the top of the table, where we have higher predicted probabilities. By contrast, we have few “yes” outcomes at the bottom. This is our first clue that there is indeed some relationship between the m1 predictions and actual “yes” events in our data.

  3. Create another column that holds a running, cumulative sum of the “yes” events and then plot it.
sam$tot <- cumsum(sam$app == "yes") #cumulative sum

plot(sam$tot, type = "l", lwd = 2, col = "blue", xaxt = "n", yaxt = "n",
     main = "Sorted Cumulative Total \"Yes\" Events", 
     xlab = "Number of Observations", ylab = "Total \"Yes\" Events Found")
segments(17,0, 17, 14)
segments(0,14, 17, 14)
axis(1, 1:40, 1:40)
axis(2, 1:16, 1:16)

What do you notice? Even though we have a total of 40 observations in our sample, 14 of the 16 possible “yes” events arise in the first 17 observations of our sorted list. Most of the action is occurring at the top of our list where the predicted probabilities are highest.

In practical HR Analytics terms, if you wanted to focus your developmental energies on those likely to apply for an internal role (and you were looking at all of your data), you would therefore want to focus on those individuals at the top of our list.

  4. Compare our sorted cumulative total to the cumulative total expected by chance.

By comparing our cumulative total of observed “yes” events sorted by probability v. the cumulative total “yes” events we would expect by pure chance (that is, without sorting by predicted probability), we can see how much “lift” our model gives us over and above what we would expect to observe by chance.

In particular, our no-information (i.e. chance) model should follow a 45 degree line. In this case, we have 16/40 or .4 “yes” events per observation. With a base rate of 40% in our current sample, we would therefore expect 4 “yes” events in our first 10 observations, another 4 in our second set of 10 observations, and so on. If we plot our model results against what the base rate would lead us to expect, we can visualize the predictive “lift” we get from using our model.

plot(sam$tot, type = "l", lwd = 2, col = "blue", xaxt = "n", yaxt = "n",
     main = "Sorted Cumulative Total \"Yes\" Events v. Chance", ylim = c(0,16),
     xlab = "Observations", ylab = "Total \"Yes\" Events")
axis(1, 1:40, 1:40)
axis(2, 1:16, 1:16)
abline(a = 0, b = .4, lwd = 2, col = "gray")

Lift: Using Percentages

Instead of counts, let’s now look at “lift” in terms of percentages, as is often done. The x-axis now represents the percentage of the sample tested and the y-axis the percentage of possible “yes” events observed. The only things really changing here are the axis labels.

plot(sam$tot, type = "l", lwd = 2, col = "blue", xaxt = "n", yaxt = "n",
     main = "Sorted Cumulative Total \"Yes\" Events v. Chance", ylim = c(0,16),
     xlab = "% Observations", ylab = "% \"Yes\" Events")

xlabels <- paste(seq(0, 40, by = 10)/40*100, "%", sep = "")
axis(1, seq(0, 40, by = 10), xlabels)

ylabels <- paste(seq(0, 16,2)/16*100, "%", sep = "")
axis(2, seq(0, 16,2), ylabels)
abline(a = 0, b = .4, lwd = 2, col = "gray")

Congratulations! You now understand the basic concepts underlying a lift chart.

But where is the “lifty” part?

Calculating the “Lifty” Part

Just looking at our sorted table, we can see that a set of those appearing at the top of the list are much more likely to have a “yes” than a random sample of people where the base rate is 40%.

For example, if we look at just the first 10 people in the sorted list, we get 9 “yes” events v. an expected value of 4 given the base rate of 40%.

To calculate our lift, we just divide what we get from a given bucket (here, the first 10 people) by what we would expect by chance. This gives us a lift of 9/4 = 2.25. Said differently, the event rate for our top 10 is 2.25 times our base rate.

#converting our yes/ no to 1s and 0s for easy summing
sam$app_0_1 <- ifelse(sam$app == "no", 0, 1) 

# dividing yes events in top 10 by the expected number of 4 (given 40% base rate)
sum(sam[1:10, "app_0_1"])/4
## [1] 2.25

If we look at our second bucket of 10, we get 5 “yes” versus 4 expected by chance. This gives us a lift of 5/4 = 1.25 for that second bucket.

We can do the same for any number of buckets we care to, with larger samples affording a greater number of buckets.

# sum of second bucket divided by the expected number of 4
sum(sam[11:20, "app_0_1"])/4
## [1] 1.25

The Lift Package

Now that you have plowed through it by hand, let’s make things easier from here on out by using the plotLift function from the aptly named “lift” package in R. Note that your labels need to be 0s and 1s. We’ll use our sample of 40 people for the purposes of instruction, but remember that we would normally use all of our test sample data and the corresponding predictions, not a sample as we have done here.

# install.packages("lift") # installing the package
library(lift)

plotLift(predicted = sam$m1_pred, labels = sam$app_0_1, cumulative = FALSE, n.buckets = 4, 
         ylim = c(0,3), main = "Lift Chart: Non-Cumulative")

#Confirming that we get the same bucket 1 and 2 values as I showed earlier

points(1,2.25, col = "red", pch = 19)
text(1.2, 2.4, labels = "Bucket 1 \n Lift")
points(2,1.25, col = "red", pch = 19)
text(2, 1.5, labels = "Bucket 2 \n Lift")

In the above, we used a non-cumulative lift measure (where each bucket is independent of the others).

However, we can also use a cumulative lift measure by changing the argument value below. The first bucket in the cumulative version has the same value as before (9/4 = 2.25). The second bucket, though, represents the cumulative lift over the first two buckets together (9 + 5 = 14 events). Dividing this by 8 (20 people at a base rate of 40% = 8 expected events) gives us a sort of running measure of collective lift.

plotLift(predicted = sam$m1_pred, labels = sam$app_0_1, cumulative = TRUE, n.buckets = 4, 
         ylim = c(0,3), main = "Lift Chart: Cumulative")
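We can confirm that second cumulative bucket by hand using the counts described above:

```r
# Cumulative lift for the first two buckets (counts from the worked example).
yes_first20 <- 9 + 5     # "yes" events in the first 20 sorted observations
expected    <- 20 * 0.4  # 8 expected at the 40% base rate
lift2 <- yes_first20 / expected
lift2
## [1] 1.75
```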

Final Thoughts

Well, we covered a ton of territory…frankly much more than I intended at the start. But when it comes to understanding whether our binary classifier model is any good, it pays to dig deep.

When thinking about model accuracy, start with the core methods outlined here. Make a confusion matrix and think about the costs and benefits of false positives and true positives. Make an ROC curve, gauge the impact of differing thresholds, and then compare that to the decision making and business contexts. Finally, consider adding the lift curve to your arsenal.

Ultimately, our job is to make better people decisions through data. If we begin with this end in mind, we’ll understand our organizations better and increase the odds that our efforts and resources are directed where they ought to be.

References

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (pp. 265). New York: Springer.




Comments or Questions?

Add your comments OR just send me an email: john@hranalytics101.com

I would be happy to answer them!
