When Models Lie: Why Plotting Your Data Is Super-Duper Critical

 

In HR analytics, a plot is worth a thousand models.

Today’s post is based on the work of F.J. Anscombe and it screams one simple thing: always plot your data first.

The Data

We’ll start by loading the Anscombe dataset that comes with R.

# library(help = "datasets") # for a complete list of other datasets available

data(anscombe)

The Models All Say the Same Thing…

Next, we’ll do the very thing that you should NEVER do: run models without first plotting your data.

For this, we’ll just use the handy code included in the help for anscombe. I’ve pasted it below with a few slight modifications. Note: If you find this R code hard to understand, just skip to the end of this post for code that is much easier to follow if you are just starting out in R.

# help(anscombe) # for the basic code I used here. 

# looping through and creating a set of regression models
# I have provided simpler code at the end if you are new to r
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(anova(lmi))
  print(paste("r squared:", round(summary(lmi)$r.squared, 3)))
}
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.667"
## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.666"
## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.666"
## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.667"

If you are familiar with regression, you’ll notice that the results across our four linear regression models are all virtually identical.

If you are not, then just observe that all the F-values and p-values are essentially the same and they all uniformly signal that there is a statistically significant relationship between x and y. Moreover, that R-squared value you see tells us that our x values are accounting for 2/3 of the differences in y. That’s REALLY good.

The intercepts and coefficients are also the same. This means that the regression lines themselves are absolutely 100% the same.

More evidence that there is nothing to see here.

## See how close they are (numerically!)
sapply(mods, coef)
##                   lm1      lm2       lm3       lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
lapply(mods, function(fm) coef(summary(fm)))
## $lm1
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
## 
## $lm2
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
## 
## $lm3
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
## 
## $lm4
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602

By the normal, quick and dirty modeling approaches often used in HR Analytics and elsewhere, we have four more or less identical linear regression models with significant predictors and high R-squareds.

…But a Plot is Worth a Thousand Models

Lo and behold! When we actually PLOT the data, we see an entirely different picture.

## Now, do what you should have done in the first place: PLOTS

par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)

If we just did our models and trusted the numbers by themselves, we would’ve walked away feeling pretty good.

But while the model results capture the essence of the first plot, they fail to communicate the real story for the others.

Whether it’s the curvilinear relationship in plot 2, the outlier in plot 3, or the restricted values in our X variable in plot 4, the pictures tell the true tale.

Summary

There is no reason to belabor the point. Whether you are just reporting basic means for monthly turnover, producing some bar graphs summarizing employmee movement, or using models to predict future talent sources, it’s CRITICAL that you plot your data first.

Appendix: Simpler Code Modeling and Plotting

Sometimes R people like to make compact code that is elegant but harder to understand. The code below does the same job as that above but just piece by piece. This is always the best way to learn R or any other language for that matter. You can always get more complicated later (although I usually don’t).

data(anscombe) # loading data 
a <- anscombe #putting into a for easier reference
names(a) # just so you can see the names
## [1] "x1" "x2" "x3" "x4" "y1" "y2" "y3" "y4"
# Four Separate Models and Summaries
mod1 <- lm(y1 ~ x1, data = a)
summary(mod1)
## 
## Call:
## lm(formula = y1 ~ x1, data = a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
mod2 <- lm(y2 ~ x2, data = a)
summary(mod2)
## 
## Call:
## lm(formula = y2 ~ x2, data = a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
mod3 <- lm(y3 ~ x3, data = a)
summary(mod1)
## 
## Call:
## lm(formula = y1 ~ x1, data = a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
mod4 <- lm(y4 ~ x4, data = a)
summary(mod1)
## 
## Call:
## lm(formula = y1 ~ x1, data = a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
# Four Separate Plots
# setting up 4 plots on a single screen and the margins
par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))

# Plot 1
plot(a$x1, a$y1, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13)) #plotting the data
abline(mod1) # adding the regression model line

# Plot 2
plot(a$x2, a$y2, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13)) #plotting the data
abline(mod2) # adding the regression model line

# Plot 3
plot(a$x3, a$y3, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13)) #plotting the data
abline(mod3) # adding the regression model line

# Plot 4
plot(a$x4, a$y4, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13)) #plotting the data
abline(mod4) # adding the regression model line

# Adding text at the top
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)

 

Like this post?

Get our FREE Turnover Mini Course!

You’ll get 5 insight-rich daily lessons delivered right to your inbox.

In this series you’ll discover:

  • How to calculate this critical HR metric
  • How turnover can actually be a GOOD thing for your organization
  • How to develop your own LEADING INDICATORS
  • Other insightful workforce metrics to use today

There’s a bunch more too. All free. All digestible. Right to your inbox.

Yes! Sign Me Up!



Comments or Questions?

Add your comments OR just send me an email: john@hranalytics101.com

I would be happy to answer them!

Leave a Reply

Your email address will not be published.