EPPS Math Coding Camp

Regression Diagnostics and Model Evaluation

Author

Dohyo Jeong

Published

August 13, 2024

Session 1: Regression Diagnostics and Model Evaluation

1.1 Introduction and Data Overview

Dataset: Boston from the MASS Package

The Boston dataset from the MASS package includes housing data from the Boston area, with variables such as crime rate, tax rates, and median home values.

Key variables:
- crim: Per capita crime rate by town.
- zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: Proportion of non-retail business acres per town.
- chas: Charles River dummy variable (1 if tract bounds the river; 0 otherwise).
- nox: Nitrogen oxides concentration (parts per 10 million).
- rm: Average number of rooms per dwelling.
- age: Proportion of owner-occupied units built before 1940.
- dis: Weighted mean of distances to five Boston employment centers.
- rad: Index of accessibility to radial highways.
- tax: Full-value property tax rate per $10,000.
- ptratio: Pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
- lstat: Percentage of lower-status population.
- medv: Median value of owner-occupied homes in $1000s.

Code
# Load necessary packages and data
# install.packages("MASS")
library(MASS)
data("Boston")

# Display the first few rows of the dataset
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
Code
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
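
Before fitting anything, it can help to see which predictors move most strongly with the response. Since every column of Boston is numeric, cor() applies directly; a minimal sketch (not part of the original workflow):

Code
# Which predictors correlate most strongly with medv? (a sketch)
# medv itself appears first with correlation 1
sort(cor(Boston)[, "medv"], decreasing = TRUE)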


1.2 Residual Analysis

Purpose and Importance

Residual analysis is a crucial step in regression diagnostics. By examining the residuals (the differences between observed and predicted values), we can identify patterns that might indicate model inadequacies, such as non-linearity, heteroscedasticity, or the presence of outliers.

Example: Residual Plots

Code
# Fitting a linear model
lm_boston <- lm(medv ~ lstat + rm + age, data = Boston)

# Residual plots
par(mfrow = c(2, 2))
plot(lm_boston)


Explanation:

  • Residuals vs. Fitted: This plot helps detect non-linearity, heteroscedasticity (non-constant variance), and outliers. Ideally, residuals should be randomly scattered around zero.
  • Normal Q-Q: This plot checks whether the residuals are normally distributed. If the normality assumption holds, the points should fall approximately along the reference line.
  • Scale-Location: Also known as the Spread-Location plot, it checks for homoscedasticity. The residuals should spread equally along the range of fitted values.
  • Residuals vs. Leverage: This plot helps detect influential observations that might disproportionately affect the model. Points with high leverage and large residuals, especially those beyond the Cook's distance contours, are flagged as potentially influential.

Interpreting Results:

  • Non-random patterns: In the Residuals vs. Fitted plot, patterns such as a funnel shape suggest heteroscedasticity, while curvature suggests non-linearity.
  • Outliers or Influential Points: In the Residuals vs. Leverage plot, points with high leverage and large residuals could be influential.
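
If you want the raw ingredients behind these four plots, base R exposes them directly. A minimal sketch using the model above (the flagging cutoffs are common rules of thumb, not part of the original material):

Code
# Quantities behind the diagnostic plots (a sketch)
res   <- resid(lm_boston)          # raw residuals
lev   <- hatvalues(lm_boston)      # leverage (diagonal of the hat matrix)
rstd  <- rstandard(lm_boston)      # standardized residuals
cooks <- cooks.distance(lm_boston) # Cook's distance

# Rule-of-thumb flags for potentially influential observations
which(lev > 2 * mean(lev))      # high leverage
which(cooks > 4 / nrow(Boston)) # large Cook's distance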

1.3 Checking Assumptions

Linearity, Normality, and Homoscedasticity

Linearity:
The relationship between predictors and the response variable should be linear.

Normality:
The residuals should be normally distributed. The Normal Q-Q plot helps assess this.

Homoscedasticity:
The variance of the residuals should be constant across all levels of the predictors.

Linearity and Homoscedasticity

Code
# Checking for linearity and homoscedasticity
plot(lm_boston$fitted.values, lm_boston$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")  # reference line at zero

Note

This plot is used to check for linearity and homoscedasticity. Residuals should be randomly scattered around zero without showing any specific pattern.
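
For a formal check of functional form, the Ramsey RESET test can complement the visual inspection. A sketch, assuming the lmtest package (introduced later in this session) is installed:

Code
# RESET test for neglected non-linearity (a sketch; requires lmtest)
library(lmtest)
resettest(lm_boston)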

Normality of Residuals

Code
# Checking for normality
qqnorm(lm_boston$residuals)
qqline(lm_boston$residuals, col = "red")

Note

Q-Q Plot: If the points lie close to the reference line, the residuals are approximately normally distributed, which satisfies the normality assumption.
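
A formal complement to the Q-Q plot is the Shapiro-Wilk test, which is applicable here because shapiro.test() accepts samples of up to 5,000 observations and Boston has 506. A minimal sketch:

Code
# Shapiro-Wilk test of residual normality (a sketch)
shapiro.test(resid(lm_boston))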

1.4 Detecting Multicollinearity with VIF

Multicollinearity occurs when the independent variables in a regression model are highly correlated, leading to unstable and unreliable estimates of the regression coefficients. The Variance Inflation Factor (VIF) quantifies how much each coefficient's variance is inflated by this correlation.
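
Concretely, the VIF for predictor j equals 1 / (1 - R²_j), where R²_j is the R-squared from regressing predictor j on the remaining predictors. A sketch of the computation by hand for lstat:

Code
# VIF for lstat from first principles (a sketch)
r2_lstat <- summary(lm(lstat ~ rm + age, data = Boston))$r.squared
1 / (1 - r2_lstat)  # should match vif(lm_boston)["lstat"]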

Calculating VIF

Code
# Load the necessary package
# install.packages("car")
library(car)
Code
# Calculate VIF
vif(lm_boston)
   lstat       rm      age 
2.477304 1.675215 1.638542 
Note

VIF: A common rule of thumb treats a VIF above 5 (or, more leniently, 10) as a sign of problematic multicollinearity, suggesting that the variable may need to be removed or combined with others to reduce redundancy.

1.5 Testing for Heteroscedasticity

Heteroscedasticity occurs when the variance of residuals is not constant across all levels of the independent variables. This violates the assumption of homoscedasticity, leading to inefficient estimates.

Breusch-Pagan Test

Code
# Load the necessary package
# install.packages("lmtest")
library(lmtest)
Code
# Perform the Breusch-Pagan test
bptest(lm_boston)

    studentized Breusch-Pagan test

data:  lm_boston
BP = 19.771, df = 3, p-value = 0.0001894
Note

Breusch-Pagan Test: This test checks for heteroscedasticity in the model. A significant p-value (typically < 0.05) suggests the presence of heteroscedasticity.
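
When heteroscedasticity is detected, one common response is to keep the coefficient estimates but correct their standard errors. A minimal sketch using heteroscedasticity-consistent (HC1) errors, assuming the sandwich package is installed (it is not used elsewhere in these notes):

Code
# Robust (HC1) standard errors as a remedy (a sketch; requires sandwich)
# install.packages("sandwich")
library(sandwich)
coeftest(lm_boston, vcov = vcovHC(lm_boston, type = "HC1"))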

1.6 Model Evaluation

Model evaluation assesses how well your regression model fits the data and how well it is likely to perform on new data.

Metrics for Model Evaluation:

  • R-squared: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
  • Adjusted R-squared: Adjusts R-squared for the number of predictors, providing a more accurate measure for models with multiple predictors.
  • AIC (Akaike Information Criterion): Balances model fit against complexity. Lower AIC values indicate a better model (see the sketch below).
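
The adjusted R-squared and AIC can also be reproduced from their definitions; a sketch for the model fitted earlier (p is the number of predictors):

Code
# Adjusted R-squared and AIC from their definitions (a sketch)
s <- summary(lm_boston)
n <- nrow(Boston); p <- 3                     # observations and predictors
1 - (1 - s$r.squared) * (n - 1) / (n - p - 1) # matches s$adj.r.squared

ll <- logLik(lm_boston)
-2 * as.numeric(ll) + 2 * attr(ll, "df")      # matches AIC(lm_boston)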

Comparing Models

Code
# Model with additional predictors
lm_boston_full <- lm(medv ~ lstat + rm + age + dis + tax, data = Boston)

# R-squared comparison
summary(lm_boston)$r.squared
[1] 0.6390341
Code
summary(lm_boston_full)$r.squared
[1] 0.6684895
Code
# Adjusted R-squared comparison
summary(lm_boston)$adj.r.squared
[1] 0.636877
Code
summary(lm_boston_full)$adj.r.squared
[1] 0.6651744
Code
# AIC comparison
AIC(lm_boston)
[1] 3174.88
Code
AIC(lm_boston_full)
[1] 3135.808
  • R-squared and Adjusted R-squared: These metrics help assess the explanatory power of your model. Adjusted R-squared is particularly useful when comparing models with different numbers of predictors.
  • AIC: This criterion helps balance the trade-off between model complexity and goodness of fit. A lower AIC value indicates a model that better fits the data while avoiding overfitting.
Note

Model Preference: A model with a higher Adjusted R-squared and lower AIC is typically preferred, but it’s also important to consider the predictors’ practical significance and the risk of overfitting.
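
Because the predictors of lm_boston are a subset of those in lm_boston_full, the two models are nested and can also be compared with a partial F-test; a minimal sketch:

Code
# Partial F-test for nested models (a sketch)
anova(lm_boston, lm_boston_full)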

Session 2: Hands-on Exercises

The diamonds dataset from the ggplot2 package contains data on 53,940 diamonds, with attributes that can be used to explore relationships between price, quality, and physical dimensions.

Key Variable Descriptions:

  • carat: Weight of the diamond (continuous variable).
  • cut: Quality of the cut, categorized as Fair, Good, Very Good, Premium, and Ideal (ordered factor).
  • color: Diamond color, with grades from D (best) to J (worst) (ordered factor).
  • clarity: Clarity of the diamond, ranging from I1 (worst) to IF (best) (ordered factor).
  • depth: Total depth percentage, which is the height of a diamond divided by its average diameter (continuous variable).
  • table: Width of the diamond’s table expressed as a percentage of its diameter (continuous variable).
  • price: Price of the diamond in US dollars (continuous variable).
Code
# Load the necessary package
library(ggplot2)
data("diamonds")
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Choose a dependent variable (Y) and at least two independent variables (X1, X2, …). Fit a multiple linear regression model.

Selecting Variables and Fitting a Regression Model

Q1

  1. Use the lm() function to create a multiple linear regression model.
  2. Display the summary of the model to interpret the coefficients, R-squared, and p-values.
Code
# Example (students should choose their own variables):
# Dependent variable: price
# Independent variables: carat, cut, color

model <- lm(price ~ carat + cut + color, data = diamonds)

# Display the model summary
summary(model)

Call:
lm(formula = price ~ carat + cut + color, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-17313.9   -751.2    -83.9    543.6  12273.0 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -3149.82      15.76 -199.905  < 2e-16 ***
carat        8183.74      13.90  588.885  < 2e-16 ***
cut.L        1243.35      24.74   50.260  < 2e-16 ***
cut.Q        -531.75      21.93  -24.252  < 2e-16 ***
cut.C         372.06      19.16   19.417  < 2e-16 ***
cut^4          76.15      15.39    4.949 7.49e-07 ***
color.L     -1579.17      21.72  -72.699  < 2e-16 ***
color.Q      -732.85      19.86  -36.902  < 2e-16 ***
color.C      -107.41      18.64   -5.763 8.32e-09 ***
color^4        81.63      17.12    4.769 1.85e-06 ***
color^5      -138.64      16.18   -8.568  < 2e-16 ***
color^6      -161.09      14.68  -10.973  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1432 on 53928 degrees of freedom
Multiple R-squared:  0.8711,    Adjusted R-squared:  0.8711 
F-statistic: 3.315e+04 on 11 and 53928 DF,  p-value: < 2.2e-16
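
Note that cut and color are ordered factors, so R encodes them with orthogonal polynomial contrasts by default; that is why the summary reports terms such as cut.L (linear), cut.Q (quadratic), and cut.C (cubic) rather than one coefficient per level. You can inspect the encoding directly:

Code
# Polynomial contrast matrix R uses for the ordered factor cut
contrasts(diamonds$cut)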

Residual Analysis

Q2

  1. Create residual plots to check for linearity, homoscedasticity, and normality.
  2. Use the par() function to display multiple plots at once.
Code
# Residual analysis
par(mfrow = c(2, 2))
plot(model)

Checking for Multicollinearity and Heteroscedasticity

Q3

  1. Calculate VIF for each independent variable in your model.
  2. Use the bptest() function from the lmtest package to test for heteroscedasticity.
  3. Interpret the results of the test.
Code
# Install the car package if not already installed
# install.packages("car")
library(car)

# Calculate VIF
vif_values <- vif(model)
vif_values
          GVIF Df GVIF^(1/(2*Df))
carat 1.141104  1        1.068225
cut   1.040331  4        1.004955
color 1.107001  6        1.008507
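
Note

Because cut and color enter the model with more than one degree of freedom, vif() reports the generalized VIF (GVIF) rather than ordinary VIF values. The GVIF^(1/(2*Df)) column is comparable to the square root of an ordinary VIF, so squaring it puts it back on the familiar VIF scale; a sketch:

Code
# Square GVIF^(1/(2*Df)) to compare against the usual VIF thresholds (a sketch)
vif_values[, "GVIF^(1/(2*Df))"]^2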
Code
# Install the lmtest package if not already installed
# install.packages("lmtest")
library(lmtest)

# Breusch-Pagan test for heteroscedasticity
bp_test <- bptest(model)
bp_test

    studentized Breusch-Pagan test

data:  model
BP = 8859.1, df = 11, p-value < 2.2e-16
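
The tiny p-value points to strong heteroscedasticity, which is expected for diamond prices since their variability grows with price. One common remedy, offered here as a sketch rather than the required answer, is to refit the model on the log scale, which often reduces (though may not eliminate) the problem:

Code
# Refit with log(price) to stabilize the variance (a sketch)
model_log <- lm(log(price) ~ carat + cut + color, data = diamonds)
bptest(model_log)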

Model Comparison and Evaluation

Q4

  1. Fit an alternative model with a different set of independent variables.
  2. Compare the models based on Adjusted R-squared and AIC to determine which model is better.
Code
# Alternative model with different independent variables
model2 <- lm(price ~ carat + clarity + depth, data = diamonds)

# Compare R-squared and Adjusted R-squared
summary(model)$adj.r.squared
[1] 0.8711236
Code
summary(model2)$adj.r.squared
[1] 0.895255
Code
# Compare AIC
AIC(model)
[1] 937048.2
Code
AIC(model2)
[1] 925863.1
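
Since model evaluation also concerns performance on new data, a simple out-of-sample check can supplement AIC. A sketch using a hypothetical 80/20 train-test split (the seed 123 is an arbitrary choice for reproducibility):

Code
# Out-of-sample RMSE comparison on a held-out test set (a sketch)
set.seed(123)  # arbitrary seed for reproducibility
idx   <- sample(nrow(diamonds), size = floor(0.8 * nrow(diamonds)))
train <- diamonds[idx, ]
test  <- diamonds[-idx, ]

m1 <- lm(price ~ carat + cut + color,     data = train)
m2 <- lm(price ~ carat + clarity + depth, data = train)

rmse <- function(m) sqrt(mean((test$price - predict(m, newdata = test))^2))
c(model = rmse(m1), model2 = rmse(m2))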

Session 3: Wrapping Up and Q&A

In this session, we’ve explored essential concepts in regression diagnostics and model evaluation, including residual analysis, checking assumptions, identifying influential points, and model comparison using R-squared and AIC.

Practice these techniques on different datasets to deepen your understanding and sharpen your analytical skills.

Q&A: Please share any questions or challenges encountered during the exercises. Let’s discuss solutions and clarify concepts!


References

  • https://datageneration.io/dataprogrammingwithr/intro
  • Chicago Harris School Coding Camp
  • Data.gov, the U.S. government’s open data portal
  • https://cran.r-project.org/web/packages/MASS/MASS.pdf
  • https://ggplot2.tidyverse.org/reference/diamonds.html