EPPS Math Coding Camp

Multiple Linear Regression

Author

Dohyo Jeong

Published

August 13, 2024

Session 1: Multiple Linear Regression

1.1 Introduction and Data Overview

What is Multiple Linear Regression?

Multiple linear regression is an extension of simple linear regression in which the relationship between a single dependent variable and two or more independent variables is modeled. This approach helps capture the complexity of real-world data by considering multiple factors that could influence the outcome.

Dataset: Boston from the MASS Package

The Boston dataset from the MASS package provides information on housing in the Boston area, including socio-economic and environmental factors. It contains 506 observations on 14 variables.

Key variables:
- crim: Crime rate per capita by town.
- zn: Proportion of residential land zoned for large lots.
- indus: Proportion of non-retail business acres per town.
- chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- nox: Nitrogen oxides concentration.
- rm: Average number of rooms per dwelling.
- age: Proportion of owner-occupied units built before 1940.
- dis: Distances to Boston employment centers.
- rad: Index of accessibility to radial highways.
- tax: Property tax rate per $10,000.
- ptratio: Pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town.
- lstat: Percentage of lower status population.
- medv: Median value of owner-occupied homes in $1000s.

Code
# Load necessary packages and data
# install.packages("MASS")
library(MASS)
data("Boston")

# Display the first few rows of the dataset
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7


1.2 Extending to Multiple Predictors

Concept and Purpose

In simple linear regression, we model the relationship between a dependent variable and one independent variable. However, in real-world scenarios, outcomes are often influenced by multiple factors. Multiple linear regression allows us to model the relationship between the dependent variable (in this case, medv) and multiple independent variables, providing a more comprehensive understanding of the factors affecting the outcome.

Example: Fitting a Multiple Linear Regression Model

Code
# Multiple linear regression model
multi_lm <- lm(medv ~ crim + rm + age + dis + tax, data = Boston)

# Summary of the model
summary(multi_lm)

Call:
lm(formula = medv ~ crim + rm + age + dis + tax, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.614  -2.911  -0.833   1.987  40.959 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.796901   3.236248  -3.645 0.000295 ***
crim         -0.140933   0.037531  -3.755 0.000194 ***
rm            7.731058   0.390879  19.779  < 2e-16 ***
age          -0.079425   0.014272  -5.565 4.28e-08 ***
dis          -0.944654   0.194106  -4.867 1.52e-06 ***
tax          -0.011553   0.002144  -5.389 1.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.856 on 500 degrees of freedom
Multiple R-squared:  0.5986,    Adjusted R-squared:  0.5946 
F-statistic: 149.2 on 5 and 500 DF,  p-value: < 2.2e-16


Explanation:

The formula medv ~ crim + rm + age + dis + tax specifies the response variable (medv) and the predictor variables (crim, rm, age, dis, and tax).
lm() fits the multiple linear regression model, and summary() provides detailed output, including coefficients, R-squared, and p-values.

Interpreting Results:

  • Coefficients: Each coefficient represents the change in the dependent variable (medv) for a one-unit change in the corresponding independent variable, holding all other variables constant.
    For example, if the coefficient for rm is positive, it indicates that an increase in the average number of rooms per dwelling is associated with an increase in the median home value.
  • R-squared: This statistic measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. A higher R-squared indicates a better fit.
  • p-values: These values test the null hypothesis that a given coefficient is equal to zero (i.e., the variable has no effect). A p-value less than 0.05 typically indicates that the variable is statistically significant.

1.3 Interpreting Coefficients and p-values

Detailed Interpretation

Interpreting the coefficients and p-values in a multiple regression model is crucial for understanding the relationships between the variables.

Example Interpretation:

  • crim coefficient: Suppose the coefficient for crim (crime rate) is -0.2. This negative value suggests that, holding other factors constant, an increase in the crime rate by one unit is associated with a decrease in the median home value by $200 (since the dependent variable medv is in $1000s).
  • rm coefficient: If the coefficient for rm is 5.0, it indicates that, on average, each additional room in a dwelling is associated with an increase in the median home value by $5000, holding other variables constant.
  • age coefficient: A positive coefficient for age might suggest that older neighborhoods (with a higher proportion of homes built before 1940) are associated with higher median home values, assuming other factors are held constant.

Significance:

p-value < 0.05: A predictor with a p-value less than 0.05 is considered statistically significant, meaning there is strong evidence that the predictor is associated with the response variable.
Confidence Intervals: These provide a range of values within which the true coefficient is likely to fall, offering insight into the precision of the estimates.

1.4 Stepwise Regression and Model Selection

Overview

Stepwise regression selects a subset of variables that contribute the most to predicting the dependent variable. It combines statistical techniques with automated procedures to add or remove predictors based on their significance and impact on model performance.


Example 1: Forward Selection

Forward selection begins with an empty model and adds predictors one at a time, selecting the predictor that improves the model the most at each step.

Code
# install.packages("stats")
# Forward selection starting from an empty model
empty_model <- lm(medv ~ 1, data = Boston)
full_model <- lm(medv ~ ., data = Boston)

forward_model <- stats::step(empty_model, 
                             scope = list(lower = empty_model, upper = full_model),
                             direction = "forward")
Start:  AIC=2246.51
medv ~ 1

          Df Sum of Sq   RSS    AIC
+ lstat    1   23243.9 19472 1851.0
+ rm       1   20654.4 22062 1914.2
+ ptratio  1   11014.3 31702 2097.6
+ indus    1    9995.2 32721 2113.6
+ tax      1    9377.3 33339 2123.1
+ nox      1    7800.1 34916 2146.5
+ crim     1    6440.8 36276 2165.8
+ rad      1    6221.1 36495 2168.9
+ age      1    6069.8 36647 2171.0
+ zn       1    5549.7 37167 2178.1
+ black    1    4749.9 37966 2188.9
+ dis      1    2668.2 40048 2215.9
+ chas     1    1312.1 41404 2232.7
<none>                 42716 2246.5

Step:  AIC=1851.01
medv ~ lstat

          Df Sum of Sq   RSS    AIC
+ rm       1    4033.1 15439 1735.6
+ ptratio  1    2670.1 16802 1778.4
+ chas     1     786.3 18686 1832.2
+ dis      1     772.4 18700 1832.5
+ age      1     304.3 19168 1845.0
+ tax      1     274.4 19198 1845.8
+ black    1     198.3 19274 1847.8
+ zn       1     160.3 19312 1848.8
+ crim     1     146.9 19325 1849.2
+ indus    1      98.7 19374 1850.4
<none>                 19472 1851.0
+ rad      1      25.1 19447 1852.4
+ nox      1       4.8 19468 1852.9

Step:  AIC=1735.58
medv ~ lstat + rm

          Df Sum of Sq   RSS    AIC
+ ptratio  1   1711.32 13728 1678.1
+ chas     1    548.53 14891 1719.3
+ black    1    512.31 14927 1720.5
+ tax      1    425.16 15014 1723.5
+ dis      1    351.15 15088 1725.9
+ crim     1    311.42 15128 1727.3
+ rad      1    180.45 15259 1731.6
+ indus    1     61.09 15378 1735.6
<none>                 15439 1735.6
+ zn       1     56.56 15383 1735.7
+ age      1     20.18 15419 1736.9
+ nox      1     14.90 15424 1737.1

Step:  AIC=1678.13
medv ~ lstat + rm + ptratio

        Df Sum of Sq   RSS    AIC
+ dis    1    499.08 13229 1661.4
+ black  1    389.68 13338 1665.6
+ chas   1    377.96 13350 1666.0
+ crim   1    122.52 13606 1675.6
+ age    1     66.24 13662 1677.7
<none>               13728 1678.1
+ tax    1     44.36 13684 1678.5
+ nox    1     24.81 13703 1679.2
+ zn     1     14.96 13713 1679.6
+ rad    1      6.07 13722 1679.9
+ indus  1      0.83 13727 1680.1

Step:  AIC=1661.39
medv ~ lstat + rm + ptratio + dis

        Df Sum of Sq   RSS    AIC
+ nox    1    759.56 12469 1633.5
+ black  1    502.64 12726 1643.8
+ chas   1    267.43 12962 1653.1
+ indus  1    242.65 12986 1654.0
+ tax    1    240.34 12989 1654.1
+ crim   1    233.54 12995 1654.4
+ zn     1    144.81 13084 1657.8
+ age    1     61.36 13168 1661.0
<none>               13229 1661.4
+ rad    1     22.40 13206 1662.5

Step:  AIC=1633.47
medv ~ lstat + rm + ptratio + dis + nox

        Df Sum of Sq   RSS    AIC
+ chas   1    328.27 12141 1622.0
+ black  1    311.83 12158 1622.7
+ zn     1    151.71 12318 1629.3
+ crim   1    141.43 12328 1629.7
+ rad    1     53.48 12416 1633.3
<none>               12469 1633.5
+ indus  1     17.10 12452 1634.8
+ tax    1     10.50 12459 1635.0
+ age    1      0.25 12469 1635.5

Step:  AIC=1621.97
medv ~ lstat + rm + ptratio + dis + nox + chas

        Df Sum of Sq   RSS    AIC
+ black  1   272.837 11868 1612.5
+ zn     1   164.406 11977 1617.1
+ crim   1   116.330 12025 1619.1
+ rad    1    58.556 12082 1621.5
<none>               12141 1622.0
+ indus  1    26.274 12115 1622.9
+ tax    1     4.187 12137 1623.8
+ age    1     2.331 12139 1623.9

Step:  AIC=1612.47
medv ~ lstat + rm + ptratio + dis + nox + chas + black

        Df Sum of Sq   RSS    AIC
+ zn     1   189.936 11678 1606.3
+ rad    1   144.320 11724 1608.3
+ crim   1    55.633 11813 1612.1
<none>               11868 1612.5
+ indus  1    15.584 11853 1613.8
+ age    1     9.446 11859 1614.1
+ tax    1     2.703 11866 1614.4

Step:  AIC=1606.31
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn

        Df Sum of Sq   RSS    AIC
+ crim   1    94.712 11584 1604.2
+ rad    1    93.614 11585 1604.2
<none>               11678 1606.3
+ indus  1    16.048 11662 1607.6
+ tax    1     3.952 11674 1608.1
+ age    1     1.491 11677 1608.2

Step:  AIC=1604.19
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim

        Df Sum of Sq   RSS    AIC
+ rad    1   228.604 11355 1596.1
<none>               11584 1604.2
+ indus  1    15.773 11568 1605.5
+ age    1     2.470 11581 1606.1
+ tax    1     1.305 11582 1606.1

Step:  AIC=1596.1
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad

        Df Sum of Sq   RSS    AIC
+ tax    1   273.619 11081 1585.8
<none>               11355 1596.1
+ indus  1    33.894 11321 1596.6
+ age    1     0.096 11355 1598.1

Step:  AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad + tax

        Df Sum of Sq   RSS    AIC
<none>               11081 1585.8
+ indus  1   2.51754 11079 1587.7
+ age    1   0.06271 11081 1587.8
Code
summary(forward_model)

Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
    black + zn + crim + rad + tax, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
black         0.009291   0.002674   3.475 0.000557 ***
zn            0.045845   0.013523   3.390 0.000754 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16


- empty_model: This is the simplest model, containing only the intercept (no predictors). - The basic model starts with all predictors (lm(medv ~ .)).
- The scope argument defines the range of models to consider, from the simplest (intercept only) to the full model (all predictors).
- The step() function iteratively adds predictors that reduce the AIC, improving the model fit.

Example 2: Backward Elimination

Backward elimination starts with the full model and removes the least significant predictors one by one, based on criteria like AIC.

Code
# Backward elimination starting from the full model
full_model <- lm(medv ~ ., data = Boston)
backward_model <- stats::step(full_model, direction = "backward")
Start:  AIC=1589.64
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
    tax + ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- age      1      0.06 11079 1587.7
- indus    1      2.52 11081 1587.8
<none>                 11079 1589.6
- chas     1    218.97 11298 1597.5
- tax      1    242.26 11321 1598.6
- crim     1    243.22 11322 1598.6
- zn       1    257.49 11336 1599.3
- black    1    270.63 11349 1599.8
- rad      1    479.15 11558 1609.1
- nox      1    487.16 11566 1609.4
- ptratio  1   1194.23 12273 1639.4
- dis      1   1232.41 12311 1641.0
- rm       1   1871.32 12950 1666.6
- lstat    1   2410.84 13490 1687.3

Step:  AIC=1587.65
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
    ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- indus    1      2.52 11081 1585.8
<none>                 11079 1587.7
- chas     1    219.91 11299 1595.6
- tax      1    242.24 11321 1596.6
- crim     1    243.20 11322 1596.6
- zn       1    260.32 11339 1597.4
- black    1    272.26 11351 1597.9
- rad      1    481.09 11560 1607.2
- nox      1    520.87 11600 1608.9
- ptratio  1   1200.23 12279 1637.7
- dis      1   1352.26 12431 1643.9
- rm       1   1959.55 13038 1668.0
- lstat    1   2718.88 13798 1696.7

Step:  AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
- chas     1    227.21 11309 1594.0
- crim     1    245.37 11327 1594.8
- zn       1    257.82 11339 1595.4
- black    1    270.82 11352 1596.0
- tax      1    273.62 11355 1596.1
- rad      1    500.92 11582 1606.1
- nox      1    541.91 11623 1607.9
- ptratio  1   1206.45 12288 1636.0
- dis      1   1448.94 12530 1645.9
- rm       1   1963.66 13045 1666.3
- lstat    1   2723.48 13805 1695.0
Code
summary(backward_model)

Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
    tax + ptratio + black + lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
zn            0.045845   0.013523   3.390 0.000754 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
black         0.009291   0.002674   3.475 0.000557 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16


Explanation:

The model starts with all predictors (lm(medv ~ .)).
The step() function removes predictors that contribute least to the model fit, as indicated by the highest AIC.

Example 3: Bidirectional Elimination

Bidirectional elimination combines both forward selection and backward elimination, allowing variables to be added or removed based on their significance.

Code
# Bidirectional elimination
empty_model <- lm(medv ~ 1, data = Boston)
full_model <- lm(medv ~ ., data = Boston)

both_model <- stats::step(empty_model, 
                          scope = list(lower = empty_model, upper = full_model), 
                          direction = "both")
Start:  AIC=2246.51
medv ~ 1

          Df Sum of Sq   RSS    AIC
+ lstat    1   23243.9 19472 1851.0
+ rm       1   20654.4 22062 1914.2
+ ptratio  1   11014.3 31702 2097.6
+ indus    1    9995.2 32721 2113.6
+ tax      1    9377.3 33339 2123.1
+ nox      1    7800.1 34916 2146.5
+ crim     1    6440.8 36276 2165.8
+ rad      1    6221.1 36495 2168.9
+ age      1    6069.8 36647 2171.0
+ zn       1    5549.7 37167 2178.1
+ black    1    4749.9 37966 2188.9
+ dis      1    2668.2 40048 2215.9
+ chas     1    1312.1 41404 2232.7
<none>                 42716 2246.5

Step:  AIC=1851.01
medv ~ lstat

          Df Sum of Sq   RSS    AIC
+ rm       1    4033.1 15439 1735.6
+ ptratio  1    2670.1 16802 1778.4
+ chas     1     786.3 18686 1832.2
+ dis      1     772.4 18700 1832.5
+ age      1     304.3 19168 1845.0
+ tax      1     274.4 19198 1845.8
+ black    1     198.3 19274 1847.8
+ zn       1     160.3 19312 1848.8
+ crim     1     146.9 19325 1849.2
+ indus    1      98.7 19374 1850.4
<none>                 19472 1851.0
+ rad      1      25.1 19447 1852.4
+ nox      1       4.8 19468 1852.9
- lstat    1   23243.9 42716 2246.5

Step:  AIC=1735.58
medv ~ lstat + rm

          Df Sum of Sq   RSS    AIC
+ ptratio  1    1711.3 13728 1678.1
+ chas     1     548.5 14891 1719.3
+ black    1     512.3 14927 1720.5
+ tax      1     425.2 15014 1723.5
+ dis      1     351.2 15088 1725.9
+ crim     1     311.4 15128 1727.3
+ rad      1     180.5 15259 1731.6
+ indus    1      61.1 15378 1735.6
<none>                 15439 1735.6
+ zn       1      56.6 15383 1735.7
+ age      1      20.2 15419 1736.9
+ nox      1      14.9 15424 1737.1
- rm       1    4033.1 19472 1851.0
- lstat    1    6622.6 22062 1914.2

Step:  AIC=1678.13
medv ~ lstat + rm + ptratio

          Df Sum of Sq   RSS    AIC
+ dis      1     499.1 13229 1661.4
+ black    1     389.7 13338 1665.6
+ chas     1     378.0 13350 1666.0
+ crim     1     122.5 13606 1675.6
+ age      1      66.2 13662 1677.7
<none>                 13728 1678.1
+ tax      1      44.4 13684 1678.5
+ nox      1      24.8 13703 1679.2
+ zn       1      15.0 13713 1679.6
+ rad      1       6.1 13722 1679.9
+ indus    1       0.8 13727 1680.1
- ptratio  1    1711.3 15439 1735.6
- rm       1    3074.3 16802 1778.4
- lstat    1    5013.6 18742 1833.7

Step:  AIC=1661.39
medv ~ lstat + rm + ptratio + dis

          Df Sum of Sq   RSS    AIC
+ nox      1     759.6 12469 1633.5
+ black    1     502.6 12726 1643.8
+ chas     1     267.4 12962 1653.1
+ indus    1     242.6 12986 1654.0
+ tax      1     240.3 12989 1654.1
+ crim     1     233.5 12995 1654.4
+ zn       1     144.8 13084 1657.8
+ age      1      61.4 13168 1661.0
<none>                 13229 1661.4
+ rad      1      22.4 13206 1662.5
- dis      1     499.1 13728 1678.1
- ptratio  1    1859.3 15088 1725.9
- rm       1    2622.6 15852 1750.9
- lstat    1    5349.2 18578 1831.2

Step:  AIC=1633.47
medv ~ lstat + rm + ptratio + dis + nox

          Df Sum of Sq   RSS    AIC
+ chas     1     328.3 12141 1622.0
+ black    1     311.8 12158 1622.7
+ zn       1     151.7 12318 1629.3
+ crim     1     141.4 12328 1629.7
+ rad      1      53.5 12416 1633.3
<none>                 12469 1633.5
+ indus    1      17.1 12452 1634.8
+ tax      1      10.5 12459 1635.0
+ age      1       0.2 12469 1635.5
- nox      1     759.6 13229 1661.4
- dis      1    1233.8 13703 1679.2
- ptratio  1    2116.5 14586 1710.8
- rm       1    2546.2 15016 1725.5
- lstat    1    3664.3 16134 1761.8

Step:  AIC=1621.97
medv ~ lstat + rm + ptratio + dis + nox + chas

          Df Sum of Sq   RSS    AIC
+ black    1     272.8 11868 1612.5
+ zn       1     164.4 11977 1617.1
+ crim     1     116.3 12025 1619.1
+ rad      1      58.6 12082 1621.5
<none>                 12141 1622.0
+ indus    1      26.3 12115 1622.9
+ tax      1       4.2 12137 1623.8
+ age      1       2.3 12139 1623.9
- chas     1     328.3 12469 1633.5
- nox      1     820.4 12962 1653.1
- dis      1    1146.8 13288 1665.6
- ptratio  1    1924.9 14066 1694.4
- rm       1    2480.7 14622 1714.0
- lstat    1    3509.3 15650 1748.5

Step:  AIC=1612.47
medv ~ lstat + rm + ptratio + dis + nox + chas + black

          Df Sum of Sq   RSS    AIC
+ zn       1    189.94 11678 1606.3
+ rad      1    144.32 11724 1608.3
+ crim     1     55.63 11813 1612.1
<none>                 11868 1612.5
+ indus    1     15.58 11853 1613.8
+ age      1      9.45 11859 1614.1
+ tax      1      2.70 11866 1614.4
- black    1    272.84 12141 1622.0
- chas     1    289.27 12158 1622.7
- nox      1    626.85 12495 1636.5
- dis      1   1103.33 12972 1655.5
- ptratio  1   1804.30 13672 1682.1
- rm       1   2658.21 14526 1712.7
- lstat    1   2991.55 14860 1724.2

Step:  AIC=1606.31
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn

          Df Sum of Sq   RSS    AIC
+ crim     1     94.71 11584 1604.2
+ rad      1     93.61 11585 1604.2
<none>                 11678 1606.3
+ indus    1     16.05 11662 1607.6
+ tax      1      3.95 11674 1608.1
+ age      1      1.49 11677 1608.2
- zn       1    189.94 11868 1612.5
- black    1    298.37 11977 1617.1
- chas     1    300.42 11979 1617.2
- nox      1    627.62 12306 1630.8
- dis      1   1276.45 12955 1656.8
- ptratio  1   1364.63 13043 1660.2
- rm       1   2384.55 14063 1698.3
- lstat    1   3052.50 14731 1721.8

Step:  AIC=1604.19
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim

          Df Sum of Sq   RSS    AIC
+ rad      1    228.60 11355 1596.1
<none>                 11584 1604.2
+ indus    1     15.77 11568 1605.5
+ age      1      2.47 11581 1606.1
+ tax      1      1.31 11582 1606.1
- crim     1     94.71 11678 1606.3
- black    1    222.18 11806 1611.8
- zn       1    229.02 11813 1612.1
- chas     1    284.34 11868 1614.5
- nox      1    578.44 12162 1626.8
- ptratio  1   1192.90 12776 1651.8
- dis      1   1345.70 12929 1657.8
- rm       1   2419.57 14003 1698.2
- lstat    1   2753.42 14337 1710.1

Step:  AIC=1596.1
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad

          Df Sum of Sq   RSS    AIC
+ tax      1    273.62 11081 1585.8
<none>                 11355 1596.1
+ indus    1     33.89 11321 1596.6
+ age      1      0.10 11355 1598.1
- zn       1    171.14 11526 1601.7
- rad      1    228.60 11584 1604.2
- crim     1    229.70 11585 1604.2
- chas     1    272.67 11628 1606.1
- black    1    295.78 11651 1607.1
- nox      1    785.16 12140 1627.9
- dis      1   1341.37 12696 1650.6
- ptratio  1   1419.77 12775 1653.7
- rm       1   2182.57 13538 1683.1
- lstat    1   2785.28 14140 1705.1

Step:  AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad + tax

          Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
+ indus    1      2.52 11079 1587.7
+ age      1      0.06 11081 1587.8
- chas     1    227.21 11309 1594.0
- crim     1    245.37 11327 1594.8
- zn       1    257.82 11339 1595.4
- black    1    270.82 11352 1596.0
- tax      1    273.62 11355 1596.1
- rad      1    500.92 11582 1606.1
- nox      1    541.91 11623 1607.9
- ptratio  1   1206.45 12288 1636.0
- dis      1   1448.94 12530 1645.9
- rm       1   1963.66 13045 1666.3
- lstat    1   2723.48 13805 1695.0
Code
summary(both_model)

Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
    black + zn + crim + rad + tax, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
black         0.009291   0.002674   3.475 0.000557 ***
zn            0.045845   0.013523   3.390 0.000754 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16


Explanation:

This approach starts with either an empty model or a full model and considers adding or removing predictors based on the improvement in the criterion.

Session 2: Hands-on Exercises

  1. Explore the Boston dataset and select one dependent variable (Y) and at least three independent variables (X1, X2, X3, …).
  2. Use summary() and str() to explore the dataset and understand available variables. Choose a dependent variable (Y) that you think could be influenced by other factors in the dataset.
  3. Select at least three independent variables (X1, X2, X3, etc.) that may impact the dependent variable.
Code
# Load necessary package and data
library(MASS)
data("Boston")
Note

Think about the relationships between variables. For example, does the crime rate (crim) affect the median home value (medv)? Or could a house’s number of rooms (rm) predict home value?

Tip

Consider using cor() to check the correlations between variables, which might help select predictors.

Solution

Code
# Exploring the dataset
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
Code
str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Code
# Example selection (students should choose their own):
# Dependent variable (Y): medv (Median value of owner-occupied homes)
# Independent variables (X): crim (crime rate), rm (average number of rooms), tax (property tax rate)

# Check correlations (optional)
cor(Boston[c("medv", "crim", "rm", "tax")])
           medv       crim         rm        tax
medv  1.0000000 -0.3883046  0.6953599 -0.4685359
crim -0.3883046  1.0000000 -0.2192467  0.5827643
rm    0.6953599 -0.2192467  1.0000000 -0.2920478
tax  -0.4685359  0.5827643 -0.2920478  1.0000000


Exercise 1: Fitting a Multiple Linear Regression Model

Q1

  1. Use the lm() function to create a multiple linear regression model with your selected variables.
  2. Display the summary of the model to interpret the coefficients, R-squared, and p-values.
  3. For each independent variable, interpret the coefficient in terms of its impact on the dependent variable.
Code
# Example based on a hypothetical selection (students should use their own selected variables):
# Fitting the multiple linear regression model
model <- lm(medv ~ crim + rm + tax, data = Boston)

# Display the model summary
summary(model)

Call:
lm(formula = medv ~ crim + rm + tax, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.322  -3.105  -0.680   2.327  41.081 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -21.798367   2.805644  -7.769 4.48e-14 ***
crim         -0.138675   0.038512  -3.601 0.000349 ***
rm            7.901643   0.400605  19.724  < 2e-16 ***
tax          -0.011823   0.002005  -5.896 6.83e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.037 on 502 degrees of freedom
Multiple R-squared:  0.5716,    Adjusted R-squared:  0.5691 
F-statistic: 223.3 on 3 and 502 DF,  p-value: < 2.2e-16

Exercise 2: Refining the Model with Stepwise Regression

Q2

  1. Apply forward selection, backward elimination, or bidirectional elimination to your model using the step() function.
  2. Compare the new model with your initial model and discuss any differences in the selected variables and model performance.

Solution

Code
# Forward Selection
forward_model <- step(lm(medv ~ 1, data = Boston), 
                      scope = list(lower = lm(medv ~ 1, data = Boston), 
                                   upper = lm(medv ~ crim + rm + tax + age + dis, data = Boston)), 
                      direction = "forward")
Start:  AIC=2246.51
medv ~ 1

       Df Sum of Sq   RSS    AIC
+ rm    1   20654.4 22062 1914.2
+ tax   1    9377.3 33339 2123.1
+ crim  1    6440.8 36276 2165.8
+ age   1    6069.8 36647 2171.0
+ dis   1    2668.2 40048 2215.9
<none>              42716 2246.5

Step:  AIC=1914.19
medv ~ rm

       Df Sum of Sq   RSS    AIC
+ tax   1    3290.8 18771 1834.5
+ crim  1    2496.1 19566 1855.4
+ age   1    1997.0 20065 1868.2
+ dis   1     512.6 21549 1904.3
<none>              22062 1914.2

Step:  AIC=1834.45
medv ~ rm + tax

       Df Sum of Sq   RSS    AIC
+ crim  1    472.61 18299 1823.5
+ age   1    403.42 18368 1825.5
<none>              18771 1834.5
+ dis   1     55.81 18715 1834.9

Step:  AIC=1823.55
medv ~ rm + tax + crim

       Df Sum of Sq   RSS    AIC
+ age   1    341.95 17957 1816.0
+ dis   1     92.13 18206 1823.0
<none>              18299 1823.5

Step:  AIC=1816
medv ~ rm + tax + crim + age

       Df Sum of Sq   RSS    AIC
+ dis   1    812.12 17144 1794.6
<none>              17957 1816.0

Step:  AIC=1794.58
medv ~ rm + tax + crim + age + dis
Code
summary(forward_model)

Call:
lm(formula = medv ~ rm + tax + crim + age + dis, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.614  -2.911  -0.833   1.987  40.959 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.796901   3.236248  -3.645 0.000295 ***
rm            7.731058   0.390879  19.779  < 2e-16 ***
tax          -0.011553   0.002144  -5.389 1.09e-07 ***
crim         -0.140933   0.037531  -3.755 0.000194 ***
age          -0.079425   0.014272  -5.565 4.28e-08 ***
dis          -0.944654   0.194106  -4.867 1.52e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.856 on 500 degrees of freedom
Multiple R-squared:  0.5986,    Adjusted R-squared:  0.5946 
F-statistic: 149.2 on 5 and 500 DF,  p-value: < 2.2e-16
Code
# Backward Elimination
backward_model <- step(lm(medv ~ crim + rm + tax + age + dis, data = Boston), 
                       direction = "backward")
Start:  AIC=1794.58
medv ~ crim + rm + tax + age + dis

       Df Sum of Sq   RSS    AIC
<none>              17144 1794.6
- crim  1     483.5 17628 1806.7
- dis   1     812.1 17957 1816.0
- tax   1     995.9 18140 1821.2
- age   1    1061.9 18206 1823.0
- rm    1   13413.6 30558 2085.0
Code
summary(backward_model)

Call:
lm(formula = medv ~ crim + rm + tax + age + dis, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.614  -2.911  -0.833   1.987  40.959 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.796901   3.236248  -3.645 0.000295 ***
crim         -0.140933   0.037531  -3.755 0.000194 ***
rm            7.731058   0.390879  19.779  < 2e-16 ***
tax          -0.011553   0.002144  -5.389 1.09e-07 ***
age          -0.079425   0.014272  -5.565 4.28e-08 ***
dis          -0.944654   0.194106  -4.867 1.52e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.856 on 500 degrees of freedom
Multiple R-squared:  0.5986,    Adjusted R-squared:  0.5946 
F-statistic: 149.2 on 5 and 500 DF,  p-value: < 2.2e-16
Code
# Bidirectional Elimination
both_model <- step(lm(medv ~ 1, data = Boston), 
                   scope = list(lower = lm(medv ~ 1, data = Boston), 
                                upper = lm(medv ~ crim + rm + tax + age + dis, data = Boston)), 
                   direction = "both")
Start:  AIC=2246.51
medv ~ 1

       Df Sum of Sq   RSS    AIC
+ rm    1   20654.4 22062 1914.2
+ tax   1    9377.3 33339 2123.1
+ crim  1    6440.8 36276 2165.8
+ age   1    6069.8 36647 2171.0
+ dis   1    2668.2 40048 2215.9
<none>              42716 2246.5

Step:  AIC=1914.19
medv ~ rm

       Df Sum of Sq   RSS    AIC
+ tax   1    3290.8 18771 1834.5
+ crim  1    2496.1 19566 1855.4
+ age   1    1997.0 20065 1868.2
+ dis   1     512.6 21549 1904.3
<none>              22062 1914.2
- rm    1   20654.4 42716 2246.5

Step:  AIC=1834.45
medv ~ rm + tax

       Df Sum of Sq   RSS    AIC
+ crim  1     472.6 18298 1823.5
+ age   1     403.4 18368 1825.5
<none>              18771 1834.5
+ dis   1      55.8 18715 1834.9
- tax   1    3290.8 22062 1914.2
- rm    1   14567.9 33339 2123.1

Step:  AIC=1823.55
medv ~ rm + tax + crim

       Df Sum of Sq   RSS    AIC
+ age   1     341.9 17957 1816.0
+ dis   1      92.1 18206 1823.0
<none>              18298 1823.5
- crim  1     472.6 18771 1834.5
- tax   1    1267.3 19566 1855.4
- rm    1   14181.2 32480 2111.9

Step:  AIC=1816
medv ~ rm + tax + crim + age

       Df Sum of Sq   RSS    AIC
+ dis   1     812.1 17144 1794.6
<none>              17957 1816.0
- age   1     341.9 18299 1823.5
- crim  1     411.1 18368 1825.5
- tax   1     683.5 18640 1832.9
- rm    1   13551.4 31508 2098.5

Step:  AIC=1794.58
medv ~ rm + tax + crim + age + dis

       Df Sum of Sq   RSS    AIC
<none>              17144 1794.6
- crim  1     483.5 17628 1806.7
- dis   1     812.1 17957 1816.0
- tax   1     995.9 18140 1821.2
- age   1    1061.9 18206 1823.0
- rm    1   13413.6 30558 2085.0
Code
summary(both_model)

Call:
lm(formula = medv ~ rm + tax + crim + age + dis, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.614  -2.911  -0.833   1.987  40.959 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.796901   3.236248  -3.645 0.000295 ***
rm            7.731058   0.390879  19.779  < 2e-16 ***
tax          -0.011553   0.002144  -5.389 1.09e-07 ***
crim         -0.140933   0.037531  -3.755 0.000194 ***
age          -0.079425   0.014272  -5.565 4.28e-08 ***
dis          -0.944654   0.194106  -4.867 1.52e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.856 on 500 degrees of freedom
Multiple R-squared:  0.5986,    Adjusted R-squared:  0.5946 
F-statistic: 149.2 on 5 and 500 DF,  p-value: < 2.2e-16

Session 3: Wrapping Up and Q&A

In this session, we’ve covered multiple linear regression, extending to multiple predictors, interpreting coefficients and p-values, and using stepwise regression for model selection. These techniques are fundamental for building predictive models and understanding complex relationships in data.

Practice these methods with different datasets and variables to better understand and improve your analytical skills.

Q&A: Please share any questions or challenges encountered during the exercises. Let’s discuss solutions and clarify concepts!

Reference

  • https://datageneration.io/dataprogrammingwithr/intro
  • Chicago Harris School Coding Camp
  • Data.gov, the U.S. government’s open data portal
  • https://cran.r-project.org/web/packages/MASS/MASS.pdf
  • https://rdrr.io/cran/ISLR/man/Wage.html