EPPS Math Coding Camp

Introduction to Linear Regression

Author

Dohyo Jeong

Published

August 13, 2024

Session 1: Introduction to Linear Regression

1.1 Introduction and Data Overview

Dataset: Boston from the MASS Package

The Boston dataset, included in the MASS package, contains information about housing in the Boston area, collected by the U.S. Census Service. It includes 506 observations on 14 variables, encompassing various socio-economic and housing factors impacting home values and quality of life in different neighborhoods.

Key variables:
- crim: Crime rate per capita by town.
- zn: Proportion of residential land zoned for large lots.
- indus: Proportion of non-retail business acres per town.
- chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- nox: Nitrogen oxides concentration.
- rm: Average number of rooms per dwelling.
- age: Proportion of owner-occupied units built before 1940.
- dis: Distances to Boston employment centers.
- rad: Index of accessibility to radial highways.
- tax: Property tax rate per $10,000.
- ptratio: Pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town.
- lstat: Percentage of lower status population.
- medv: Median value of owner-occupied homes in $1000s.

Code
# Load necessary packages and data
# install.packages("MASS")
library(MASS)
data("Boston")

# Display the first few rows of the dataset
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7


Note

The MASS package contains the Boston dataset. The head() function shows the initial rows, providing a quick overview of the dataset’s structure and variables.


1.2 Descriptive Statistics

Overview and Importance


Descriptive statistics provide a summary of the dataset’s main features. This step is crucial in understanding the data’s central tendencies, dispersion, and overall distribution. Before diving into more complex analyses, it’s essential to identify any potential outliers or anomalies that could affect your results.

Calculating Descriptive Statistics

Code
# Summary statistics for the dataset
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
Code
# Specific calculations for key variables
mean_medv <- mean(Boston$medv, na.rm = TRUE)
mean_medv
[1] 22.53281
Code
sd_medv <- sd(Boston$medv, na.rm = TRUE)
sd_medv
[1] 9.197104
Code
mean_crim <- mean(Boston$crim, na.rm = TRUE)
mean_crim
[1] 3.613524
Code
sd_crim <- sd(Boston$crim, na.rm = TRUE)
sd_crim
[1] 8.601545


Explanation:

summary(Boston) provides a detailed overview of the dataset, including min, max, median, mean, and quartiles for each variable.
Calculating means and standard deviations for medv and crim provides insights into typical home values and crime rates, along with their variability.

Interpreting Results:

Mean and Standard Deviation: The mean provides the average value, while the standard deviation indicates the spread of data around the mean. A high standard deviation relative to the mean suggests greater variability.

1.3 t-tests and ANOVA

1.3.1 t-tests

A t-test is used to compare the means of two groups to see if they are statistically significantly different from each other. For instance, you might want to test whether homes near the Charles River (chas = 1) have different median values than those further away (chas = 0).

Code
# t-test for medv based on proximity to Charles River
t_test_result <- t.test(medv ~ chas, data = Boston)

# Displaying the t-test result
t_test_result

    Welch Two Sample t-test

data:  medv by chas
t = -3.1133, df = 36.876, p-value = 0.003567
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -10.476831  -2.215483
sample estimates:
mean in group 0 mean in group 1 
       22.09384        28.44000 


Explanation:

The formula medv ~ chas specifies that the test compares median home values (medv) between homes near (chas = 1) and not near (chas = 0) the Charles River.
The output includes the t-statistic, degrees of freedom, and p-value, indicating whether the difference in means is statistically significant.

Interpreting Results:

p-value: A low p-value (typically < 0.05) suggests a significant difference in means, leading to the rejection of the null hypothesis that the means are equal.

1.3.2. ANOVA (Analysis of Variance)


ANOVA is used to determine if there are statistically significant differences between the means of three or more independent (unrelated) groups. For example, you can analyze whether median home values (medv) differ across different levels of highway accessibility (rad).

Code
# ANOVA for medv by rad
anova_result <- aov(medv ~ factor(rad), data = Boston)

# Summary of ANOVA
summary(anova_result)
             Df Sum Sq Mean Sq F value Pr(>F)    
factor(rad)   8   9767  1220.9   18.42 <2e-16 ***
Residuals   497  32949    66.3                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Explanation:

medv ~ factor(rad) treats rad as a categorical variable, analyzing the effect of accessibility to highways on home values.
aov() conducts the ANOVA, and summary() provides the F-statistic and p-value.

Interpreting Results:

F-statistic: Indicates the ratio of variance between groups to variance within groups. A higher F-value suggests greater differences between group means.
p-value: If the p-value is low, it indicates significant differences in means across the groups.

1.4 Simple Linear Regression

Introduction and Purpose


Simple linear regression is a method for quantifying the relationship between two continuous variables. It models the dependent variable (Y) as a linear function of the independent variable (X), allowing for predictions and insights into their relationship.

Example: Predicting Home Values

Code
# Linear regression: medv as a function of lstat
lm_model <- lm(medv ~ lstat, data = Boston)

# Summary of the linear model
summary(lm_model)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
Code
# Plotting the regression line
plot(Boston$lstat, Boston$medv, main = "Linear Regression of medv on lstat",
     xlab = "Percentage of Lower Status Population (lstat)", ylab = "Median Value of Homes (medv)")
abline(lm_model, col = "red")


Explanation:

medv ~ lstat specifies the model, predicting median home values (medv) based on the percentage of lower status population (lstat).
lm() fits the model, and summary() provides detailed statistics, including coefficients, R-squared value, and p-values.

Interpreting Results:

Coefficients: The slope coefficient indicates the change in medv for a one-unit increase in lstat. The intercept represents the predicted value of medv when lstat is zero.
R-squared: Measures the proportion of variance in medv explained by lstat. A higher R-squared indicates a better fit.
p-value: Tests the hypothesis that the slope is different from zero. A low p-value indicates a significant relationship.

Difference-in-Differences (DID) Analysis

Introduction and Purpose

The Difference-in-Differences (DID) method estimates causal relationships, particularly in observational studies where randomized control trials are not feasible. DID compares the changes in outcomes over time between a treatment group and a control group, helping to isolate the effect of a treatment or intervention.

Example: Impact of Policy Change on Home Values

Suppose a new policy was introduced in 1975 that potentially affected home values in certain areas (e.g., stricter zoning laws). The DID approach could help estimate the policy’s impact by comparing the change in home values in affected areas (treatment group) to those in unaffected areas (control group).

Code
# Creating a hypothetical policy introduction year and treatment variable
Boston$year <- sample(1970:1980, nrow(Boston), replace = TRUE)
Boston$treat <- ifelse(Boston$zn > 15, 1, 0)  # Treatment if zn > 15

head(Boston$year)
[1] 1978 1979 1972 1972 1971 1979
Code
head(Boston$treat)
[1] 1 0 0 0 0 0
Code
# DID Model: Interaction of time and treatment
did_model <- lm(medv ~ treat * year, data = Boston)

# Summary of the DID model
summary(did_model)

Call:
lm(formula = medv ~ treat * year, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.8790  -5.6767  -0.7523   3.1181  30.2373 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -255.1102   261.2266  -0.977    0.329
treat        737.7864   556.2475   1.326    0.185
year           0.1395     0.1323   1.055    0.292
treat:year    -0.3694     0.2817  -1.311    0.190

Residual standard error: 8.457 on 502 degrees of freedom
Multiple R-squared:  0.1594,    Adjusted R-squared:  0.1544 
F-statistic: 31.74 on 3 and 502 DF,  p-value: < 2.2e-16

Session 2: Hands-on Exercises

The diamonds dataset from the ggplot2 package contains data on 53,940 diamonds, with attributes that can be used to explore relationships between price, quality, and physical dimensions.

Key Variable Descriptions:

  • carat: Weight of the diamond (continuous variable).
  • cut: Quality of the cut, categorized as Fair, Good, Very Good, Premium, and Ideal (ordered factor).
  • color: Diamond color, with grades from D (best) to J (worst) (ordered factor).
  • clarity: Clarity of the diamond, ranging from I1 (worst) to IF (best) (ordered factor).
  • depth: Total depth percentage, which is the height of a diamond divided by its average diameter (continuous variable).
  • table: Width of the diamond’s table expressed as a percentage of its diameter (continuous variable).
  • price: Price of the diamond in US dollars (continuous variable).
Code
# Load the necessary package
library(ggplot2)
data("diamonds")

head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Exercise 1: Descriptive Statistics


Q1

  1. Calculate the mean and standard deviation of the carat, price, and depth variables.
  2. Use the summary() function to get a detailed summary of the price variable.
  3. Determine the number of diamonds within each cut category using the table() function.

Solution

Code
# Step 1: Calculate mean and standard deviation
mean_carat <- mean(diamonds$carat, na.rm = TRUE)
sd_carat <- sd(diamonds$carat, na.rm = TRUE)
mean_price <- mean(diamonds$price, na.rm = TRUE)
sd_price <- sd(diamonds$price, na.rm = TRUE)
mean_depth <- mean(diamonds$depth, na.rm = TRUE)
sd_depth <- sd(diamonds$depth, na.rm = TRUE)

mean_carat
[1] 0.7979397
Code
sd_carat
[1] 0.4740112
Code
mean_price
[1] 3932.8
Code
sd_price
[1] 3989.44
Code
mean_depth
[1] 61.7494
Code
sd_depth
[1] 1.432621
Code
# Step 2: Summary of the price variable
summary_price <- summary(diamonds$price)
summary_price
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    326     950    2401    3933    5324   18823 
Code
# Step 3: Number of diamonds by cut category
cut_distribution <- table(diamonds$cut)
cut_distribution

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 


Exercise 2: t-test


Q2

  1. Subset the diamonds dataset to include only diamonds with an “Ideal” or “Fair” cut.
  2. Perform a t-test to compare the mean prices between these two cut categories.
  3. nterpret the p-value to determine if the difference in mean prices is statistically significant.

Solution

Code
# Subset the data for Ideal and Fair cuts
diamonds_ideal_fair <- subset(diamonds, cut %in% c("Ideal", "Fair"))

# Perform the t-test
t_test_result <- t.test(price ~ cut, data = diamonds_ideal_fair)
t_test_result

    Welch Two Sample t-test

data:  price by cut
t = 9.7484, df = 1894.8, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Fair and group Ideal is not equal to 0
95 percent confidence interval:
  719.9065 1082.5251
sample estimates:
 mean in group Fair mean in group Ideal 
           4358.758            3457.542 

Exercise 3: ANOVA


Q3

  1. Perform an ANOVA test to assess the effect of clarity on diamond price.
  2. If significant differences are found, conduct a Tukey’s HSD post-hoc test to identify which clarity levels differ from each other.

Solution

Code
# Perform ANOVA on price by clarity
anova_result <- aov(price ~ clarity, data = diamonds)
summary(anova_result)
               Df    Sum Sq   Mean Sq F value Pr(>F)    
clarity         7 2.331e+10 3.330e+09     215 <2e-16 ***
Residuals   53932 8.352e+11 1.549e+07                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Post-hoc test if ANOVA is significant
tukey_result <- TukeyHSD(anova_result)
tukey_result
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = price ~ clarity, data = diamonds)

$clarity
                   diff          lwr         upr     p adj
SI2-I1     1138.8599147   683.395891  1594.32394 0.0000000
SI1-I1       71.8324571  -378.570901   522.23582 0.9997320
VS2-I1        0.8207037  -450.377702   452.01911 1.0000000
VS1-I1      -84.7132999  -542.298929   372.87233 0.9992819
VVS2-I1    -640.4316203 -1109.531923  -171.33132 0.0009165
VVS1-I1   -1401.0540535 -1881.569711  -920.53840 0.0000000
IF-I1     -1059.3295848 -1580.334655  -538.32451 0.0000000
SI1-SI2   -1067.0274575 -1229.386830  -904.66808 0.0000000
VS2-SI2   -1138.0392109 -1302.591274  -973.48715 0.0000000
VS1-SI2   -1223.5732146 -1404.907129 -1042.23930 0.0000000
VVS2-SI2  -1779.2915349 -1987.983831 -1570.59924 0.0000000
VVS1-SI2  -2539.9139681 -2773.136347 -2306.69159 0.0000000
IF-SI2    -2198.1894995 -2506.318797 -1890.06020 0.0000000
VS2-SI1     -71.0117534  -220.988718    78.96521 0.8410824
VS1-SI1    -156.5457571  -324.764949    11.67343 0.0899007
VVS2-SI1   -712.2640774  -909.667681  -514.86047 0.0000000
VVS1-SI1  -1472.8865106 -1696.064436 -1249.70859 0.0000000
IF-SI1    -1131.1620420 -1431.760399  -830.56369 0.0000000
VS1-VS2     -85.5340037  -255.870471    84.80246 0.7958312
VVS2-VS2   -641.2523240  -840.463263  -442.04138 0.0000000
VVS1-VS2  -1401.8747572 -1626.652874 -1177.09664 0.0000000
IF-VS2    -1060.1502885 -1361.938605  -758.36197 0.0000000
VVS2-VS1   -555.7183203  -769.001243  -342.43540 0.0000000
VVS1-VS1  -1316.3407535 -1553.679770 -1079.00174 0.0000000
IF-VS1     -974.6162849 -1285.873083  -663.35949 0.0000000
VVS1-VVS2  -760.6224332 -1019.466585  -501.77828 0.0000000
IF-VVS2    -418.8979645  -746.848084   -90.94785 0.0027364
IF-VVS1     341.7244687    -2.356168   685.80510 0.0531204

Exercise 4: Simple Linear Regression


Q4

  1. Fit a simple linear regression model with carat as the predictor and price as the response variable.
  2. Display the summary of the regression model, including coefficients, R-squared, and p-value.
  3. Create a scatter plot of price versus carat, and add the regression line to visualize the relationship.

Solution

Code
# Fit the linear regression model
lm_model <- lm(price ~ carat, data = diamonds)
summary(lm_model)

Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18585.3   -804.8    -18.9    537.4  12731.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2256.36      13.06  -172.8   <2e-16 ***
carat        7756.43      14.07   551.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8493 
F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
Code
# Create a scatter plot and add the regression line
plot(diamonds$carat, diamonds$price, main = "Linear Regression of Price on Carat",
     xlab = "Carat", ylab = "Price")
abline(lm_model, col = "blue")

Session 3: Wrapping Up and Q&A

In this session, we’ve explored key statistical methods, including descriptive statistics, t-tests, ANOVA, and simple linear regression, using the Boston dataset. These methods are essential for analyzing and interpreting data in various fields.

Continue practicing these techniques with different datasets and variables to strengthen your understanding and analytical skills.

Q&A: Please share any questions or challenges encountered during the exercises. Let’s discuss solutions and clarify concepts!

Reference

  • https://datageneration.io/dataprogrammingwithr/intro
  • Chicago Harris School Coding Camp
  • Data.gov, the U.S. government’s open data portal
  • https://cran.r-project.org/web/packages/MASS/MASS.pdf
  • https://ggplot2.tidyverse.org/reference/diamonds.html