EPPS Math Coding Camp

Advanced Data Manipulation with dplyr

Published

August 13, 2024

Instructor: Xingyuan Zhao

Session 1: Advanced Data Manipulation with dplyr

1.1 Introduction


dplyr is a powerful R package for data manipulation. It provides intuitive and efficient functions for transforming data frames, making it a staple in data analysis workflows. This session will cover key dplyr functions, including group_by(), summarize(), joining data frames, and reshaping data with tidyr.

1.2 Loading the Data

Code
# Load necessary packages
# install.packages("dplyr")
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
# Reading the CSV file
life_expectancy_data <- read.csv("US_Life_expectancy.csv")

# View the first few rows of the data
head(life_expectancy_data)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate
1 1900 Female                    48.3                  2410.4
2 1901 Female                    50.6                  2350.5
3 1902 Female                    53.4                  2162.8
4 1903 Female                    52.0                  2250.6
5 1904 Female                    49.1                  2358.8
6 1905 Female                    50.2                  2287.7


The read.csv() function loads the dataset, and head() displays the first few rows, providing an overview of the data structure and variables.

1.3 Grouping and Summarizing Data

Grouping Data with group_by()

Code
# Grouping data by Gender
grouped_data <- life_expectancy_data %>% group_by(Gender)


The group_by() function groups the dataset by the specified variable (Gender).
This operation is essential for performing group-wise computations, such as calculating averages or counts.

Aggregate Grouped Data with summarize()

Code
# Summarizing Average Life Expectancy by Gender
summary_data <- grouped_data %>%
  summarize(Average_Life_Expectancy = mean(Average_Life_Expectancy, na.rm = TRUE),
            Average_Death_Rate = mean(Age_adjusted_Death_Rate, na.rm = TRUE))

# Viewing the summarized data
summary_data
# A tibble: 2 × 3
  Gender Average_Life_Expectancy Average_Death_Rate
  <chr>                    <dbl>              <dbl>
1 Female                    69.6              1288.
2 Male                      64.4              1662.


The summarize() function calculates summary statistics for each group created by group_by().
Here, it computes the average life expectancy and average death rate for each gender, ignoring NA values with na.rm = TRUE.
This function is highly customizable, allowing for a wide range of summary calculations.

Exercise:

Group the data by Year, Aggregate by sum the Age_adjusted_Death_Rate

Code
year_summary <- life_expectancy_data %>%
  group_by(Year) %>%
  summarize(Total_Death_Rate = sum(Age_adjusted_Death_Rate, na.rm = TRUE))
head(year_summary)
# A tibble: 6 × 2
   Year Total_Death_Rate
  <int>            <dbl>
1  1900            5041.
2  1901            4951 
3  1902            4611.
4  1903            4764.
5  1904            5012.
6  1905            4853.

1.4 Joining Data Frames

Introduction to Joins


Joining data frames is crucial when working with related datasets.
dplyr provides several join functions, such as inner_join(), left_join(), right_join(), and full_join(). These functions merge datasets based on common keys.

Code
# Creating a sample dataset for joining based on the year 2000

data_to_join <- data.frame(
  Year = 1800:2010,
  Year_Category = ifelse(1800:2010 <2010, "Before 2000", "After 2000")
)

# Joining data frames
joined_data_l <- left_join(life_expectancy_data, data_to_join, by = c('Year'))

joined_data_r <- right_join(life_expectancy_data, data_to_join, by = c('Year'))

joined_data_i <- inner_join(life_expectancy_data, data_to_join, by = "Year")

joined_data_o <- full_join(life_expectancy_data, data_to_join, by = "Year")

# Viewing the joined data
head(joined_data_l)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000
Code
head(joined_data_r)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000
Code
head(joined_data_i)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000
Code
head(joined_data_o)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000


additional_data_2000 specifies categories for “Before 2000” and “After 2000” for each gender.
left_join() merges life_expectancy_data and additional_data_2000 based on the Gender and Year_Category columns.
In a left join, all rows from the left data frame are kept, with matching rows from the right data frame added.

merge() Or you can use merge().

Parameters: merge(x, y, by = “common_column”, all = FALSE, all.x = FALSE, all.y = FALSE,…)


Exercise:

Create a new dataset with a column specifying if the year is before or after 1945, and join it ON the original dataset, keeping the original dataset intact and without na values. Hint: which side of join you should use?

Code
additional_data_war <- data.frame(
  Year = 1900:2020,
  Year_Category = ifelse(1900:2020 <1945, "Before WW2", "After WW2")
)

# Joining data frames
joined_data_ww2 <- left_join(life_expectancy_data, additional_data_war, by = 'Year')

# Viewing the joined data
head(joined_data_ww2)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4    Before WW2
2 1901 Female                    50.6                  2350.5    Before WW2
3 1902 Female                    53.4                  2162.8    Before WW2
4 1903 Female                    52.0                  2250.6    Before WW2
5 1904 Female                    49.1                  2358.8    Before WW2
6 1905 Female                    50.2                  2287.7    Before WW2

1.5 Adding data using function with mutate()

Add a column of calculated values using Mutate

Code
life_expectancy_data_dev <- mutate(
  life_expectancy_data, 
  deviation=Average_Life_Expectancy - mean(Average_Life_Expectancy))

Exercise:

  • Normalize the Average_Life_Expectancy between 0 and 1 and store these in a new column named accordingly.

hint:

\[ \text{normalized\_value} = \frac{\text{value} - \min(\text{value})}{\max(\text{value}) - \min(\text{value})} \]

Code
life_expectancy_data_norm <- mutate(
  life_expectancy_data, 
  life_expectancy_normalized = (Average_Life_Expectancy - min(Average_Life_Expectancy)) / 
                               (max(Average_Life_Expectancy) - min(Average_Life_Expectancy))
)

Session 2: Wrapping Up and Q&A

In this session, we’ve explored advanced data manipulation techniques using dplyr, including grouping and summarizing data, joining data frames, and reshaping data with tidyr. These skills are crucial for preparing and analyzing data efficiently.

Continue practicing these techniques with different datasets to solidify your understanding and explore new data analysis possibilities.

Q&A: Please share any questions or challenges you’ve encountered. Let’s discuss different approaches and solutions!

Reference

  • https://datageneration.io/dataprogrammingwithr/intro
  • Chicago Harris School Coding Camp
  • Data.gov, the U.S. government’s open data portal