We learned access elements of matrix and vectors in the previous sessions. Indexing in Dataframes is similar. Remember, R uses 1-based indexing (the first element is at position 1, not 0).
Examples:
A quick review of the DataFrame
Code
# Load data ev_data <- readxl::read_excel("Electric_Vehicle_Population_Data.xlsx")head(ev_data)
# A tibble: 6 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVRO… VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Observe what is the return of the following codes:
Code
ev_data[1, 1]
# A tibble: 1 × 1
`VIN (1-10)`
<chr>
1 1C4RJXN66R
Code
ev_data[, 2]
# A tibble: 194,232 × 1
County
<chr>
1 Snohomish
2 King
3 King
4 Kitsap
5 Snohomish
6 Yakima
7 Thurston
8 Snohomish
9 Snohomish
10 Snohomish
# ℹ 194,222 more rows
Code
ev_data[, c(2,3)]
# A tibble: 194,232 × 2
County City
<chr> <chr>
1 Snohomish Everett
2 King Renton
3 King Seattle
4 Kitsap Port Orchard
5 Snohomish Monroe
6 Yakima Moxee
7 Thurston Olympia
8 Snohomish Monroe
9 Snohomish Everett
10 Snohomish Everett
# ℹ 194,222 more rows
Code
ev_data[1, ]
# A tibble: 1 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRANGLER
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Code
ev_data[1:3, ]
# A tibble: 3 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRANGLER
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODEL 3
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Mind the type of the output. Observe what is the return of the following codes and see the difference:
Code
ev_data[[1, 1]]
[1] "1C4RJXN66R"
Code
class(ev_data[[1, 1]])
[1] "character"
Code
class(ev_data[1, 1])
[1] "tbl_df" "tbl" "data.frame"
What will happen if you run this line and why?
Code
# ev_data[[, 2]]
Exercise:
Load the Electric Vehicle Population Data (Excel) on your machine
# Subsetting specific columnsev_data[, c("Make", "Model", "Electric_Range")]
# A tibble: 194,232 × 3
Make Model Electric_Range
<chr> <chr> <dbl>
1 JEEP WRANGLER 21
2 KIA SOUL 93
3 TESLA MODEL 3 266
4 CHEVROLET VOLT 53
5 BMW X5 30
6 FORD ESCAPE 38
7 TESLA MODEL S 208
8 NISSAN LEAF 84
9 PORSCHE CAYENNE 14
10 NISSAN LEAF 0
# ℹ 194,222 more rows
Exercise:
Get the 10th row’s ‘City’ Column value.
Note
Code
ev_data[10, 'City']
# A tibble: 1 × 1
City
<chr>
1 Everett
1.1.3 Subsetting Data With condition
Subsetting is the process of extracting a portion of a dataset based on specific conditions. This can be useful for focusing on particular groups or conditions within your data.
Detailed Explanation:
Row and Column Subsetting: Use indexing to subset specific rows and columns.
Logical Conditions: Subset based on logical conditions using operators like ==, !=, <, >, &, |.
Examples:
Code
# Subsetting rows based on a conditionev_subset <- ev_data[ev_data$Model_Year ==2020, ]# Subsetting rows and columnsev_subset_cols <- ev_data[ev_data$Model_Year ==2020, c("Make", "Electric_Range")]
Using logic operator to do complicated selection
Code
# Find new cars with high electric rangeev_data[ev_data$Model_Year >2018& ev_data$Electric_Range >=100 , c("Make", "Model")]
# A tibble: 19,761 × 2
Make Model
<chr> <chr>
1 TESLA MODEL 3
2 TESLA MODEL 3
3 TESLA MODEL Y
4 CHEVROLET BOLT EV
5 TESLA MODEL 3
6 TESLA MODEL 3
7 TESLA MODEL 3
8 NISSAN LEAF
9 TESLA MODEL 3
10 TESLA MODEL 3
# ℹ 19,751 more rows
Code
# Find cars that is near me (Richardson is a city in Dallas and Collin counties)ev_data[(ev_data$County =='Collin'| ev_data$County =='Dallas'), c("Make", "Model")]
# A tibble: 10 × 2
Make Model
<chr> <chr>
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 TESLA MODEL 3
7 <NA> <NA>
8 <NA> <NA>
9 <NA> <NA>
10 <NA> <NA>
Code
# Find cars outside TX that are made by Tesla or is expensiveev_data[ev_data$State !='TX'& (ev_data$Make =='TESLA'| ev_data$`Base MSRP`>=10000), c("Make", "Model")]
# A tibble: 87,104 × 2
Make Model
<chr> <chr>
1 KIA SOUL
2 TESLA MODEL 3
3 TESLA MODEL S
4 TESLA MODEL 3
5 TESLA MODEL S
6 TESLA MODEL 3
7 TESLA MODEL Y
8 TESLA MODEL X
9 TESLA MODEL S
10 TESLA MODEL 3
# ℹ 87,094 more rows
ev_data$Model_Year == 2020: Subsets rows where the year is 2020.
Try subsetting the data using different conditions and explore the resulting subsets.
Exercise:
I want a car that is cheap or is cheap to run. Find the vehicles that are in Texas but not in Richardson, if their Electric_Range is larger than 20 or the Base MSRP is lower than 10000. Return the Make, Model and Electric Vehicle Type.
# A tibble: 22 × 3
Make Model `Electric Vehicle Type`
<chr> <chr> <chr>
1 TESLA MODEL X Battery Electric Vehicle (BEV)
2 TESLA MODEL Y Battery Electric Vehicle (BEV)
3 TESLA MODEL 3 Battery Electric Vehicle (BEV)
4 FORD MUSTANG MACH-E Battery Electric Vehicle (BEV)
5 TOYOTA PRIUS PRIME Plug-in Hybrid Electric Vehicle (PHEV)
6 TESLA MODEL Y Battery Electric Vehicle (BEV)
7 TESLA MODEL Y Battery Electric Vehicle (BEV)
8 TESLA MODEL 3 Battery Electric Vehicle (BEV)
9 TESLA MODEL S Battery Electric Vehicle (BEV)
10 FORD FUSION Plug-in Hybrid Electric Vehicle (PHEV)
# ℹ 12 more rows
1.2 Basic Data Transformation
Data transformation involves modifying data to fit a certain format or to prepare it for analysis. This includes operations like renaming columns, creating new variables, and changing data types.
Detailed Explanation:
Renaming Columns: Use the names() function or dplyr::rename().
Creating New Variables: Add new columns based on calculations or conditions.
Changing Data Types: Convert columns to appropriate data types using functions like as.numeric(), as.factor(), etc.
Code
# Renaming columnsnames(ev_data)[names(ev_data) =="Postal_Code"] <-"Zip_Code"# Creating new variablesev_data$Vehicle_Age <-2025- ev_data$Model_Year# Changing data typesev_data$Make <-as.factor(ev_data$Make)summary(ev_data)
VIN (1-10) County City State
Length:194232 Length:194232 Length:194232 Length:194232
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Zip_Code Model_Year Make Model
Min. : 1731 Min. :1997 TESLA :85320 Length:194232
1st Qu.:98052 1st Qu.:2019 CHEVROLET:14578 Class :character
Median :98125 Median :2022 NISSAN :14360 Mode :character
Mean :98176 Mean :2021 FORD :10266
3rd Qu.:98372 3rd Qu.:2023 KIA : 8497
Max. :99577 Max. :2025 BMW : 8150
NA's :9 (Other) :53061
Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility
Length:194232 Length:194232
Class :character Class :character
Mode :character Mode :character
Electric_Range Base MSRP Legislative District DOL Vehicle ID
Min. : 0.00 Min. : 0.0 Min. : 1.00 Min. : 4385
1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:17.00 1st Qu.:187225086
Median : 0.00 Median : 0.0 Median :33.00 Median :233940241
Mean : 54.84 Mean : 978.7 Mean :29.01 Mean :224892309
3rd Qu.: 68.00 3rd Qu.: 0.0 3rd Qu.:42.00 3rd Qu.:260115926
Max. :337.00 Max. :845000.0 Max. :49.00 Max. :479254772
NA's :2 NA's :2 NA's :432
Vehicle Location Electric Utility 2020 Census Tract Vehicle_Age
Length:194232 Length:194232 Min. :1.001e+09 Min. : 0.000
Class :character Class :character 1st Qu.:5.303e+10 1st Qu.: 2.000
Mode :character Mode :character Median :5.303e+10 Median : 3.000
Mean :5.298e+10 Mean : 4.218
3rd Qu.:5.305e+10 3rd Qu.: 6.000
Max. :5.602e+10 Max. :28.000
NA's :9
- names(): Retrieves or sets the names of an object’s elements. - as.factor(): Converts a variable to factor
Exercise:
Normalize the scores to a Electric_Range between 0 and 1 and store these in a new column named normalized_range.
VIN (1-10)
0
County
9
City
9
State
0
Zip_Code
9
Model_Year
0
Make
0
Model
0
Electric Vehicle Type
0
Clean Alternative Fuel Vehicle (CAFV) Eligibility
0
Electric_Range
0
Base MSRP
0
Legislative District
432
DOL Vehicle ID
0
Vehicle Location
13
Electric Utility
9
2020 Census Tract
9
Vehicle_Age
0
normalized_range
0
Code
# Indicates which elements are missing within a specific columnev_data[is.na(ev_data$County ),]
# A tibble: 9 × 19
`VIN (1-10)` County City State Zip_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <chr>
1 1G1RB6S53J <NA> <NA> BC NA 2018 CHEVROLET VOLT
2 7SAYGDEE4N <NA> <NA> CA NA 2022 TESLA MODEL Y
3 7SAYGAEE3P <NA> <NA> CA NA 2023 TESLA MODEL Y
4 3FA6P0PU8H <NA> <NA> KS NA 2017 FORD FUSION
5 5YJXCAE24H <NA> <NA> BC NA 2017 TESLA MODEL X
6 5YJ3E1EA5K <NA> <NA> BC NA 2019 TESLA MODEL 3
7 WBAJA9C50K <NA> <NA> AE NA 2019 BMW 530E
8 YV4H60CL5N <NA> <NA> GA NA 2022 VOLVO XC90
9 5YJ3E1EA1J <NA> <NA> AL NA 2018 TESLA MODEL 3
# ℹ 11 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>, Vehicle_Age <dbl>, normalized_range <dbl>
Code
# Indicates which elements are not missing within a specific columnev_data[!is.na(ev_data$City ),]
# A tibble: 194,221 × 19
`VIN (1-10)` County City State Zip_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVROLET VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
7 5YJSA1DNXD Thurston Olympia WA 98506 2013 TESLA MODE…
8 1N4AZ0CP8F Snohomish Monroe WA 98272 2015 NISSAN LEAF
9 WP1AE2A21J Snohomish Everett WA 98208 2018 PORSCHE CAYE…
10 1N4BZ1BV4N Snohomish Everett WA 98208 2022 NISSAN LEAF
# ℹ 194,211 more rows
# ℹ 11 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>, Vehicle_Age <dbl>, normalized_range <dbl>
Code
# Removes missing values of a specific column by subsettingev_data <- ev_data[!is.na(ev_data$City),]# Check na Removalsum(is.na(ev_data))
[1] 427
Code
# Removes ALL missing values in data framena.omit(ev_data)
# A tibble: 193,794 × 19
`VIN (1-10)` County City State Zip_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVROLET VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
7 5YJSA1DNXD Thurston Olympia WA 98506 2013 TESLA MODE…
8 1N4AZ0CP8F Snohomish Monroe WA 98272 2015 NISSAN LEAF
9 WP1AE2A21J Snohomish Everett WA 98208 2018 PORSCHE CAYE…
10 1N4BZ1BV4N Snohomish Everett WA 98208 2022 NISSAN LEAF
# ℹ 193,784 more rows
# ℹ 11 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>, Vehicle_Age <dbl>, normalized_range <dbl>
Code
# Check na Removalsum(is.na(ev_data))
[1] 427
Exercise:
Omit the missing value in MSRP column and run the exercises again, tell me if there is any difference.
1.4 Introduction to the dplyr Package
dplyr is a powerful package for data manipulation in R. It provides a set of functions (verbs) that help in transforming and summarizing data easily.
Key Verbs:
filter(): Select rows based on conditions.
select(): Choose specific columns.
mutate(): Create new columns.
arrange(): Order rows by columns.
summarize() and group_by(): Aggregate data by groups.
Make Model Model_Year
TESLA :79461 Length:164897 Min. :2018
CHEVROLET: 8887 Class :character 1st Qu.:2021
KIA : 7910 Mode :character Median :2022
FORD : 7126 Mean :2022
NISSAN : 6848 3rd Qu.:2023
BMW : 6265 Max. :2025
(Other) :48400
Code
# Creating a new variableev_selected <- ev_selected %>%mutate(Age =2025- Model_Year)# Arranging dataev_sorted <- ev_selected %>%arrange(desc(Model_Year))summary(ev_sorted)
Make Model Model_Year Age
TESLA :79461 Length:164897 Min. :2018 Min. :0.000
CHEVROLET: 8887 Class :character 1st Qu.:2021 1st Qu.:2.000
KIA : 7910 Mode :character Median :2022 Median :3.000
FORD : 7126 Mean :2022 Mean :3.204
NISSAN : 6848 3rd Qu.:2023 3rd Qu.:4.000
BMW : 6265 Max. :2025 Max. :7.000
(Other) :48400
Create a new column in the Electric Vehicle data representing the vehicle’s age.
Note
Code
# Your code herenames(life_data)[names(life_data) =="Death_Rate"] <-"Age_adjusted_Death_Rate"ev_data$Vehicle_Age <-2025- ev_data$Model_Year
Session 3: Wrapping Up and Q&A
In this session, we’ve covered essential data manipulation techniques, including subsetting, transforming data, and using the dplyr package. These skills are critical for preparing and analyzing data effectively.
Continue exploring different data manipulation techniques and apply them to various datasets to enhance your proficiency.
Q&A: Feel free to ask questions or share your experiences with data manipulation in R. Let’s discuss any challenges or interesting findings from your exercises.
---title: "EPPS Math Coding Camp"subtitle: "Basic Data Manipulation"Instructor: "Xingyuan Zhao"date: "August 12, 2024"format: html: toc: true toc-depth: 3 code-fold: show code-tools: true highlight-style: github---```{r setup, include=FALSE}knitr::opts_chunk$set(echo = TRUE)```# Session 1: Basic Data Manipulation## 1.1.1 Slicing/IndexWe learned access elements of matrix and vectors in the previous sessions. Indexing in Dataframes is similar. Remember, R uses 1-based indexing (the first element is at position 1, not 0).### Examples:A quick review of the DataFrame```{r}# Load data ev_data <- readxl::read_excel("Electric_Vehicle_Population_Data.xlsx")head(ev_data)```Observe what is the return of the following codes:```{r}ev_data[1, 1]ev_data[, 2]ev_data[, c(2,3)]ev_data[1, ]ev_data[1:3, ]```Mind the type of the output. Observe what is the return of the following codes and see the difference:```{r}ev_data[[1, 1]]class(ev_data[[1, 1]])class(ev_data[1, 1])```What will happen if you run this line and why?```{r}# ev_data[[, 2]]```### Exercise:- 1. Load the Electric Vehicle Population Data (Excel) on your machine- 2. Return the Model_Year Columns using only indexing::: {.callout-note collapse=true}```{r}ev_data[,6]```:::## 1.1.2 Using Variable Names```{r}# Subsetting specific columnhead(ev_data$Make)# Subsetting specific columnsev_data[, c("Make", "Model", "Electric_Range")]```### Exercise:- Get the 10th row's 'City' Column value.::: {.callout-note collapse=true}```{r}ev_data[10, 'City']```:::## 1.1.3 Subsetting Data With conditionSubsetting is the process of extracting a portion of a dataset based on specific conditions. This can be useful for focusing on particular groups or conditions within your data.### Detailed Explanation:- **Row and Column Subsetting**: Use indexing to subset specific rows and columns.- **Logical Conditions**: Subset based on logical conditions using operators like `==`, `!=`, `<`, `>`, `&`, `|`.### Examples:```{r}# Subsetting rows based on a conditionev_subset <- ev_data[ev_data$Model_Year ==2020, ]# Subsetting rows and columnsev_subset_cols <- ev_data[ev_data$Model_Year ==2020, c("Make", "Electric_Range")]```# Using logic operator to do complicated selection```{r}# Find new cars with high electric rangeev_data[ev_data$Model_Year >2018& ev_data$Electric_Range >=100 , c("Make", "Model")]# Find cars that is near me (Richardson is a city in Dallas and Collin counties)ev_data[(ev_data$County =='Collin'| ev_data$County =='Dallas'), c("Make", "Model")]# Find cars outside TX that are made by Tesla or is expensiveev_data[ev_data$State !='TX'& (ev_data$Make =='TESLA'| ev_data$`Base MSRP`>=10000), c("Make", "Model")]```<br>- ev_data$Model_Year == 2020: Subsets rows where the year is 2020. <br>- [, c("Make", "Model", "Electric_Range")]: Subsets specific columns by their names. <br><br>Try subsetting the data using different conditions and explore the resulting subsets.<br>### Exercise:- 1. I want a car that is cheap or is cheap to run. Find the vehicles that are in Texas but not in Richardson, if their Electric_Range is larger than 20 or the Base MSRP is lower than 10000. Return the Make, Model and Electric Vehicle Type. ::: {.callout-note collapse=true}```{r}ev_data[(ev_data$State =='TX'& ev_data$City !='Richardson') & (ev_data$Electric_Range >=200| ev_data$`Base MSRP`<=10000),c("Make", "Model", "Electric Vehicle Type")]```:::## 1.2 Basic Data TransformationData transformation involves modifying data to fit a certain format or to prepare it for analysis. This includes operations like renaming columns, creating new variables, and changing data types.<br>#### Detailed Explanation:- **Renaming Columns**: Use the names() function or dplyr::rename(). <br>- **Creating New Variables**: Add new columns based on calculations or conditions. <br>- **Changing Data Types**: Convert columns to appropriate data types using functions like as.numeric(), as.factor(), etc. <br>```{r}# Renaming columnsnames(ev_data)[names(ev_data) =="Postal_Code"] <-"Zip_Code"# Creating new variablesev_data$Vehicle_Age <-2025- ev_data$Model_Year# Changing data typesev_data$Make <-as.factor(ev_data$Make)summary(ev_data)```<br>- **names()**: Retrieves or sets the names of an object's elements. <br>- **as.factor()**: Converts a variable to factor<br><br>### Exercise:- Normalize the scores to a Electric_Range between 0 and 1 and store these in a new column named normalized_range.hint: $$\text{normalized\_value} = \frac{\text{value} - \min(\text{value})}{\max(\text{value}) - \min(\text{value})}$$::: {.callout-note collapse=true}```{r}class(ev_data$Electric_Range)ev_data <- ev_data[!is.na(ev_data$Electric_Range),]ev_data$normalized_range <- (ev_data$Electric_Range -min(ev_data$Electric_Range)) / (max(ev_data$Electric_Range) -min(ev_data$Electric_Range))ev_data$normalized_range```:::## 1.3 Missing Values```{r}# Count na of dfsum(is.na(ev_data))# Count na of each colcolSums(is.na(ev_data))# Indicates which elements are missing within a specific columnev_data[is.na(ev_data$County ),]# Indicates which elements are not missing within a specific columnev_data[!is.na(ev_data$City ),]# Removes missing values of a specific column by subsettingev_data <- ev_data[!is.na(ev_data$City),]# Check na Removalsum(is.na(ev_data))# Removes ALL missing values in data framena.omit(ev_data)# Check na Removalsum(is.na(ev_data))```### Exercise:- Omit the missing value in MSRP column and run the exercises again, tell me if there is any difference. ## 1.4 Introduction to the dplyr Packagedplyr is a powerful package for data manipulation in R. It provides a set of functions (verbs) that help in transforming and summarizing data easily.<br>#### Key Verbs:- **filter()**: Select rows based on conditions. <br>- **select()**: Choose specific columns. <br>- **mutate()**: Create new columns. <br>- **arrange()**: Order rows by columns. <br>- **summarize() and group_by()**: Aggregate data by groups. <br>```{r}# Loading dplyr# install.packages("dplyr")library(dplyr)# Filtering dataev_recent <- ev_data %>%filter(Model_Year >=2018)# Selecting columnsev_selected <- ev_recent %>%select(Make, Model, Model_Year)summary(ev_selected)# Creating a new variableev_selected <- ev_selected %>%mutate(Age =2025- Model_Year)# Arranging dataev_sorted <- ev_selected %>%arrange(desc(Model_Year))summary(ev_sorted)# Summarizing dataev_summary <- ev_data %>%group_by(Make) %>%summarize(Total_Electric_Range =sum(Electric_Range, na.rm =TRUE))head(ev_summary)```<br>- **%>% (pipe operator)**: Passes the result from the left-hand side to the function on the right-hand side. <br>- **filter()**: Filters rows based on a logical condition. <br>- **select()**: Selects specific columns. <br>- **mutate()**: Adds new columns or transforms existing ones. <br>- **arrange()**: Orders rows by specified columns. <br>- **summarize() and group_by()**: Aggregates data by grouping variables. <br><br>Utilize dplyr to perform complex data manipulations succinctly and efficiently.<br># Session 2: Hands-on Exercises## 2.1 Subsetting and Filtering Data1) Subset the US_Life_expectancy data for a specific state and year.2) Select only the columns "Year" and "Age_adjusted_Death_Rate" from US_Life_expectancy data.<br>::: {.callout-note collapse=true}```{r}# Your code herelife_data <-read.csv("US_Life_expectancy.csv")life_gender_year <- life_data[life_data$Gender =="Gender"& life_data$Year ==2000, ]life_make_model <- life_data[, c("Year", "Age_adjusted_Death_Rate")]```<br>:::## 2.2 Data Transformation1) Rename a column in the US_Life_expectancy data. <br>2) Create a new column in the Electric Vehicle data representing the vehicle's age. <br>::: {.callout-note collapse=true}```{r}# Your code herenames(life_data)[names(life_data) =="Death_Rate"] <-"Age_adjusted_Death_Rate"ev_data$Vehicle_Age <-2025- ev_data$Model_Year```<br>:::# Session 3: Wrapping Up and Q&AIn this session, we've covered essential data manipulation techniques, including subsetting, transforming data, and using the dplyr package. These skills are critical for preparing and analyzing data effectively.<br>Continue exploring different data manipulation techniques and apply them to various datasets to enhance your proficiency.<br>**Q&A**: Feel free to ask questions or share your experiences with data manipulation in R. Let's discuss any challenges or interesting findings from your exercises.<br>### Reference- https://datageneration.io/dataprogrammingwithr/intro- Chicago Harris School Coding Camp- Data.gov, the U.S. government's open data portal