The working directory is the folder in which you are currently working.
Code
# 1.1.1 Direct import If the data is in the same folder as the current working directoryair_data <-read.csv("Air_Quality.csv")# 1.1.2 Setting working directory before direct import#setwd('D:/Summer Coding Camp/Jeong_CodingCamp_draft_code/Jeong_CodingCamp_draft_code')air_data <-read.csv("Air_Quality.csv")# 1.1.3 Reading by giving the absolute address#air_data <- read.csv('D:/Summer Coding Camp/Jeong_CodingCamp_draft_code/Jeong_CodingCamp_draft_code/Air_Quality.csv')
1.2 Loading data of various formats
file type
package
function
.csv
readr
read_csv()
.dta (stata)
haven
read_dta()
.xlsx
readxl
read_xlsx()
1.2.1 Reading CSV Files
CSV (Comma-Separated Values) files are commonly used for storing tabular data. In R, you can use the read.csv() function to import CSV files.
Example: Importing a CSV File
Code
# Reading a CSV fileair_data <-read.csv("Air_Quality.csv")#air_data <- read.csv("D:/2024Fall/DataCamp/Air_Quality.csv")# Viewing the first few rows of the datahead(air_data)
Unique_ID Indicator_ID Name Measure
1 179772 640 Boiler Emissions- Total SO2 Emissions Number per km2
2 179785 640 Boiler Emissions- Total SO2 Emissions Number per km2
3 178540 365 Fine particles (PM 2.5) Mean
4 178561 365 Fine particles (PM 2.5) Mean
5 823217 365 Fine particles (PM 2.5) Mean
6 177910 365 Fine particles (PM 2.5) Mean
Measure_Info Type_Name Year Place_Name Data.Value
1 number UHF42 2015 Southeast Queens 0.3
2 number UHF42 2015 Bensonhurst - Bay Ridge 1.2
3 mcg/m3 UHF42 2012 Bensonhurst - Bay Ridge 8.6
4 mcg/m3 UHF42 2012 Southeast Queens 8.0
5 mcg/m3 UHF42 2022 Southeast Queens 6.1
6 mcg/m3 UHF42 2012 Bensonhurst - Bay Ridge 10.0
Explanation:
read.csv(): Reads a CSV file into a data frame. Specify the file path and name.
File Path: The file path can be absolute or relative. Always ensure the correct path format, especially on different operating systems.
head(): Displays the first few rows of the data, useful for a quick preview.
Additional Examples:
Code
# Reading a CSV file with additional optionsair_data <-read.csv("Air_Quality.csv", na.strings =c("", "NA"), stringsAsFactors =FALSE)# Previewing the data structurestr(air_data)
- Handling Missing Data: By default, read.csv() interprets blank cells as NA (Not Available). You can control this with the na.strings parameter. - stringsAsFactors: Set to FALSE to prevent automatic conversion of character strings to factors.
1.2.2 Reading Excel Files
Excel files can be read into R using the readxl package, which provides functions like read_excel().
Code
# Installing and loading the readxl package# install.packages("readxl")library(readxl)# Reading an Excel fileev_data <-read_excel("Electric_Vehicle_Population_Data.xlsx")#ev_data <- read_excel("D:/2024Fall/DataCamp/Electric_Vehicle_Population_Data.xlsx")# Viewing the first few rows of the datahead(ev_data)
# A tibble: 6 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVRO… VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Explanation:
library(readxl): Loads the readxl package, necessary for reading Excel files.
read_excel(): Reads an Excel file into a data frame. Specify the file path and name.
Code
# Reading specific sheets from an Excel fileev_data <-read_excel("Electric_Vehicle_Population_Data.xlsx", sheet ="Sheet1")# Checking for multiple sheetssheets <-excel_sheets("Electric_Vehicle_Population_Data.xlsx")#sheets <- excel_sheets("D:/2024Fall/DataCamp/Electric_Vehicle_Population_Data.xlsx")sheets
[1] "Sheet1"
- Reading Specific Sheets: Use the sheet parameter in read_excel() to specify which sheet to read. - Handling Excel Formats: The package supports .xls and .xlsx formats. - excel_sheets(): Returns a vector of sheet names in the Excel file.
1.3.1 Exploring the Data Using Basic Statistical Functions
Once the data is loaded, basic statistical functions can help summarize and understand the data.
Code
# Summary statistics for Tuberculosis datasummary(air_data)
Unique_ID Indicator_ID Name Measure
Min. :121644 Min. :365 Length:14077 Length:14077
1st Qu.:174276 1st Qu.:365 Class :character Class :character
Median :410845 Median :375 Mode :character Mode :character
Mean :428717 Mean :453
3rd Qu.:650015 3rd Qu.:643
Max. :828353 Max. :661
Measure_Info Type_Name Year Place_Name
Length:14077 Length:14077 Min. :2005 Length:14077
Class :character Class :character 1st Qu.:2011 Class :character
Mode :character Mode :character Median :2015 Mode :character
Mean :2015
3rd Qu.:2019
Max. :2022
Data.Value
Min. : 0.00
1st Qu.: 8.80
Median : 15.10
Mean : 22.52
3rd Qu.: 28.00
Max. :424.70
Code
# Summary statistics for Electric Vehicle datasummary(ev_data)
VIN (1-10) County City State
Length:194232 Length:194232 Length:194232 Length:194232
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Postal_Code Model_Year Make Model
Min. : 1731 Min. :1997 Length:194232 Length:194232
1st Qu.:98052 1st Qu.:2019 Class :character Class :character
Median :98125 Median :2022 Mode :character Mode :character
Mean :98176 Mean :2021
3rd Qu.:98372 3rd Qu.:2023
Max. :99577 Max. :2025
NA's :9
Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility
Length:194232 Length:194232
Class :character Class :character
Mode :character Mode :character
Electric_Range Base MSRP Legislative District DOL Vehicle ID
Min. : 0.00 Min. : 0.0 Min. : 1.00 Min. : 4385
1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:17.00 1st Qu.:187225086
Median : 0.00 Median : 0.0 Median :33.00 Median :233940241
Mean : 54.84 Mean : 978.7 Mean :29.01 Mean :224892309
3rd Qu.: 68.00 3rd Qu.: 0.0 3rd Qu.:42.00 3rd Qu.:260115926
Max. :337.00 Max. :845000.0 Max. :49.00 Max. :479254772
NA's :2 NA's :2 NA's :432
Vehicle Location Electric Utility 2020 Census Tract
Length:194232 Length:194232 Min. :1.001e+09
Class :character Class :character 1st Qu.:5.303e+10
Mode :character Mode :character Median :5.303e+10
Mean :5.298e+10
3rd Qu.:5.305e+10
Max. :5.602e+10
NA's :9
Code
# Shows the details of data# View(air_data)# show first or last n rowshead(air_data, 5)
Unique_ID Indicator_ID Name Measure
1 179772 640 Boiler Emissions- Total SO2 Emissions Number per km2
2 179785 640 Boiler Emissions- Total SO2 Emissions Number per km2
3 178540 365 Fine particles (PM 2.5) Mean
4 178561 365 Fine particles (PM 2.5) Mean
5 823217 365 Fine particles (PM 2.5) Mean
Measure_Info Type_Name Year Place_Name Data.Value
1 number UHF42 2015 Southeast Queens 0.3
2 number UHF42 2015 Bensonhurst - Bay Ridge 1.2
3 mcg/m3 UHF42 2012 Bensonhurst - Bay Ridge 8.6
4 mcg/m3 UHF42 2012 Southeast Queens 8.0
5 mcg/m3 UHF42 2022 Southeast Queens 6.1
Code
tail(air_data, 10)
Unique_ID Indicator_ID Name Measure
14068 154601 643 Annual vehicle miles traveled Million miles
14069 154618 643 Annual vehicle miles traveled Million miles
14070 154604 643 Annual vehicle miles traveled Million miles
14071 816732 643 Annual vehicle miles traveled Million miles
14072 816731 643 Annual vehicle miles traveled Million miles
14073 816914 643 Annual vehicle miles traveled Million miles
14074 816913 643 Annual vehicle miles traveled Million miles
14075 816872 643 Annual vehicle miles traveled Million miles
14076 816832 643 Annual vehicle miles traveled Million miles
14077 151658 643 Annual vehicle miles traveled Million miles
Measure_Info Type_Name Year Place_Name
14068 per square mile CD 2005 Flatlands and Canarsie (CD18)
14069 per square mile CD 2005 Tottenville and Great Kills (CD3)
14070 per square mile CD 2005 Jackson Heights (CD3)
14071 per square mile CD 2010 Flatlands and Canarsie (CD18)
14072 per square mile CD 2019 Flatlands and Canarsie (CD18)
14073 per square mile CD 2019 Tottenville and Great Kills (CD3)
14074 per square mile CD 2010 Tottenville and Great Kills (CD3)
14075 per square mile UHF42 2010 Canarsie - Flatlands
14076 per square mile UHF42 2010 Southwest Queens
14077 per square mile UHF42 2005 Jamaica
Data.Value
14068 39.8
14069 13.0
14070 64.2
14071 38.5
14072 44.4
14073 12.9
14074 14.7
14075 43.4
14076 65.8
14077 41.0
- summary(): Provides a summary of each column in the data, including minimum, maximum, mean, and quartiles for numeric data.
Code
# Summary statistics for Tuberculosis datasummary(air_data)
Unique_ID Indicator_ID Name Measure
Min. :121644 Min. :365 Length:14077 Length:14077
1st Qu.:174276 1st Qu.:365 Class :character Class :character
Median :410845 Median :375 Mode :character Mode :character
Mean :428717 Mean :453
3rd Qu.:650015 3rd Qu.:643
Max. :828353 Max. :661
Measure_Info Type_Name Year Place_Name
Length:14077 Length:14077 Min. :2005 Length:14077
Class :character Class :character 1st Qu.:2011 Class :character
Mode :character Mode :character Median :2015 Mode :character
Mean :2015
3rd Qu.:2019
Max. :2022
Data.Value
Min. : 0.00
1st Qu.: 8.80
Median : 15.10
Mean : 22.52
3rd Qu.: 28.00
Max. :424.70
Code
# Checking for missing valuesmissing_values <-sum(is.na(air_data))# Data type conversion if neededair_data$Year <-as.factor(air_data$Year)summary(air_data)
Unique_ID Indicator_ID Name Measure
Min. :121644 Min. :365 Length:14077 Length:14077
1st Qu.:174276 1st Qu.:365 Class :character Class :character
Median :410845 Median :375 Mode :character Mode :character
Mean :428717 Mean :453
3rd Qu.:650015 3rd Qu.:643
Max. :828353 Max. :661
Measure_Info Type_Name Year Place_Name
Length:14077 Length:14077 2019 :1515 Length:14077
Class :character Class :character 2011 :1399 Class :character
Mode :character Mode :character 2014 :1281 Mode :character
2017 :1185
2010 :1167
2013 : 849
(Other):6681
Data.Value
Min. : 0.00
1st Qu.: 8.80
Median : 15.10
Mean : 22.52
3rd Qu.: 28.00
Max. :424.70
- Summary Statistics: Provide insights into the central tendency, spread, and shape of the data distribution. - Checking Data Quality: Identify missing values, outliers, and incorrect data types. - is.na(): Identifies NA values in the data. - as.factor(): Converts a variable to factor if necessary.
1.3.2 Exploring the Data Using Basic Statistical Functions
names() or colnames(): both show the names of columns of a data frame
dim(): returns the dimensions of data frame (i.e. number of rows and number of columns)
R allows you to export data frames to various formats, such as CSV or Excel. The write.csv() function is commonly used to save data as a CSV file.
Writing csv files
Code
# Writing data to a CSV filewrite.csv(air_data, "Processed_Air_Quality.csv", row.names =FALSE)
- write.csv(): Writes a data frame to a CSV file. Set row.names = FALSE to avoid including row numbers.
Writing Excel files
Code
# Installing and loading writexl package# install.packages("writexl")library(writexl)
Warning: package 'writexl' was built under R version 4.3.3
Code
# Writing data to an Excel filewrite_xlsx(ev_data, "Processed_Electric_Vehicle_Data.xlsx")
- write_xlsx(): Exports a data frame to an Excel file.
Session 2: Hands-on Exercises
2.1 Importing and Exploring Data
Q1
Import the Air_Quality data and view the column names.
Calculate the mean of a numeric column of your choice.
Identify the number of unique values in a categorical column.
Code
# Your code herecolumn_names <-colnames(air_data)mean_value <-mean(air_data$Data.Value, na.rm =TRUE)unique_values <-length(unique(air_data$Data.Value))
Q2
Import the Air_Quality data and check the number of rows and columns.
Calculate the median of a numeric column.
Count the number of missing values in the Electric Vehicle data.
Code
# Your code herenum_rows <-nrow(air_data)num_cols <-ncol(air_data)median_value <-median(air_data$Data.Value, na.rm =TRUE)missing_ev_data <-sum(is.na(ev_data))
Q3 Writing Data to Files
Export the Electric Vehicle data to a new CSV file.
Modify a column (e.g., replace missing values) and save the modified data.
Code
# Your code here# Export to CSVwrite.csv(ev_data, "Processed_Electric_Vehicle_Data.csv", row.names =FALSE)# Modify and save dataev_data$Postal_Code[is.na(ev_data$Postal_Code)] <-0write.csv(ev_data, "Modified_Electric_Vehicle_Data.csv", row.names =FALSE)
Session 3: Wrapping Up and Q&A
This session covered importing data from CSV and Excel files, exploring the data using basic statistical functions, and exporting data to different formats. Mastering these skills is crucial for data analysis workflows.
Continue practicing with different datasets and explore the various functionalities of R for data import/export.
Q&A: Feel free to ask questions or share your experiences with data import/export in R. Let’s discuss any interesting findings or challenges you’ve encountered.
---title: "EPPS Math Coding Camp"subtitle: "Importing and Exporting Data"author: "Karl Ho"date: "August 12, 2024"format: html: toc: true toc-depth: 3 code-fold: show code-tools: true highlight-style: github---```{r setup, include=FALSE}knitr::opts_chunk$set(echo = TRUE)```# Session 1: Importing and Exporting Data## 1.1 Working directoryThe working directory is the folder in which you are currently working.```{r}# 1.1.1 Direct import If the data is in the same folder as the current working directoryair_data <-read.csv("Air_Quality.csv")# 1.1.2 Setting working directory before direct import#setwd('D:/Summer Coding Camp/Jeong_CodingCamp_draft_code/Jeong_CodingCamp_draft_code')air_data <-read.csv("Air_Quality.csv")# 1.1.3 Reading by giving the absolute address#air_data <- read.csv('D:/Summer Coding Camp/Jeong_CodingCamp_draft_code/Jeong_CodingCamp_draft_code/Air_Quality.csv')```## 1.2 Loading data of various formats|file type|package|function||-|-|-||.csv|readr|read_csv()||.dta (stata)|haven|read_dta()||.xlsx|readxl|read_xlsx()|## 1.2.1 Reading CSV FilesCSV (Comma-Separated Values) files are commonly used for storing tabular data. In R, you can use the `read.csv()` function to import CSV files.### Example: Importing a CSV File```{r}# Reading a CSV fileair_data <-read.csv("Air_Quality.csv")#air_data <- read.csv("D:/2024Fall/DataCamp/Air_Quality.csv")# Viewing the first few rows of the datahead(air_data)```<br>#### Explanation: <br>- **read.csv()**: Reads a CSV file into a data frame. Specify the file path and name. <br>- **File Path**: The file path can be absolute or relative. Always ensure the correct path format, especially on different operating systems.- **head()**: Displays the first few rows of the data, useful for a quick preview. <br>### Additional Examples:```{r}# Reading a CSV file with additional optionsair_data <-read.csv("Air_Quality.csv", na.strings =c("", "NA"), stringsAsFactors =FALSE)# Previewing the data structurestr(air_data)```<br>- **Handling Missing Data**: By default, `read.csv()` interprets blank cells as `NA` (Not Available). You can control this with the `na.strings` parameter. <br>- **stringsAsFactors**: Set to FALSE to prevent automatic conversion of character strings to factors.<br>## 1.2.2 Reading Excel FilesExcel files can be read into R using the readxl package, which provides functions like read_excel(). <br>```{r}# Installing and loading the readxl package# install.packages("readxl")library(readxl)# Reading an Excel fileev_data <-read_excel("Electric_Vehicle_Population_Data.xlsx")#ev_data <- read_excel("D:/2024Fall/DataCamp/Electric_Vehicle_Population_Data.xlsx")# Viewing the first few rows of the datahead(ev_data)```<br>#### Explanation:- **library(readxl)**: Loads the readxl package, necessary for reading Excel files.<br>- **read_excel()**: Reads an Excel file into a data frame. Specify the file path and name.<br>```{r}# Reading specific sheets from an Excel fileev_data <-read_excel("Electric_Vehicle_Population_Data.xlsx", sheet ="Sheet1")# Checking for multiple sheetssheets <-excel_sheets("Electric_Vehicle_Population_Data.xlsx")#sheets <- excel_sheets("D:/2024Fall/DataCamp/Electric_Vehicle_Population_Data.xlsx")sheets```<br>- **Reading Specific Sheets**: Use the sheet parameter in read_excel() to specify which sheet to read. <br>- **Handling Excel Formats**: The package supports .xls and .xlsx formats.<br>- **excel_sheets()**: Returns a vector of sheet names in the Excel file. <br><br>## 1.3.1 Exploring the Data Using Basic Statistical FunctionsOnce the data is loaded, basic statistical functions can help summarize and understand the data.<br>```{r}# Summary statistics for Tuberculosis datasummary(air_data)# Summary statistics for Electric Vehicle datasummary(ev_data)# Shows the details of data# View(air_data)# show first or last n rowshead(air_data, 5)tail(air_data, 10)```<br>- **summary()**: Provides a summary of each column in the data, including minimum, maximum, mean, and quartiles for numeric data.<br>```{r}# Summary statistics for Tuberculosis datasummary(air_data)# Checking for missing valuesmissing_values <-sum(is.na(air_data))# Data type conversion if neededair_data$Year <-as.factor(air_data$Year)summary(air_data)```<br>- **Summary Statistics**: Provide insights into the central tendency, spread, and shape of the data distribution. <br>- **Checking Data Quality**: Identify missing values, outliers, and incorrect data types. <br>- **is.na()**: Identifies NA values in the data.<br>- **as.factor()**: Converts a variable to factor if necessary. <br><br>## 1.3.2 Exploring the Data Using Basic Statistical Functionsnames() or colnames(): both show the names of columns of a data framedim(): returns the dimensions of data frame (i.e. number of rows and number of columns)nrow(): number of rowsncol(): number of columns```{r}names(air_data)dim(air_data)nrow(air_data)ncol(air_data)```## 1.4 Writing Data to FilesR allows you to export data frames to various formats, such as CSV or Excel. The write.csv() function is commonly used to save data as a CSV file.<br>#### Writing csv files```{r}# Writing data to a CSV filewrite.csv(air_data, "Processed_Air_Quality.csv", row.names =FALSE)```<br>- **write.csv()**: Writes a data frame to a CSV file. Set row.names = FALSE to avoid including row numbers.<br>#### Writing Excel files```{r}# Installing and loading writexl package# install.packages("writexl")library(writexl)# Writing data to an Excel filewrite_xlsx(ev_data, "Processed_Electric_Vehicle_Data.xlsx")```<br>- **write_xlsx()**: Exports a data frame to an Excel file.<br># Session 2: Hands-on Exercises<br>## 2.1 Importing and Exploring Data#### Q11) Import the Air_Quality data and view the column names. <br>2) Calculate the mean of a numeric column of your choice. <br>3) Identify the number of unique values in a categorical column. <br><br>```{r}# Your code herecolumn_names <-colnames(air_data)mean_value <-mean(air_data$Data.Value, na.rm =TRUE)unique_values <-length(unique(air_data$Data.Value))```<br>#### Q21) Import the Air_Quality data and check the number of rows and columns. <br>2) Calculate the median of a numeric column. <br>3) Count the number of missing values in the Electric Vehicle data. <br>```{r}# Your code herenum_rows <-nrow(air_data)num_cols <-ncol(air_data)median_value <-median(air_data$Data.Value, na.rm =TRUE)missing_ev_data <-sum(is.na(ev_data))```<br>#### Q3 Writing Data to Files1) Export the Electric Vehicle data to a new CSV file. <br>2) Modify a column (e.g., replace missing values) and save the modified data.<br>```{r}# Your code here# Export to CSVwrite.csv(ev_data, "Processed_Electric_Vehicle_Data.csv", row.names =FALSE)# Modify and save dataev_data$Postal_Code[is.na(ev_data$Postal_Code)] <-0write.csv(ev_data, "Modified_Electric_Vehicle_Data.csv", row.names =FALSE)```<br># Session 3: Wrapping Up and Q&AThis session covered importing data from CSV and Excel files, exploring the data using basic statistical functions, and exporting data to different formats. Mastering these skills is crucial for data analysis workflows.<br>Continue practicing with different datasets and explore the various functionalities of R for data import/export.<br>**Q&A**: Feel free to ask questions or share your experiences with data import/export in R. Let's discuss any interesting findings or challenges you've encountered.<br>### Reference- https://datageneration.io/dataprogrammingwithr/intro- Chicago Harris School Coding Camp- Data.gov, the U.S. government's open data portal