EPPS Math Coding Camp

Importing and Exporting Data

Author

Karl Ho

Published

August 12, 2024

Session 1: Importing and Exporting Data

1.1 Working directory

The working directory is the folder in which you are currently working.

Code

# 1.1.1 Direct import If the data is in the same folder as the current working directory

air_data <- read.csv("Air_Quality.csv")

# 1.1.2 Setting working directory before direct import

#setwd('D:/Summer Coding Camp/Jeong_CodingCamp_draft_code/Jeong_CodingCamp_draft_code')
air_data <- read.csv("Air_Quality.csv")

# 1.1.3 Reading by giving the absolute address

#air_data <- read.csv('D:/Summer Coding Camp/Jeong_CodingCamp_draft_code/Jeong_CodingCamp_draft_code/Air_Quality.csv')

1.2 Loading data of various formats

file type	package	function
.csv	readr	read_csv()
.dta (stata)	haven	read_dta()
.xlsx	readxl	read_xlsx()

1.2.1 Reading CSV Files

CSV (Comma-Separated Values) files are commonly used for storing tabular data. In R, you can use the read.csv() function to import CSV files.

Example: Importing a CSV File

Code

# Reading a CSV file

air_data <- read.csv("Air_Quality.csv")
#air_data <- read.csv("D:/2024Fall/DataCamp/Air_Quality.csv")




# Viewing the first few rows of the data
head(air_data)

  Unique_ID Indicator_ID                                  Name        Measure
1    179772          640 Boiler Emissions- Total SO2 Emissions Number per km2
2    179785          640 Boiler Emissions- Total SO2 Emissions Number per km2
3    178540          365               Fine particles (PM 2.5)           Mean
4    178561          365               Fine particles (PM 2.5)           Mean
5    823217          365               Fine particles (PM 2.5)           Mean
6    177910          365               Fine particles (PM 2.5)           Mean
  Measure_Info Type_Name Year              Place_Name Data.Value
1       number     UHF42 2015        Southeast Queens        0.3
2       number     UHF42 2015 Bensonhurst - Bay Ridge        1.2
3       mcg/m3     UHF42 2012 Bensonhurst - Bay Ridge        8.6
4       mcg/m3     UHF42 2012        Southeast Queens        8.0
5       mcg/m3     UHF42 2022        Southeast Queens        6.1
6       mcg/m3     UHF42 2012 Bensonhurst - Bay Ridge       10.0

Explanation:

read.csv(): Reads a CSV file into a data frame. Specify the file path and name.
File Path: The file path can be absolute or relative. Always ensure the correct path format, especially on different operating systems.
head(): Displays the first few rows of the data, useful for a quick preview.

Additional Examples:

Code

# Reading a CSV file with additional options

air_data <- read.csv("Air_Quality.csv", 
                    na.strings = c("", "NA"), 
                    stringsAsFactors = FALSE)



# Previewing the data structure
str(air_data)

'data.frame':   14077 obs. of  9 variables:
 $ Unique_ID   : int  179772 179785 178540 178561 823217 177910 177952 177973 177931 742274 ...
 $ Indicator_ID: int  640 640 365 365 365 365 365 365 365 365 ...
 $ Name        : chr  "Boiler Emissions- Total SO2 Emissions" "Boiler Emissions- Total SO2 Emissions" "Fine particles (PM 2.5)" "Fine particles (PM 2.5)" ...
 $ Measure     : chr  "Number per km2" "Number per km2" "Mean" "Mean" ...
 $ Measure_Info: chr  "number" "number" "mcg/m3" "mcg/m3" ...
 $ Type_Name   : chr  "UHF42" "UHF42" "UHF42" "UHF42" ...
 $ Year        : int  2015 2015 2012 2012 2022 2012 2013 2013 2012 2021 ...
 $ Place_Name  : chr  "Southeast Queens" "Bensonhurst - Bay Ridge" "Bensonhurst - Bay Ridge" "Southeast Queens" ...
 $ Data.Value  : num  0.3 1.2 8.6 8 6.1 10 9.8 9.8 9.6 7.2 ...

- Handling Missing Data: By default, read.csv() interprets blank cells as NA (Not Available). You can control this with the na.strings parameter.
- stringsAsFactors: Set to FALSE to prevent automatic conversion of character strings to factors.

1.2.2 Reading Excel Files

Excel files can be read into R using the readxl package, which provides functions like read_excel().

Code

# Installing and loading the readxl package
# install.packages("readxl")
library(readxl)

# Reading an Excel file
ev_data <- read_excel("Electric_Vehicle_Population_Data.xlsx")
#ev_data <- read_excel("D:/2024Fall/DataCamp/Electric_Vehicle_Population_Data.xlsx")



# Viewing the first few rows of the data
head(ev_data)

# A tibble: 6 × 17
  `VIN (1-10)` County    City         State Postal_Code Model_Year Make    Model
  <chr>        <chr>     <chr>        <chr>       <dbl>      <dbl> <chr>   <chr>
1 1C4RJXN66R   Snohomish Everett      WA          98204       2024 JEEP    WRAN…
2 KNDJX3AEXG   King      Renton       WA          98058       2016 KIA     SOUL 
3 5YJ3E1EA3L   King      Seattle      WA          98125       2020 TESLA   MODE…
4 1G1RC6S5XH   Kitsap    Port Orchard WA          98367       2017 CHEVRO… VOLT 
5 5UXTA6C09P   Snohomish Monroe       WA          98272       2023 BMW     X5   
6 1FMCU0EZXN   Yakima    Moxee        WA          98936       2022 FORD    ESCA…
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
#   `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
#   Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
#   `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
#   `2020 Census Tract` <dbl>

Explanation:

library(readxl): Loads the readxl package, necessary for reading Excel files.
read_excel(): Reads an Excel file into a data frame. Specify the file path and name.

Code

# Reading specific sheets from an Excel file
ev_data <- read_excel("Electric_Vehicle_Population_Data.xlsx", 
                      sheet = "Sheet1")



# Checking for multiple sheets
sheets <- excel_sheets("Electric_Vehicle_Population_Data.xlsx")
#sheets <- excel_sheets("D:/2024Fall/DataCamp/Electric_Vehicle_Population_Data.xlsx")
sheets

[1] "Sheet1"

- Reading Specific Sheets: Use the sheet parameter in read_excel() to specify which sheet to read.
- Handling Excel Formats: The package supports .xls and .xlsx formats.
- excel_sheets(): Returns a vector of sheet names in the Excel file.

1.3.1 Exploring the Data Using Basic Statistical Functions

Once the data is loaded, basic statistical functions can help summarize and understand the data.

Code

# Summary statistics for Tuberculosis data
summary(air_data)

   Unique_ID       Indicator_ID     Name             Measure         
 Min.   :121644   Min.   :365   Length:14077       Length:14077      
 1st Qu.:174276   1st Qu.:365   Class :character   Class :character  
 Median :410845   Median :375   Mode  :character   Mode  :character  
 Mean   :428717   Mean   :453                                        
 3rd Qu.:650015   3rd Qu.:643                                        
 Max.   :828353   Max.   :661                                        
 Measure_Info        Type_Name              Year       Place_Name       
 Length:14077       Length:14077       Min.   :2005   Length:14077      
 Class :character   Class :character   1st Qu.:2011   Class :character  
 Mode  :character   Mode  :character   Median :2015   Mode  :character  
                                       Mean   :2015                     
                                       3rd Qu.:2019                     
                                       Max.   :2022                     
   Data.Value    
 Min.   :  0.00  
 1st Qu.:  8.80  
 Median : 15.10  
 Mean   : 22.52  
 3rd Qu.: 28.00  
 Max.   :424.70

Code

# Summary statistics for Electric Vehicle data
summary(ev_data)

  VIN (1-10)           County              City              State          
 Length:194232      Length:194232      Length:194232      Length:194232     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  Postal_Code      Model_Year       Make              Model          
 Min.   : 1731   Min.   :1997   Length:194232      Length:194232     
 1st Qu.:98052   1st Qu.:2019   Class :character   Class :character  
 Median :98125   Median :2022   Mode  :character   Mode  :character  
 Mean   :98176   Mean   :2021                                        
 3rd Qu.:98372   3rd Qu.:2023                                        
 Max.   :99577   Max.   :2025                                        
 NA's   :9                                                           
 Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility
 Length:194232         Length:194232                                    
 Class :character      Class :character                                 
 Mode  :character      Mode  :character                                 
                                                                        
                                                                        
                                                                        
                                                                        
 Electric_Range     Base MSRP        Legislative District DOL Vehicle ID     
 Min.   :  0.00   Min.   :     0.0   Min.   : 1.00        Min.   :     4385  
 1st Qu.:  0.00   1st Qu.:     0.0   1st Qu.:17.00        1st Qu.:187225086  
 Median :  0.00   Median :     0.0   Median :33.00        Median :233940241  
 Mean   : 54.84   Mean   :   978.7   Mean   :29.01        Mean   :224892309  
 3rd Qu.: 68.00   3rd Qu.:     0.0   3rd Qu.:42.00        3rd Qu.:260115926  
 Max.   :337.00   Max.   :845000.0   Max.   :49.00        Max.   :479254772  
 NA's   :2        NA's   :2          NA's   :432                             
 Vehicle Location   Electric Utility   2020 Census Tract  
 Length:194232      Length:194232      Min.   :1.001e+09  
 Class :character   Class :character   1st Qu.:5.303e+10  
 Mode  :character   Mode  :character   Median :5.303e+10  
                                       Mean   :5.298e+10  
                                       3rd Qu.:5.305e+10  
                                       Max.   :5.602e+10  
                                       NA's   :9

Code

# Shows the details of data
# View(air_data)

# show first or last n rows
head(air_data, 5)

  Unique_ID Indicator_ID                                  Name        Measure
1    179772          640 Boiler Emissions- Total SO2 Emissions Number per km2
2    179785          640 Boiler Emissions- Total SO2 Emissions Number per km2
3    178540          365               Fine particles (PM 2.5)           Mean
4    178561          365               Fine particles (PM 2.5)           Mean
5    823217          365               Fine particles (PM 2.5)           Mean
  Measure_Info Type_Name Year              Place_Name Data.Value
1       number     UHF42 2015        Southeast Queens        0.3
2       number     UHF42 2015 Bensonhurst - Bay Ridge        1.2
3       mcg/m3     UHF42 2012 Bensonhurst - Bay Ridge        8.6
4       mcg/m3     UHF42 2012        Southeast Queens        8.0
5       mcg/m3     UHF42 2022        Southeast Queens        6.1

Code

tail(air_data, 10)

      Unique_ID Indicator_ID                          Name       Measure
14068    154601          643 Annual vehicle miles traveled Million miles
14069    154618          643 Annual vehicle miles traveled Million miles
14070    154604          643 Annual vehicle miles traveled Million miles
14071    816732          643 Annual vehicle miles traveled Million miles
14072    816731          643 Annual vehicle miles traveled Million miles
14073    816914          643 Annual vehicle miles traveled Million miles
14074    816913          643 Annual vehicle miles traveled Million miles
14075    816872          643 Annual vehicle miles traveled Million miles
14076    816832          643 Annual vehicle miles traveled Million miles
14077    151658          643 Annual vehicle miles traveled Million miles
         Measure_Info Type_Name Year                        Place_Name
14068 per square mile        CD 2005     Flatlands and Canarsie (CD18)
14069 per square mile        CD 2005 Tottenville and Great Kills (CD3)
14070 per square mile        CD 2005             Jackson Heights (CD3)
14071 per square mile        CD 2010     Flatlands and Canarsie (CD18)
14072 per square mile        CD 2019     Flatlands and Canarsie (CD18)
14073 per square mile        CD 2019 Tottenville and Great Kills (CD3)
14074 per square mile        CD 2010 Tottenville and Great Kills (CD3)
14075 per square mile     UHF42 2010              Canarsie - Flatlands
14076 per square mile     UHF42 2010                  Southwest Queens
14077 per square mile     UHF42 2005                           Jamaica
      Data.Value
14068       39.8
14069       13.0
14070       64.2
14071       38.5
14072       44.4
14073       12.9
14074       14.7
14075       43.4
14076       65.8
14077       41.0

- summary(): Provides a summary of each column in the data, including minimum, maximum, mean, and quartiles for numeric data.

Code

# Summary statistics for Tuberculosis data
summary(air_data)

   Unique_ID       Indicator_ID     Name             Measure         
 Min.   :121644   Min.   :365   Length:14077       Length:14077      
 1st Qu.:174276   1st Qu.:365   Class :character   Class :character  
 Median :410845   Median :375   Mode  :character   Mode  :character  
 Mean   :428717   Mean   :453                                        
 3rd Qu.:650015   3rd Qu.:643                                        
 Max.   :828353   Max.   :661                                        
 Measure_Info        Type_Name              Year       Place_Name       
 Length:14077       Length:14077       Min.   :2005   Length:14077      
 Class :character   Class :character   1st Qu.:2011   Class :character  
 Mode  :character   Mode  :character   Median :2015   Mode  :character  
                                       Mean   :2015                     
                                       3rd Qu.:2019                     
                                       Max.   :2022                     
   Data.Value    
 Min.   :  0.00  
 1st Qu.:  8.80  
 Median : 15.10  
 Mean   : 22.52  
 3rd Qu.: 28.00  
 Max.   :424.70

Code

# Checking for missing values
missing_values <- sum(is.na(air_data))

# Data type conversion if needed
air_data$Year <- as.factor(air_data$Year)

summary(air_data)

   Unique_ID       Indicator_ID     Name             Measure         
 Min.   :121644   Min.   :365   Length:14077       Length:14077      
 1st Qu.:174276   1st Qu.:365   Class :character   Class :character  
 Median :410845   Median :375   Mode  :character   Mode  :character  
 Mean   :428717   Mean   :453                                        
 3rd Qu.:650015   3rd Qu.:643                                        
 Max.   :828353   Max.   :661                                        
                                                                     
 Measure_Info        Type_Name              Year       Place_Name       
 Length:14077       Length:14077       2019   :1515   Length:14077      
 Class :character   Class :character   2011   :1399   Class :character  
 Mode  :character   Mode  :character   2014   :1281   Mode  :character  
                                       2017   :1185                     
                                       2010   :1167                     
                                       2013   : 849                     
                                       (Other):6681                     
   Data.Value    
 Min.   :  0.00  
 1st Qu.:  8.80  
 Median : 15.10  
 Mean   : 22.52  
 3rd Qu.: 28.00  
 Max.   :424.70

- Summary Statistics: Provide insights into the central tendency, spread, and shape of the data distribution.
- Checking Data Quality: Identify missing values, outliers, and incorrect data types.
- is.na(): Identifies NA values in the data.
- as.factor(): Converts a variable to factor if necessary.

1.3.2 Exploring the Data Using Basic Statistical Functions

names() or colnames(): both show the names of columns of a data frame

dim(): returns the dimensions of data frame (i.e. number of rows and number of columns)

nrow(): number of rows

ncol(): number of columns

Code

names(air_data)

[1] "Unique_ID"    "Indicator_ID" "Name"         "Measure"      "Measure_Info"
[6] "Type_Name"    "Year"         "Place_Name"   "Data.Value"

Code

dim(air_data)

[1] 14077     9

Code

nrow(air_data)

[1] 14077

Code

ncol(air_data)

[1] 9

1.4 Writing Data to Files

R allows you to export data frames to various formats, such as CSV or Excel. The write.csv() function is commonly used to save data as a CSV file.

Writing csv files

Code

# Writing data to a CSV file
write.csv(air_data, "Processed_Air_Quality.csv", 
          row.names = FALSE)

- write.csv(): Writes a data frame to a CSV file. Set row.names = FALSE to avoid including row numbers.

Writing Excel files

Code

# Installing and loading writexl package
# install.packages("writexl")
library(writexl)

Warning: package 'writexl' was built under R version 4.3.3

Code

# Writing data to an Excel file
write_xlsx(ev_data, "Processed_Electric_Vehicle_Data.xlsx")

- write_xlsx(): Exports a data frame to an Excel file.

Session 2: Hands-on Exercises

2.1 Importing and Exploring Data

Q1

Import the Air_Quality data and view the column names.
Calculate the mean of a numeric column of your choice.
Identify the number of unique values in a categorical column.

Code

# Your code here
column_names <- colnames(air_data)
mean_value <- mean(air_data$Data.Value, na.rm = TRUE)
unique_values <- length(unique(air_data$Data.Value))

Q2

Import the Air_Quality data and check the number of rows and columns.
Calculate the median of a numeric column.
Count the number of missing values in the Electric Vehicle data.

Code

# Your code here
num_rows <- nrow(air_data)
num_cols <- ncol(air_data)
median_value <- median(air_data$Data.Value, na.rm = TRUE)
missing_ev_data <- sum(is.na(ev_data))

Q3 Writing Data to Files

Export the Electric Vehicle data to a new CSV file.
Modify a column (e.g., replace missing values) and save the modified data.

Code

# Your code here
# Export to CSV
write.csv(ev_data, "Processed_Electric_Vehicle_Data.csv", 
          row.names = FALSE)

# Modify and save data
ev_data$Postal_Code[is.na(ev_data$Postal_Code)] <- 0
write.csv(ev_data, "Modified_Electric_Vehicle_Data.csv", 
          row.names = FALSE)

Session 3: Wrapping Up and Q&A

This session covered importing data from CSV and Excel files, exploring the data using basic statistical functions, and exporting data to different formats. Mastering these skills is crucial for data analysis workflows.

Continue practicing with different datasets and explore the various functionalities of R for data import/export.

Q&A: Feel free to ask questions or share your experiences with data import/export in R. Let’s discuss any interesting findings or challenges you’ve encountered.

Reference

https://datageneration.io/dataprogrammingwithr/intro
Chicago Harris School Coding Camp
Data.gov, the U.S. government’s open data portal