Author

Karl Ho

Published

August 12, 2024

Data Structures in R

Why Data Structures Matter

Understanding data structures is crucial because they determine how we can interact with our data. Different structures are suited for different tasks, and knowing which to use can make your code more efficient and easier to read.

Creating and Manipulating Vectors

Vectors are a fundamental data structure in R, consisting of elements of the same type. They can hold numeric, character, or logical data.

Creating Vectors

Vectors can be created using the c() function or specific functions like seq() and rep().

Code
# Using c() function
characters <- c("Iron Man", "Superman", "Wonder Woman", "Batman", "Hulk")

# Using : operator for sequences
numbers <- 1:5

# Using seq() function
even_numbers <- seq(2, 10, by=2)

# Using rep() function
repeated_hero <- rep("Spider-Man", times=3)

print(characters)
print(numbers)
print(even_numbers)
print(repeated_hero)
Note

The c() function is versatile and can create vectors of any type. The : operator is a quick way to create integer sequences. seq() offers more control over sequences, and rep() is useful for creating vectors with repeated elements.
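
The behaviour described in this note can be checked with a short sketch. The object names and values below are illustrative and not part of the original example; the sketch shows that c() coerces mixed inputs to one type, that : yields integers, and two further arguments, length.out for seq() and each for rep().

```r
# c() coerces mixed inputs to one common type (here: character)
mixed <- c(1, "two", TRUE)
class(mixed)            # "character"

# The : operator produces integers; c() with plain numbers produces doubles
class(1:5)              # "integer"
class(c(1, 2, 3))       # "numeric"

# seq() can take a target length instead of a step size
quarters <- seq(0, 1, length.out = 5)
quarters                # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat each element rather than the whole vector
teams <- rep(c("Marvel", "DC"), each = 2)
teams                   # "Marvel" "Marvel" "DC" "DC"
```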

Vector Operations

Vectors support arithmetic, logical comparisons, indexing, and recycling (a shorter vector is reused to match the length of a longer one):

Code
# Arithmetic operations
strengths <- c(90, 95, 85, 80, 88)
boosted_strengths <- strengths + 5
print(boosted_strengths)

# Logical operations
is_strong <- strengths > 85
print(is_strong)

# Indexing
print(characters[2])  # Access second element
print(characters[c(1,3,5)])  # Access multiple elements

# Vector recycling
short_vector <- c(1, 2)
long_vector <- 1:10
result <- short_vector + long_vector
print(result)
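
The recycling step above runs silently because 10 is a multiple of 2. As a minimal sketch with made-up vectors, this is what happens when the longer length is not a multiple of the shorter one: R still recycles, but it raises a warning, which is captured and printed here.

```r
# Lengths 10 and 2: the shorter vector is reused five times, no warning
even_case <- 1:10 + c(1, 2)

# Lengths 5 and 2: recycling still happens, but R issues a warning
# because 5 is not a multiple of 2
odd_case <- withCallingHandlers(
  1:5 + c(1, 2),
  warning = function(w) {
    message("R warned: ", conditionMessage(w))
    invokeRestart("muffleWarning")   # continue after reporting the warning
  }
)
odd_case   # 2 4 4 6 6
```
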
Important

Exercise 1: Vector Manipulation

  • Create a vector of 5 superhero ages.
  • Increase all ages by 2 years.
  • Find which heroes are older than 40 after the increase.
  • Create a logical vector indicating if each hero is from DC (assume the first 3 are from DC).

Introduction to Matrices

Matrices are two-dimensional arrays that store data of the same type. They are useful when values are naturally indexed by row and column, such as several stats recorded for several heroes.

Creating Matrices

Use the matrix() function to create matrices. You can specify the data, number of rows, and number of columns.

Code
hero_stats <- matrix(
  c(90, 95, 85, 80, 88,   # Strength
    70, 95, 90, 85, 75),  # Intelligence
  nrow = 2, 
  ncol = 5, 
  byrow = TRUE
)

# Add row and column names
rownames(hero_stats) <- c("Strength", "Intelligence")
colnames(hero_stats) <- characters

print(hero_stats)
Note

The byrow = TRUE argument fills the matrix by rows. If omitted, it fills by columns.
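
A small sketch of the difference, using the integers 1 to 6 as placeholder data so the fill order is easy to see:

```r
values <- 1:6

# Default (byrow = FALSE): values run down each column first
by_column <- matrix(values, nrow = 2)
by_column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values run across each row first
by_row <- matrix(values, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```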

Matrix Operations

Matrices support a variety of operations:

Code
# Element-wise addition
boosted_stats <- hero_stats + 5
print(boosted_stats)

# Matrix multiplication (diag(5) is the 5 x 5 identity matrix, so the values are unchanged)
identity_product <- hero_stats %*% diag(5)
print(identity_product)

# Transpose
transposed_stats <- t(hero_stats)
print(transposed_stats)

# Accessing elements
print(hero_stats[1, 3])  # Strength of Wonder Woman
print(hero_stats[, "Batman"])  # All stats for Batman
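
Row and column summaries are a common next step. The sketch below uses a hypothetical 2 x 3 matrix (the hero names and numbers are placeholders) to show colMeans(), rowMeans(), which.max(), and a diagonal matrix that actually rescales a column.

```r
# Hypothetical stats: rows are stats, columns are heroes
stats <- matrix(
  c(90, 95, 85,
    70, 95, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Strength", "Intelligence"),
                  c("HeroA", "HeroB", "HeroC"))
)

hero_means <- colMeans(stats)   # average of the two stats per hero
stat_means <- rowMeans(stats)   # average per stat across heroes

# Name of the hero with the highest Strength
strongest <- names(which.max(stats["Strength", ]))
strongest                       # "HeroB"

# A diagonal matrix with non-unit entries scales each column separately
scaled <- stats %*% diag(c(1, 2, 1))   # doubles the second column only
scaled[1, 2]                           # 190
```
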
Important

Exercise 2: Matrix Manipulation

  • Create a 3x3 matrix of hero power levels (strength, speed, intelligence) for three new heroes.
  • Calculate the average power level for each hero.
  • Find which hero has the highest strength.
  • Scale all power levels by 1.5 and round to the nearest integer.

Introduction to Data Frames

Data frames are used to store tabular data and can contain columns of different types.

Creating Data Frames

Use the data.frame() function to create data frames. Each column can have a different type.

Code
# Create a data frame
hero_df <- data.frame(
  Name = characters,
  Strength = strengths,
  Intelligence = c(70, 95, 90, 85, 75),
  IsMarvel = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

print(hero_df)
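
To confirm that each column keeps its own type, str() and sapply() are handy. This is a minimal sketch on a hypothetical two-row data frame; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Hypothetical data frame with three column types
df <- data.frame(
  Name = c("HeroA", "HeroB"),
  Strength = c(90, 95),
  IsMarvel = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

str(df)                          # compact overview of columns and types
col_classes <- sapply(df, class) # class of each column
col_classes
dim(df)                          # 2 rows, 3 columns
```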

Data Frame Operations

Data frames support various operations, including accessing, modifying, and summarizing data.

Code
# Adding a new column
hero_df$Speed <- c(85, 100, 90, 75, 70)

# Accessing columns
print(hero_df$Name)
print(hero_df[["Strength"]])

# Filtering rows
marvel_heroes <- hero_df[hero_df$IsMarvel, ]
print(marvel_heroes)

# Using subset()
strong_heroes <- subset(hero_df, Strength > 85)
print(strong_heroes)

# Sorting
sorted_heroes <- hero_df[order(hero_df$Intelligence, decreasing = TRUE), ]
print(sorted_heroes)
Note

Data frames combine the best of both worlds: they can store different types of data (like lists) but in a tabular format (like matrices).
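
The list-like and matrix-like sides of a data frame can both be seen in a short sketch. The data frame below is a made-up example; the point is the difference between [[ ]], which extracts a column as a vector, and [ ], which keeps the data frame structure.

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

is.list(df)                 # TRUE: a data frame is a list of columns
length(df)                  # 2: the number of columns
dim(df)                     # 3 2: rows and columns, as for a matrix

col_vec <- df[["label"]]    # list-style access: a character vector
col_df  <- df["label"]      # still a data frame with one column
row_two <- df[2, ]          # matrix-style access: second row, all columns
```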

Important

Exercise 3: Data Frame Manipulation

  • Add a “PowerLevel” column that’s the average of Strength, Intelligence, and Speed.
  • Filter the data frame to show only heroes with a PowerLevel above 85.
  • Sort the heroes by PowerLevel in descending order.
  • Create a new data frame with only the Name and PowerLevel columns for non-Marvel heroes.
Code
ages <- c(48, 35, 30, 40, 49)  # Example ages
average_age <- mean(ages)
older_characters <- characters[ages > 30]

average_age
older_characters

Advanced Topics and Best Practices

Factors

Factors are used for categorical data and can be ordered or unordered.

Code
hero_types <- factor(c("Mutant", "Alien", "Human", "God", "Mutant"),
                     levels = c("Human", "Mutant", "Alien", "God"),
                     ordered = TRUE)
print(hero_types)
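
What ordering buys you is the ability to compare and rank categories. A brief sketch, reusing the factor defined above (the ranking Human < Mutant < Alien < God comes from the levels argument and is only this example's convention):

```r
hero_types <- factor(c("Mutant", "Alien", "Human", "God", "Mutant"),
                     levels = c("Human", "Mutant", "Alien", "God"),
                     ordered = TRUE)

# Comparisons follow the level order, not the alphabet
above_mutant <- hero_types > "Mutant"
above_mutant                         # FALSE TRUE FALSE TRUE FALSE

# max() and sort() also respect the level order
top_type <- as.character(max(hero_types))
top_type                             # "God"
sorted_types <- as.character(sort(hero_types))
sorted_types                         # "Human" "Mutant" "Mutant" "Alien" "God"
```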

Lists

Lists can contain elements of different types, including other lists.

Code
hero_list <- list(
  name = "Iron Man",
  stats = c(Strength = 85, Intelligence = 95, Speed = 70),
  equipment = c("Arc Reactor", "Iron Suit")
)
print(hero_list)
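
Once a list exists, its elements can be reached by name. The sketch below rebuilds the same list and shows $, [[ ]], and [ ]; the added active element is a hypothetical field used only to illustrate assignment.

```r
hero_list <- list(
  name = "Iron Man",
  stats = c(Strength = 85, Intelligence = 95, Speed = 70),
  equipment = c("Arc Reactor", "Iron Suit")
)

hero_list$name                                   # "Iron Man"
intel <- hero_list[["stats"]][["Intelligence"]]  # 95, drilled down by name

# Single brackets return a shorter list, not the element itself
sub_list <- hero_list["equipment"]
class(sub_list)                                  # "list"

# Assigning to a new name adds an element (hypothetical field)
hero_list$active <- TRUE
names(hero_list)        # "name" "stats" "equipment" "active"
```
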
Note

Final Exercise: Create a comprehensive hero database:

  • Make a data frame with at least 10 heroes, including columns for Name, Type (factor), Strength, Intelligence, and Speed.
  • Add a PowerLevel column as before.
  • Create a list for each hero with their stats and a vector of their superpowers.

Use this data to answer questions like:

  • Who is the strongest hero of each type?
  • What’s the average PowerLevel by hero type?
  • Which hero has the most superpowers?

This exercise will test your ability to work with multiple data structures and perform various operations on them.
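
As a starting point, here is one possible sketch of the grouped questions on a tiny made-up data set (four placeholder heroes instead of ten). It uses tapply() for group means, order() with duplicated() for the top row per group, and lengths() for counting list elements; other approaches work just as well.

```r
# Hypothetical data: names, types, and scores are placeholders
heroes <- data.frame(
  Name = c("A", "B", "C", "D"),
  Type = factor(c("Human", "Alien", "Human", "Alien")),
  Strength = c(80, 95, 70, 90),
  Intelligence = c(90, 85, 75, 80),
  Speed = c(60, 100, 65, 85),
  stringsAsFactors = FALSE
)
heroes$PowerLevel <- rowMeans(heroes[, c("Strength", "Intelligence", "Speed")])

# Average PowerLevel by hero type
avg_by_type <- tapply(heroes$PowerLevel, heroes$Type, mean)

# Strongest hero of each type: sort by Strength (descending),
# then keep the first row seen for each Type
by_strength <- heroes[order(heroes$Strength, decreasing = TRUE), ]
strongest_by_type <- by_strength[!duplicated(by_strength$Type),
                                 c("Type", "Name", "Strength")]

# Hero with the most superpowers, from a named list of character vectors
powers <- list(A = "gadgets", B = c("flight", "strength"))
most_powers <- names(which.max(lengths(powers)))
```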

In R, data can be stored in various forms, each suitable for different types of analysis and operations. Understanding these data object types is fundamental to effectively utilizing R for data manipulation, statistical analysis, and programming.

This chapter introduces you to the essential data object types in R, providing a foundation for more advanced topics. We will cover the following:

  • Numeric Single Value (Scalar): A single numeric value, such as an integer or a floating-point number. Scalars are the building blocks of more complex data structures.

  • Character Single Value: A single character string, which can be used to store text data. Character values are often combined into vectors or used as labels in data frames and factors.

  • Vector: A sequence of elements of the same type, such as a series of numbers or a collection of character strings. Vectors are one of the most commonly used data structures in R.

  • Factor: A special type of vector used to represent categorical data. Factors are useful for storing data that takes on a limited number of discrete values, such as gender or education level.

  • Matrix: A two-dimensional array where each element is of the same type. Matrices are essential for mathematical operations and are often used in linear algebra.

  • Array: A multi-dimensional generalization of a matrix, allowing for data storage in more than two dimensions. Arrays are useful for complex data representations in fields like image processing and scientific computing.

  • List: A flexible data structure that can hold elements of different types, including vectors, matrices, and even other lists. Lists are powerful tools for managing diverse data within a single object.

  • Data Frame: A table-like structure where each column can contain different types of data. Data frames are central to data analysis in R, allowing for organized storage and manipulation of datasets.

  • Text Data Objects (e.g., dfm): Text data is increasingly important in social sciences and humanities. In R, specialized data structures such as Document-Feature Matrices (dfm) allow for the handling and analysis of text data. These structures enable tasks like text mining, sentiment analysis, and natural language processing.

Each of these data object types serves a unique purpose and has specific operations associated with it. Throughout this chapter, we will explore how to create, manipulate, and apply these data structures in real-world scenarios, equipping you with the skills needed to manage diverse data types in your research.

1. Numeric Single Value (Scalar)

A scalar in R represents a single numeric value, such as an integer or a floating-point number. R has no separate scalar type: a single value is stored as a vector of length one.

Code
# Creating a numeric scalar
numeric_scalar <- 42
numeric_scalar
[1] 42
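
A quick check of how R stores such a value, as a minimal sketch (integer_scalar is an extra illustrative object, not part of the original example):

```r
numeric_scalar <- 42

length(numeric_scalar)      # 1: a single value is a vector of length one
is.vector(numeric_scalar)   # TRUE
typeof(numeric_scalar)      # "double": plain numbers are floating point

# The L suffix requests an integer instead
integer_scalar <- 42L
typeof(integer_scalar)      # "integer"
```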

2. Character Single Value

A character single value in R is used to store text or strings.

Code
# Creating a character scalar
character_scalar <- "Hello, R!"
character_scalar
[1] "Hello, R!"

3. Vector

A vector is a sequence of elements of the same type, commonly used for storing a series of numbers or character strings.

Code
# Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
numeric_vector
[1] 1 2 3 4 5
Code
# Creating a character vector
character_vector <- c("apple", "banana", "cherry")
character_vector
[1] "apple"  "banana" "cherry"

4. Factor

A factor is a special type of vector used to represent categorical data.

Code
# Creating a factor
fruit <- c("apple", "banana", "apple", "cherry", "banana")
factor_fruit <- factor(fruit)
factor_fruit
[1] apple  banana apple  cherry banana
Levels: apple banana cherry
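
Internally, a factor stores integer codes that point into its levels. The sketch below inspects both for the same fruit data (the vector is named fruit_values here purely for illustration):

```r
fruit_values <- c("apple", "banana", "apple", "cherry", "banana")
factor_fruit <- factor(fruit_values)

levels(factor_fruit)               # "apple" "banana" "cherry"
nlevels(factor_fruit)              # 3
codes <- as.integer(factor_fruit)  # position of each value in levels()
codes                              # 1 2 1 3 2

# table() counts how often each level occurs
fruit_counts <- table(factor_fruit)
fruit_counts[["apple"]]            # 2
```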

5. Matrix

A matrix is a two-dimensional array where each element is of the same type.

Code
# Creating a matrix
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)
matrix_data
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

6. Array

An array is a multi-dimensional generalization of a matrix.

Code
# Creating a 3-dimensional array
array_data <- array(1:12, dim = c(2, 3, 2))
array_data
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
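
Indexing an array takes one index per dimension. A short sketch on the same array, including apply() over the third dimension to summarise each slice:

```r
array_data <- array(1:12, dim = c(2, 3, 2))

dim(array_data)                 # 2 3 2
array_data[2, 3, 1]             # 6: row 2, column 3, first slice
second_slice <- array_data[, , 2]   # a 2 x 3 matrix

# Sum of the values within each slice of the third dimension
slice_sums <- apply(array_data, 3, sum)
slice_sums                      # 21 57
```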

7. List

A list is a flexible data structure that can hold elements of different types.

Code
# Creating a list
my_list <- list(name = "John Doe", age = 30, scores = c(90, 85, 88))
my_list
$name
[1] "John Doe"

$age
[1] 30

$scores
[1] 90 85 88

8. Data Frame

A data frame is a table-like structure where each column can contain different types of data.

Code
# Creating a data frame
data_frame <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(85, 92, 88)
)
data_frame
  ID    Name Score
1  1   Alice    85
2  2     Bob    92
3  3 Charlie    88

9. Text Data Objects (e.g., dfm)

Text data is increasingly important in social sciences and humanities. In R, specialized data structures like Document-Feature Matrices (dfm) are used for text analysis.

Code
# Example using the quanteda package to create a dfm
# Install the quanteda package if not already installed
# install.packages("quanteda")

library(quanteda)
Package version: 4.0.2
Unicode version: 15.1
ICU version: 74.1
Parallel computing: 16 of 16 threads used.
See https://quanteda.io for tutorials and examples.
Code
# Sample text data
texts <- c("This is a sample text", "Text analysis with R", "Learning R is fun")

# Creating a corpus
text_corpus <- corpus(texts)

# Creating a Document-Feature Matrix (dfm) from the tokenized corpus
dfm_data <- dfm(tokens(text_corpus))
dfm_data
Document-feature matrix of: 3 documents, 10 features (56.67% sparse) and 0 docvars.
       features
docs    this is a sample text analysis with r learning fun
  text1    1  1 1      1    1        0    0 0        0   0
  text2    0  0 0      0    1        1    1 1        0   0
  text3    0  1 0      0    0        0    0 1        1   1
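
To see what a document-feature matrix represents, the same counts can be rebuilt by hand. This is a base-R sketch for illustration only, not how quanteda implements it: it lowercases the texts, splits them on whitespace, and counts each feature per document. The names text1 to text3 mirror the document names in the output above.

```r
texts <- c(text1 = "This is a sample text",
           text2 = "Text analysis with R",
           text3 = "Learning R is fun")

# Naive tokenization: lowercase, then split on whitespace
tokens_list <- strsplit(tolower(texts), "\\s+")
features <- unique(unlist(tokens_list))

# Count each feature in each document (one column per document)
counts <- vapply(
  tokens_list,
  function(tok) tabulate(match(tok, features), nbins = length(features)),
  integer(length(features))
)

# Transpose so that rows are documents and columns are features
dfm_manual <- t(counts)
dimnames(dfm_manual) <- list(docs = names(texts), features = features)
dfm_manual

# Share of zero cells, comparable to the reported sparsity
sparsity <- round(mean(dfm_manual == 0) * 100, 2)
sparsity    # 56.67
```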

Exercise 1: Creating and Manipulating Vectors

Objective: Start creating and manipulating vectors in R.

Instructions:

  1. Create a numeric vector containing the numbers 10, 20, 30, 40, and 50.
  2. Create a character vector containing the names “Alice”, “Bob”, “Charlie”, “David”, and “Eve”.
  3. Use indexing to retrieve the third element from each of these vectors.
  4. Modify the second element in the numeric vector to be 25.
  5. Calculate the sum of all elements in the numeric vector.

Expected Output:

  • Numeric vector: c(10, 25, 30, 40, 50)
  • Character vector: c("Alice", "Bob", "Charlie", "David", "Eve")
  • Sum of the numeric vector: 155
Code
# Create numeric and character vectors
numeric_vector <- c(10, 20, 30, 40, 50)
character_vector <- c("Alice", "Bob", "Charlie", "David", "Eve")

# Retrieve the third element
numeric_vector[3]
[1] 30
Code
character_vector[3]
[1] "Charlie"
Code
# Modify the second element
numeric_vector[2] <- 25

# Calculate the sum of all elements
sum(numeric_vector)
[1] 155
Code
# Try renaming the objects?

Exercise 2: Working with Factors

Objective: Understand how to create and manipulate factors in R.

Instructions:

  1. Create a factor variable from the following vector: c("low", "medium", "high", "low", "medium", "high").
  2. Display the levels of the factor variable.
  3. Convert the factor levels to an ordered factor where “low” < “medium” < “high”.
  4. Create a bar plot to visualize the frequency of each level.

Expected Output:

  • Levels: c("high", "low", "medium") (alphabetical by default)
  • Ordered factor levels: low < medium < high

R Code:

Code
# Create a factor variable
factor_variable <- factor(c("low", "medium", "high", "low", "medium", "high"))

# Display the levels
levels(factor_variable)
[1] "high"   "low"    "medium"
Code
# Convert to an ordered factor
ordered_factor <- factor(factor_variable, levels = c("low", "medium", "high"), ordered = TRUE)

# Create a bar plot
barplot(table(ordered_factor))

Exercise 3: Data Frame Operations

Objective: Learn how to create, access, and manipulate data frames.

Instructions:

  1. Create a data frame with the following columns: ID (1, 2, 3), Name (“Alice”, “Bob”, “Charlie”), and Score (85, 90, 88).
  2. Access the Name column and print it.
  3. Add a new column Pass that indicates whether the Score is greater than or equal to 90.
  4. Calculate the average Score for all students.

Expected Output:

  • Data frame with a new Pass column
  • Average score: 87.67

R Code:

Code
# Create a data frame
data_frame <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(85, 90, 88)
)

# Access the Name column
data_frame$Name
[1] "Alice"   "Bob"     "Charlie"
Code
# Add a new column
data_frame$Pass <- data_frame$Score >= 90

# Calculate the average score
mean(data_frame$Score)
[1] 87.66667

Exercise 4: Text Data Manipulation with stringr

Objective: Practice manipulating text data using the stringr package.

Instructions:

  1. Load the stringr package.
  2. Create a character string: "The quick brown fox jumps over the lazy dog".
  3. Count the number of words in the string.
  4. Extract the word “quick” from the string.
  5. Replace the word “lazy” with “energetic”.

Expected Output:

  • Number of words: 9
  • Extracted word: "quick"
  • Modified string: "The quick brown fox jumps over the energetic dog"

R Code:

Code
# Load the stringr package
library(stringr)

Attaching package: 'stringr'
The following object is masked _by_ '.GlobalEnv':

    fruit
Code
# Create a character string
text_string <- "The quick brown fox jumps over the lazy dog"

# Count the number of words
str_count(text_string, "\\w+")
[1] 9
Code
# Extract the word "quick"
str_extract(text_string, "quick")
[1] "quick"
Code
# Replace "lazy" with "energetic"
str_replace(text_string, "lazy", "energetic")
[1] "The quick brown fox jumps over the energetic dog"
Code
# Try the following and count again?
# text_string <- "The quick brown fox jumps over the lazy dog #"
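
For comparison, the same three steps can be sketched in base R without stringr, using gregexpr(), regmatches(), and sub(). The last lines also address the question above: a trailing "#" is not a word character, so the \\w+ count stays the same, while a naive split on spaces counts it as an extra item.

```r
text_string <- "The quick brown fox jumps over the lazy dog"

# Count runs of word characters
n_words <- lengths(gregexpr("\\w+", text_string))
n_words                                    # 9

# Extract the first match of "quick"
extracted <- regmatches(text_string, regexpr("quick", text_string))
extracted                                  # "quick"

# Replace the first occurrence of "lazy"
modified <- sub("lazy", "energetic", text_string, fixed = TRUE)

# With a trailing "#", the regex count is unchanged but a space split is not
with_hash <- paste(text_string, "#")
n_regex <- lengths(gregexpr("\\w+", with_hash))   # 9
n_split <- length(strsplit(with_hash, " ")[[1]])  # 10
```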

Exercise 5: Creating and Analyzing a Document-Term Matrix (DTM)

Objective: Learn how to create and analyze a Document-Term Matrix using text data.

Instructions:

  1. Load the tm package.
  2. Create a small corpus using the following text documents:
    • “R is a programming language for data analysis.”
    • “Data analysis in R is powerful and flexible.”
    • “Learning R can be fun and rewarding.”
  3. Create a Document-Term Matrix (DTM) from the corpus.
  4. Inspect the DTM to see the term frequency matrix.
  5. Identify the term with the highest frequency across all documents.

Expected Output:

  • A DTM with term frequencies
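  • The term with the highest frequency (with tm’s default settings this is “data”; “R” does not appear because terms shorter than three characters are dropped by default)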

R Code:

Code
# Load the tm package
library(tm)
Loading required package: NLP

Attaching package: 'NLP'
The following objects are masked from 'package:quanteda':

    meta, meta<-

Attaching package: 'tm'
The following object is masked from 'package:quanteda':

    stopwords
Code
# Create a corpus
docs <- c("R is a programming language for data analysis.",
          "Data analysis in R is powerful and flexible.",
          "Learning R can be fun and rewarding.")
corpus <- Corpus(VectorSource(docs))

# Create a Document-Term Matrix
dtm <- DocumentTermMatrix(corpus)

# Inspect the DTM
inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 13)>>
Non-/sparse entries: 15/24
Sparsity           : 62%
Maximal term length: 11
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs analysis analysis. and can data flexible. for language powerful
   1        0         1   0   0    1         0   1        1        0
   2        1         0   1   0    1         1   0        0        1
   3        0         0   1   1    0         0   0        0        0
    Terms
Docs programming
   1           1
   2           0
   3           0
Code
# Find the term with the highest frequency
# (which.max() returns the first term when several terms tie)
term_frequencies <- colSums(as.matrix(dtm))
most_frequent_term <- names(term_frequencies[which.max(term_frequencies)])
most_frequent_term
[1] "data"
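
The result is "data" rather than "R" because tm, by default, keeps only terms of at least three characters (its wordLengths control option defaults to c(3, Inf)). The base-R sketch below is an approximation for illustration, not tm's tokenizer: it strips punctuation, counts all terms, and then applies the same length threshold to show its effect.

```r
docs <- c("R is a programming language for data analysis.",
          "Data analysis in R is powerful and flexible.",
          "Learning R can be fun and rewarding.")

# Lowercase, strip punctuation, split on whitespace
clean <- gsub("[[:punct:]]", "", tolower(docs))
terms <- unlist(strsplit(clean, "\\s+"))

# Counting every term: "r" appears once in each document
all_counts <- table(terms)
top_all <- names(all_counts)[which.max(all_counts)]
top_all                                   # "r"

# Keeping only terms with three or more characters removes "r"
long_counts <- all_counts[nchar(names(all_counts)) >= 3]
"r" %in% names(long_counts)               # FALSE
max(long_counts)                          # 2: several terms now tie
```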

Reference:

Nahhas, Ramzi W. 2024. An Introduction to R for Research