Introduction

This guide is divided into two parts. The first part provides a basic introduction to the R programming language, while the second part focuses on practical code snippets for creating network visualizations, generating statistics and performing analysis using the igraph package.

Section 1: An introduction to R.

Getting Started with R

To become proficient in R, it’s helpful to think of coding in a way similar to learning a language. Start with the fundamentals, which are the basic operators. Operators are symbols that instruct the computer to perform specific actions. For example, in 1 + 2, the + operator performs addition. In a <- 1:5, the <- operator assigns values. Begin by familiarizing yourself with these operators; you don’t need to memorize them all at once. Focus on the ones you encounter frequently, and gradually expand your knowledge.

Key Concepts

As you gain confidence with operators, move on to writing your own code snippets. To do this effectively, understand the fundamental rules of R:

  • R is Case Sensitive: Pay attention to letter case; uppercase and lowercase letters are treated differently.
  • R Executes Code Sequentially: R processes code from top to bottom, so the order of your commands matters.
  • R Reads Left to Right: Code is evaluated from left to right, so the sequence of operations is crucial.
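
One nuance is worth seeing in practice: within a single expression, R follows operator precedence rather than a strict left-to-right order. The sketch below uses only base R; the variable names are illustrative and not part of the original guide.

```r
# Multiplication binds tighter than addition
r1 <- 2 + 3 * 4        # 14, not 20

# Exponentiation binds tighter than unary minus
r2 <- -2^2             # -4, i.e. -(2^2)

# The sequence operator binds tighter than multiplication
r3 <- 1:3 * 2          # 2 4 6, i.e. (1:3) * 2

# Parentheses make the intended order explicit
r4 <- (2 + 3) * 4      # 20

# Case sensitivity: `value` and `Value` are two different objects
value <- 1
Value <- 2
r5 <- value + Value    # 3
```

When in doubt, adding parentheses costs nothing and makes the intent clear to the next reader.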

Learn about essential data structures like vectors, matrices, data frames, lists, and arrays, and how to manipulate them. For example, 2 + "1" is not a valid operation because a number cannot be added to a character string, but you can make it valid by converting the string first with as.numeric("1"). Understanding object classes and subsetting data within objects is crucial.
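
As a minimal sketch of why object classes matter, the lines below show an operation that fails on mismatched types and one way to repair it. All object names are illustrative.

```r
# Adding a number to a character string fails; capture the error message
msg <- tryCatch(2 + "1", error = function(e) conditionMessage(e))

# Converting the string first makes the operation valid
ok <- 2 + as.numeric("1")    # 3

# Mixing types inside c() silently coerces to the most general type
mixed <- c(1, TRUE, "a")
class(mixed)                 # "character"

# class() and typeof() report how R labels and stores an object
class(1L); typeof(1L)        # "integer", "integer"
class(1);  typeof(1)         # "numeric", "double"
```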

Learning by Doing

The best way to learn is by doing. When you have a clear, step-by-step plan in mind, there’s likely a way to code it in R.

Mastering Base R

Familiarize yourself with base R, which comprises the core functions that don’t require additional packages. This forms the foundation of your knowledge. New users often make the mistake of installing numerous unnecessary packages. Keep it simple and use additional packages like igraph for specialized tasks, such as network analysis, only when necessary.

Operators Reference

Operators are symbols that provide instructions to the computer for specific tasks, such as variable manipulation, statement evaluation, function creation, and general operations. You can find a comprehensive list of operators in the R documentation.

  Arithmetic, Logical, and Other Operators
- Minus, can be unary or binary
+ Plus, can be unary or binary
! Logical not (Negation)
~ Tilde (used in model formulae)
? Help
: Sequence, binary (in model formulae: interaction)
* Multiplication, binary
/ Division, binary
^ Exponentiation, binary
%xyz% Special binary operators, xyz can be replaced by any valid name
%% Modulus, binary
%/% Integer divide, binary
%*% Matrix product, binary
%o% Outer product, binary
%x% Kronecker product, binary
%in% Matching operator, binary (in model formulae: nesting)
< Less than, binary
> Greater than, binary
== Equal to, binary
>= Greater than or equal to, binary
<= Less than or equal to, binary
& And, binary, vectorized
&& And, binary, not vectorized
| Or, binary, vectorized
|| Or, binary, not vectorized
<- Left assignment, binary
-> Right assignment, binary
$ List subset, binary

R Operators

Understanding R’s basic syntax and notation, including elements such as parentheses and semicolons, is crucial for using the language effectively. The examples below use operators to perform fundamental algebraic and logical operations.

1 + 5       # Addition
## [1] 6
5 * 6       # Multiplication
## [1] 30
4 ^ -1      # Exponentiation
## [1] 0.25
3 / 2       # Division
## [1] 1.5
4 / (6 * 6) * (2 - 4)  # Complex arithmetic expression
## [1] -0.2222222
# Integer division
6 %/% 4
## [1] 1
# Returns the remainder
6 %% 4
## [1] 2
4:7         # Create a sequence of numbers
## [1] 4 5 6 7
# Logical Statements

(TRUE == FALSE) == FALSE
## [1] TRUE
(F == F) == T
## [1] TRUE
4 > 5
## [1] FALSE
7 < 2
## [1] FALSE
(6 * 7) == (7 * 6)
## [1] TRUE
c(2, 3) == c(3, 2)
## [1] FALSE FALSE
c(3, 2) == c(3, 2)
## [1] TRUE TRUE
(3 + 2) & (2 + 3) == 5
## [1] TRUE
# Using |
vector1 <- c(TRUE, FALSE, TRUE)
vector2 <- c(FALSE, TRUE, TRUE)

# Element-wise logical OR using |
result1 <- vector1 | vector2
result1
## [1] TRUE TRUE TRUE

# Using %in%
c(2, 3) %in% c(2, 4, 3)
## [1] TRUE TRUE
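
The operator table distinguishes vectorized (&, |) from non-vectorized (&&, ||) logical operators. The following sketch, with illustrative variable names, shows the difference on single-value and multi-value inputs.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Vectorized forms compare element by element
x & y    # TRUE FALSE FALSE
x | y    # TRUE TRUE TRUE

# Scalar forms expect single values and short-circuit:
# the right-hand side is only evaluated when it is needed
short <- FALSE && stop("never evaluated")   # FALSE, no error raised

# Reduce a logical vector to one value before using it in if ()
if (any(x) && !all(y)) {
  verdict <- "some x are TRUE and not all y are TRUE"
}
```

As a rule of thumb, use & and | when working on whole vectors, and && and || inside if conditions.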

Understanding Objects in R Programming

R is a powerful programming language known for its object-based approach. In practical terms, this means that every piece of data in R, apart from operators and syntax, is treated as an object with specific attributes. These attributes include class, type (typeof), length, dimension, and structure. To effectively work with R, it’s crucial to grasp the fundamental concept of objects and how they function within the language. Let’s dive into some of the foundational aspects of objects in R.

Vectors: The Building Blocks

In R, vectors are the fundamental building blocks of data. They are often referred to as atomic vectors because they can hold elements of a single data type. Here are some key points about vectors:

Empty Vectors: You can create empty vectors using the NULL keyword.

z <- NULL

Numeric Vectors: Numeric vectors store numerical values, and you can assign names to elements within a vector.

a <- c('a' = 2.3, 'b' = 4)

Integers: R also supports integer vectors, which can be created using the L suffix.

b <- c(2L, 9L)

Logical Vectors: Logical vectors store TRUE and FALSE values.

d <- c(TRUE, FALSE)

Character Vectors: Character vectors hold text or character data.

e <- c("A", 'B')

Factor: Factors are used to represent categorical variables. They have levels and can be ordered or unordered.

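# Note: the values 1:2 are not among the levels, so both elements become NA.
# factor(1:2, levels = 1:2, labels = c('male', 'female')) would label them instead.
f <- factor(1:2, levels = c('male', "female"))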
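
To see how values, levels, and labels interact, here is a small sketch using base R only. The object names (sex, coded, unmatched) are illustrative.

```r
# Character values that match the declared levels
sex <- factor(c("male", "female", "male"), levels = c("male", "female"))
levels(sex)        # "male" "female"
as.integer(sex)    # 1 2 1 (internal integer codes)
table(sex)         # counts per level

# Integer codes mapped to labels
coded <- factor(1:2, levels = 1:2, labels = c("male", "female"))

# Values that are not among the levels become NA
unmatched <- factor(1:2, levels = c("male", "female"))
is.na(unmatched)   # TRUE TRUE
```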

Operations on Vectors: You can perform various operations on vectors, such as addition, multiplication, and more. Vectors are the foundation for more complex data structures in R, and understanding their properties and manipulation is essential.

4 * a
##    a    b 
##  9.2 16.0
(a) ^ -1
##         a         b 
## 0.4347826 0.2500000
a1 <- c(4, 7)
names(a1) <- c('a', 'b')

a + a1
##    a    b 
##  6.3 11.0

Stay tuned as we explore more about matrices, data frames, functions, and lists in the world of R programming. These concepts will further enhance your ability to work with data effectively in R.

Matrices

Matrices are 2-dimensional arrays whose elements all share a single atomic type. They are essential for statistical analyses and algorithms that involve mathematical manipulation. Because a matrix can hold only one type, mixing types coerces every element to the most general one.

Let’s create a matrix with numeric vector elements and examine its type using the typeof function:

# Matrices
# Basic
A <- matrix(1:9, ncol = 3, byrow = TRUE)
class(A)
## [1] "matrix" "array"
typeof(A)
## [1] "integer"
# Mixing in character elements coerces the whole matrix to character
Z <- matrix(c(1:9, LETTERS[1:3]), ncol = 4, byrow = TRUE)
class(Z)
## [1] "matrix" "array"
typeof(Z)
## [1] "character"
# Math operators don't work.
Z + Z
## Error in Z + Z: non-numeric argument to binary operator
# Change the elements of the matrix
A[upper.tri(A)] <- 1
A[lower.tri(A)] <- 2
diag(A) <- 3
A
##      [,1] [,2] [,3]
## [1,]    3    1    1
## [2,]    2    3    1
## [3,]    2    2    3
# Combining vectors by column
B <- cbind(2:0, 1:3, 0:2)
B
##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    1    2    1
## [3,]    0    3    2
# Combining vectors by row
C <- rbind(1:3, 4:6, 7:9)
C
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Basic Linear Algebra

We can perform basic linear algebra operations on matrices:

# Basic Linear Algebra
# Vector Operations
4 * a
##    a    b 
##  9.2 16.0
(a) ^ -1
##         a         b 
## 0.4347826 0.2500000
a + a1
##    a    b 
##  6.3 11.0
# Matrix Transpose
t(A)
##      [,1] [,2] [,3]
## [1,]    3    2    2
## [2,]    1    3    2
## [3,]    1    1    3
# Matrix Addition
A + B - C
##      [,1] [,2] [,3]
## [1,]    4    0   -2
## [2,]   -1    0   -4
## [3,]   -5   -3   -4
# Matrix Product
A %*% B
##      [,1] [,2] [,3]
## [1,]    7    8    3
## [2,]    7   11    5
## [3,]    6   15    8
# Cross Product
t(A) %*% B == crossprod(A, B)
##      [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
# Inverse
C <- matrix(c(39L, 8L, 71L, 72L, 54L, 42L, 76L, 77L, 15L), ncol = 3)
D <- solve(C)
C %*% D
##      [,1]          [,2] [,3]
## [1,]    1  0.000000e+00    0
## [2,]    0  1.000000e+00    0
## [3,]    0 -4.440892e-16    1
round(C %*% D)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
# Eigenvalues and Eigenvectors
eigen(C)
## eigen() decomposition
## $values
## [1] 147.741703 -34.981904  -4.759798
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.6935491 -0.1529705  0.3017373
## [2,] -0.4916846 -0.6387581 -0.7727752
## [3,] -0.5265319  0.7540478  0.5583664
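# Use the full component names and avoid overwriting the vector e defined earlier
evecs <- eigen(C)$vectors
evals <- eigen(C)$values
# Exact comparison with == fails because of floating-point rounding
C %*% evecs[, 1] == evals[1] * evecs[, 1]
##       [,1]
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
# all.equal() compares within a numerical tolerance
all.equal(as.vector(C %*% evecs[, 1]), evals[1] * evecs[, 1])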
## [1] TRUE
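
Besides inverting a matrix, solve() can also solve a linear system directly. The sketch below uses a small made-up system (M and b are illustrative, not objects from this guide) and checks the result with a tolerance.

```r
# Coefficient matrix and right-hand side of the system M x = b
M <- matrix(c(2, 1,
              1, 3), ncol = 2, byrow = TRUE)
b <- c(3, 5)

# solve(M, b) solves the system directly
x <- solve(M, b)

# Same answer through the explicit inverse
x.inv <- solve(M) %*% b

# Compare with a tolerance rather than ==
all.equal(as.vector(M %*% x), b)       # TRUE
all.equal(as.vector(x.inv), x)         # TRUE
```

Calling solve(M, b) is generally preferable to computing the inverse first, since it avoids an extra matrix product.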

Data Frames

Data frames have a more heterogeneous structure than matrices. While all elements of a vector or matrix share a single type (typeof), each column of a data frame can have a different data type.

## Basic Data Frame
df <- data.frame(
  A = LETTERS[1:5],
  B = factor(letters[1:5]),
  C = 1L:5L,
  D = c(2.4, 2, 3, 9, 7)
)

# Structure
str(df)
## 'data.frame':    5 obs. of  4 variables:
##  $ A: chr  "A" "B" "C" "D" ...
##  $ B: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
##  $ C: int  1 2 3 4 5
##  $ D: num  2.4 2 3 9 7
# Basic statistics
summary(df)
##       A             B           C           D       
##  Length:5           a:1   Min.   :1   Min.   :2.00  
##  Class :character   b:1   1st Qu.:2   1st Qu.:2.40  
##  Mode  :character   c:1   Median :3   Median :3.00  
##                     d:1   Mean   :3   Mean   :4.68  
##                     e:1   3rd Qu.:4   3rd Qu.:7.00  
##                           Max.   :5   Max.   :9.00
# Print head
head(df, 3)
##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
# Print tail
tail(df)
##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
## 4 D d 4 9.0
## 5 E e 5 7.0
## Bipartite Projection
bp <- data.frame(papers = c(rep('A', 3), rep('B', 2), 'C'), authors = c(1:3, 2:3, 4))
bp
##   papers authors
## 1      A       1
## 2      A       2
## 3      A       3
## 4      B       2
## 5      B       3
## 6      C       4
# Incidence Matrix
py <- table(bp)
py
##       authors
## papers 1 2 3 4
##      A 1 1 1 0
##      B 0 1 1 0
##      C 0 0 0 1
# Adjacency Matrix
py <- crossprod(py)
py
##        authors
## authors 1 2 3 4
##       1 1 1 1 0
##       2 1 2 2 0
##       3 1 2 2 0
##       4 0 0 0 1
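
The projection step can be sketched on its own as follows. The matrix inc reproduces the papers-by-authors incidence table by hand, and the object names are illustrative.

```r
# Papers (rows) by authors (columns); 1 means the author signed the paper
inc <- matrix(c(1, 1, 1, 0,
                0, 1, 1, 0,
                0, 0, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(papers = c("A", "B", "C"),
                              authors = as.character(1:4)))

# Author-by-author projection: t(inc) %*% inc
authors.proj <- crossprod(inc)

# Paper-by-paper projection: inc %*% t(inc)
papers.proj <- tcrossprod(inc)

# The diagonal counts papers per author; set it to zero
# to keep only ties between distinct authors
diag(authors.proj) <- 0
authors.proj
```

Off-diagonal entries count shared papers (or shared authors), so authors 2 and 3, who wrote two papers together, get a tie of weight 2.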

Functions

Functions are invaluable when we need to perform the same operation(s) multiple times. Let’s create a simple function to calculate the degree from an adjacency matrix:

n <- 5
A <- matrix(sample(0:1, n * n, replace = TRUE), ncol = n)
rownames(A) <- LETTERS[1:n]
colnames(A) <- LETTERS[1:n]

# Remove loops
diag(A) <- 0

s.degree <- function(x) {
  n <- ncol(x)
  d <- x %*% rep(1, n)
  colnames(d) <- 'Degree'
  d
}

s.degree(A)
##   Degree
## A      3
## B      2
## C      0
## D      1
## E      1
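
For directed networks it can help to separate incoming from outgoing ties. The function below is a hypothetical variant (s.degree2 is not part of the original guide) that assumes rows are senders and columns are receivers.

```r
# Hypothetical helper: degree by direction for a 0/1 adjacency matrix,
# where rows are senders and columns are receivers
s.degree2 <- function(x, mode = c("out", "in", "total")) {
  mode <- match.arg(mode)
  switch(mode,
    out   = rowSums(x),
    `in`  = colSums(x),
    total = rowSums(x) + colSums(x)
  )
}

adj <- matrix(c(0, 1, 1,
                0, 0, 1,
                0, 0, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(LETTERS[1:3], LETTERS[1:3]))

s.degree2(adj)            # out-degree: A = 2, B = 1, C = 0
s.degree2(adj, "in")      # in-degree:  A = 0, B = 1, C = 2
s.degree2(adj, "total")   # A = 2, B = 2, C = 2
```

Using match.arg() restricts the mode argument to the listed choices and makes "out" the default.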

Lists

Lists are the most flexible data structure in R, allowing us to store multiple objects of different classes. A data frame is a list with a specific structure. We can use the dput function to print and store the structure of any object, which helps in creating reproducible examples.

# Print the structure of the data frame
dput(df)
## structure(list(A = c("A", "B", "C", "D", "E"), B = structure(1:5, levels = c("a", 
## "b", "c", "d", "e"), class = "factor"), C = 1:5, D = c(2.4, 2, 
## 3, 9, 7)), class = "data.frame", row.names = c(NA, -5L))
# Store a factor, matrix, data frame, and a list together
s <- list(c(1:3))
l <- list(
  factor = f,
  matrix = A,
  data.frame = df,
  list = s
)
str(l)
## List of 4
##  $ factor    : Factor w/ 2 levels "male","female": NA NA
##  $ matrix    : num [1:5, 1:5] 0 1 0 0 1 1 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "A" "B" "C" "D" ...
##   .. ..$ : chr [1:5] "A" "B" "C" "D" ...
##  $ data.frame:'data.frame':  5 obs. of  4 variables:
##   ..$ A: chr [1:5] "A" "B" "C" "D" ...
##   ..$ B: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
##   ..$ C: int [1:5] 1 2 3 4 5
##   ..$ D: num [1:5] 2.4 2 3 9 7
##  $ list      :List of 1
##   ..$ : int [1:3] 1 2 3

Indexing Objects

Subsetting in R can be done using nominal, numeric, or logical indexing. Data frames and lists use the special operator $ for subsetting.

### Nominal ####
## Vectors ##
names(a)
## [1] "a" "b"
a['a']
##   a 
## 2.3
a['b']
## b 
## 4
## Matrices ##
A[c('A', 'C'), c('D', 'E')]
##   D E
## A 1 1
## C 0 0
## Data Frames ##
df[c('A', 'D')]
##   A   D
## 1 A 2.4
## 2 B 2.0
## 3 C 3.0
## 4 D 9.0
## 5 E 7.0
## Lists ##
l[c('factor', 'matrix')]
## $factor
## [1] <NA> <NA>
## Levels: male female
## 
## $matrix
##   A B C D E
## A 0 1 0 1 1
## B 1 0 0 1 0
## C 0 0 0 0 0
## D 0 0 1 0 0
## E 1 0 0 0 0
### Numeric ####

## Vectors ##
a[1]
##   a 
## 2.3
## Matrices ##
A[2:3, 4]
## B C 
## 1 0
## Data Frames ##
df[1:5, 2:3]
##   B C
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
## 5 e 5
## Lists ##
# Extract the data frame by its position in the list
l[which(sapply(l, is.data.frame))]
## $data.frame
##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
## 4 D d 4 9.0
## 5 E e 5 7.0
### Logical ####

## Vectors ##
a[c(TRUE, FALSE)]
##   a 
## 2.3
## Matrices ##
A[upper.tri(A)]
##  [1] 1 0 0 1 1 0 1 0 0 0
## Data Frames ##
df[, c(rep(FALSE, 3), TRUE)]
## [1] 2.4 2.0 3.0 9.0 7.0
## Lists ##
# Extract column C of the data frame stored in the list
l$data.frame$C
## [1] 1 2 3 4 5
l$matrix[, 4]
## A B C D E 
## 1 1 0 0 0
### Combinations ###
A[2:3, c('C', 'D')]
##   C D
## B 0 1
## C 0 0
### Special Operator ####
# Subset a Column
df$A
## [1] "A" "B" "C" "D" "E"
# Subset the data frame in a list and print column C
l$data.frame$C
## [1] 1 2 3 4 5
l$matrix[, 4]
## A B C D E 
## 1 1 0 0 0
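
A common source of confusion is the difference between [, [[, and $. The sketch below uses small illustrative objects (lst and dat) to show what each form returns.

```r
lst <- list(num = 1:3, txt = c("x", "y"))

# Single brackets keep the container: the result is still a list
class(lst["num"])      # "list"

# Double brackets (and $) return the element itself
class(lst[["num"]])    # "integer"
identical(lst[["num"]], lst$num)   # TRUE

# On data frames and matrices, drop = FALSE keeps the two-dimensional shape
dat <- data.frame(a = 1:3, b = letters[1:3])
class(dat[, "a"])                  # "integer" (dropped to a vector)
class(dat[, "a", drop = FALSE])    # "data.frame"

# Logical conditions select the rows that satisfy them
dat[dat$a > 1, ]
```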

Control Flow

Control flow structures like if, else, and ifelse are essential for making decisions and executing code conditionally in R.

### Basic Structure if|else ####
condition <- 7
if (condition == 7) {
  print('Yes, it is...')
}
## [1] "Yes, it is..."
# Check whether every vertex has at least one tie.
# Note: this is a necessary but not sufficient condition for connectedness.
is.connected <- function(am) {
  d <- s.degree(am)
  if (all(d > 0)) {
    print('Graph is connected')
  } else {
    print('Graph is disconnected')
  }
}

py <- table(bp)
py
##       authors
## papers 1 2 3 4
##      A 1 1 1 0
##      B 0 1 1 0
##      C 0 0 0 1
is.connected(py)
## [1] "Graph is connected"
# Evaluate multiple conditions (and, or)
is.sim_multi <- function(am) {
  mult.ed <- any(am > 1)
  loops <- sum(diag(am)) != 0
  type <- c('The graph has:', 'multi edges', 'and loops.')
  if (mult.ed | loops) {
    am[am > 1] <- 1
    diag(am) <- 0
    print(paste(type[c(TRUE, mult.ed, loops)], collapse = " "))
  } else {
    print("The graph is simple")
  }
}

is.sim_multi(B)
## [1] "The graph has: multi edges and loops."
is.sim_multi(A)
## [1] "The graph is simple"
# Count the number of edges or vertices
no.ver.edges <- function(am) {
  v <- ncol(am)
  e <- sum(am > 0)
  if (e > v) {
    print(paste('Edges:', e))
  } else if (e < v) {
    print(paste('Vertices:', v))
  } else {
    print(paste('Vertices and Edges:', v))
  }
}

no.ver.edges(A)
## [1] "Edges: 7"
no.ver.edges(B)
## [1] "Edges: 7"
#### ifelse function ####
# ifelse is vectorized: it tests each element of its first argument
# and returns an output of the same length as that argument.

ifelse(4 > 7, "YES", "NO")
## [1] "NO"
ifelse(7 > 4, "YES", "NO")
## [1] "YES"
# Nested ifelse
# A square matrix is symmetric when it equals its transpose
is.sym <- function(am) {
  ifelse(ncol(am) != nrow(am),
    'Not symmetric',
    ifelse(all(am == t(am)), 'Symmetric', 'Squared'))
}

A
##   A B C D E
## A 0 1 0 1 1
## B 1 0 0 1 0
## C 0 0 0 0 0
## D 0 0 1 0 0
## E 1 0 0 0 0
is.sym(A)
## [1] "Squared"
B[3, 2] <- 1
B
##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    1    2    1
## [3,]    0    1    2
is.sym(B)
## [1] "Symmetric"
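
The degree-based check above only confirms that no vertex is isolated. A stricter test combines a while loop with if-style conditions in a breadth-first search. The function below is a sketch under the assumption that ties are undirected; is.connected.bfs is a hypothetical name, not part of the original guide.

```r
# Hypothetical helper: breadth-first search over an adjacency matrix,
# treating every tie as undirected
is.connected.bfs <- function(am) {
  n <- nrow(am)
  if (n == 0) return(TRUE)
  am <- (am + t(am)) > 0          # logical, symmetric adjacency
  visited <- rep(FALSE, n)
  visited[1] <- TRUE
  queue <- 1                      # start from the first vertex
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    nb <- which(am[v, ] & !visited)   # unvisited neighbours of v
    visited[nb] <- TRUE
    queue <- c(queue, nb)
  }
  all(visited)
}

# A path 1 - 2 - 3 is connected
path3 <- matrix(c(0, 1, 0,
                  1, 0, 1,
                  0, 1, 0), ncol = 3, byrow = TRUE)

# Two separate pairs (1 - 2 and 3 - 4): every degree is positive,
# yet the graph is disconnected
pairs4 <- matrix(0, ncol = 4, nrow = 4)
pairs4[1, 2] <- pairs4[2, 1] <- 1
pairs4[3, 4] <- pairs4[4, 3] <- 1

is.connected.bfs(path3)    # TRUE
is.connected.bfs(pairs4)   # FALSE
```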

Loops

Loops are used for repetitive tasks, but it’s essential to use them judiciously as they can be inefficient. Here, we cover while and for loops:

### while loop ####
fibo <- c(1, 2)
digi <- length(fibo)

# Extend a Fibonacci sequence while its recorded length is below 4
while (digi < 4) {
  digi <- length(fibo)
  fibo[digi + 1] <- sum(fibo[digi - 1], fibo[digi])
  print(paste("Fibonacci Seq:", fibo[digi]))
}
## [1] "Fibonacci Seq: 2"
## [1] "Fibonacci Seq: 3"
## [1] "Fibonacci Seq: 5"
### for loop ####
# Get adjacent vertices (neighborhood)
i <- 1
nams <- row.names(A)

for (i in 1:nrow(A)) {
  print(nams[A[i, ] > 0])
}
## [1] "B" "D" "E"
## [1] "A" "D"
## character(0)
## [1] "C"
## [1] "A"
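
One detail to keep in mind when looping with 1:nrow(x) is what happens on empty input. The sketch below contrasts it with seq_len(), and also shows a loop-free way to collect neighbours; the matrix adj is a small illustrative example, not the A used above.

```r
empty <- matrix(0, nrow = 0, ncol = 3)

# 1:nrow(x) counts down when there are no rows
1:nrow(empty)           # 1 0
seq_len(nrow(empty))    # integer(0)

# A loop over seq_len() therefore runs zero times on empty input
runs <- 0
for (i in seq_len(nrow(empty))) runs <- runs + 1
runs                    # 0

# Neighbourhoods without an explicit for loop
adj <- matrix(c(0, 1, 1,
                1, 0, 0,
                0, 0, 0), ncol = 3, byrow = TRUE,
              dimnames = list(LETTERS[1:3], LETTERS[1:3]))
nbrs <- lapply(seq_len(nrow(adj)), function(i) colnames(adj)[adj[i, ] > 0])
names(nbrs) <- rownames(adj)
nbrs
```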

Apply Family of Functions

The apply function in R takes an array (such as a matrix) as its first argument and applies a function over one of its margins: 1 for rows, 2 for columns. Let’s explore some examples:

# Example of summing all the columns of a matrix
ma <- matrix(sample(1:100, 25), ncol = 5, nrow = 5)

# Using a loop (the index is named j to avoid shadowing the c() function)
col.cum <- vector('numeric', length = 0)
for (j in 1:ncol(ma)) {
  col.cum <- c(col.cum, sum(ma[, j]))
}

# Using the apply function
apply(ma, 2, sum) == col.cum
## [1] TRUE TRUE TRUE TRUE TRUE

In this example, we create a matrix ma and calculate the sum of each column using both a loop and the apply function. The apply function provides a more concise way to perform this operation.

# Example of summing each row of a matrix

# Using a loop
row.cum <- vector('numeric', length = 0)
for (r in 1:nrow(ma)) {
  row.cum <- c(row.cum, sum(ma[r, ]))
}

# Using the apply function
apply(ma, 1, sum) == row.cum
## [1] TRUE TRUE TRUE TRUE TRUE
# Using rowSums (for simple operations, dedicated vectorized functions are preferable)
apply(ma, 1, sum) == row.cum & rowSums(ma) == row.cum
## [1] TRUE TRUE TRUE TRUE TRUE

In this section, we demonstrate how to sum each row of a matrix, first using a loop and then using the apply function. Additionally, we show how to achieve the same result with the dedicated rowSums function, which is more efficient.
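
To make the role of the margin argument concrete, here is a small deterministic sketch. The matrix m is illustrative and unrelated to the random ma above.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled by column

apply(m, 1, sum)     # MARGIN = 1: one result per row    -> 9 12
apply(m, 2, sum)     # MARGIN = 2: one result per column -> 3 7 11

# Dedicated helpers give the same sums
rowSums(m)           # 9 12
colSums(m)           # 3 7 11

# Extra arguments are passed on to the function
m.na <- m
m.na[1, 1] <- NA
apply(m.na, 2, sum, na.rm = TRUE)   # 2 7 11

# Any function of a vector works, including anonymous ones
apply(m, 1, function(r) max(r) - min(r))   # 4 4
```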

# Example: Count how many times each letter (A-J) appears in each column

ma <- matrix(replicate(5, sample(LETTERS[1:10], 5)), ncol = 5, nrow = 5, byrow = TRUE)
lvls <- unique(c(ma))
apply(ma, 2, function(x) {
  table(factor(x, levels = lvls))
})
##   [,1] [,2] [,3] [,4] [,5]
## J    1    1    0    0    0
## F    2    1    0    0    0
## G    1    1    0    0    0
## I    1    0    1    0    2
## D    0    1    0    1    0
## A    0    1    1    0    0
## E    0    0    1    1    0
## C    0    0    2    0    2
## B    0    0    0    1    1
## H    0    0    0    2    0

In this example, we create a matrix of random characters and count how many times each character appears in each column using the apply function.

lapply Function

The lapply function in R takes a list as its first argument, applies a function to each element, and returns a list of the same length. Because it works on lists, it is more flexible than apply, which expects an array, and often leads to more readable code.

### lapply Examples ###

# Heterogeneous list example
lapply(list(data.frame(1:10), 20:30), sum)
## [[1]]
## [1] 55
## 
## [[2]]
## [1] 275
# Homogeneous list example
lapply(list(A, B, C, D), s.degree)
## [[1]]
##   Degree
## A      3
## B      2
## C      0
## D      1
## E      1
## 
## [[2]]
##      Degree
## [1,]      3
## [2,]      4
## [3,]      3
## 
## [[3]]
##      Degree
## [1,]    187
## [2,]    139
## [3,]    128
## 
## [[4]]
##           Degree
## [1,]  0.04585366
## [2,] -0.07556911
## [3,]  0.06121951
# Since a data.frame is a list, we can apply functions directly
# Check the list of sample data.frames ?data
# Load data
data(attitude)

lapply(attitude, function(x) {
  c(
    mean = mean(x),
    var = var(x),
    min = min(x),
    max = max(x),
    median = median(x)
  )
})
## $rating
##      mean       var       min       max    median 
##  64.63333 148.17126  40.00000  85.00000  65.50000 
## 
## $complaints
##     mean      var      min      max   median 
##  66.6000 177.2828  37.0000  90.0000  65.0000 
## 
## $privileges
##      mean       var       min       max    median 
##  53.13333 149.70575  30.00000  83.00000  51.50000 
## 
## $learning
##      mean       var       min       max    median 
##  56.36667 137.75747  34.00000  75.00000  56.50000 
## 
## $raises
##      mean       var       min       max    median 
##  64.63333 108.10230  43.00000  88.00000  63.50000 
## 
## $critical
##     mean      var      min      max   median 
## 74.76667 97.90920 49.00000 92.00000 77.50000 
## 
## $advance
##      mean       var       min       max    median 
##  42.93333 105.85747  25.00000  72.00000  41.00000

Here, we showcase various uses of lapply. It can be applied to both heterogeneous and homogeneous lists. When working with data frames, you can directly apply functions to columns, which can lead to more readable code.

# Similar to summary(attitude)

# Try to arrange the structure for better readability (not always successful)
t <- sapply(attitude, function(x) {
  c(
    mean = mean(x),
    var = var(x),
    min = min(x),
    max = max(x),
    median = median(x)
  )
})
class(t)
## [1] "matrix" "array"

In this section, we demonstrate a similar approach to the summary(attitude) function using sapply to provide a structured summary of the data.
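
Because sapply decides the shape of its result at run time, a stricter alternative is vapply, which checks each result against a declared template. The sketch below compares the three functions on the attitude data; the res.* names are illustrative.

```r
data(attitude)

# lapply always returns a list
res.list <- lapply(attitude, mean)
class(res.list)    # "list"

# sapply simplifies when it can: here, to a named numeric vector
res.vec <- sapply(attitude, mean)

# vapply does the same but checks that each result has the declared shape
res.safe <- vapply(attitude, mean, FUN.VALUE = numeric(1))

# With several values per column the result is a matrix,
# one column per variable
res.mat <- vapply(attitude, function(x) c(mean = mean(x), sd = sd(x)),
                  FUN.VALUE = numeric(2))
dim(res.mat)       # 2 7
```

If a function unexpectedly returned a different length or type, vapply would stop with an error instead of silently changing the output structure.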

Graphics in R

R offers a robust environment for creating graphics, making it a powerful tool for both statistical analysis and data visualization. To explore its capabilities, you can start by running demo(graphics) in the R console. Additionally, you can refer to this cheatsheet for an overview of the main plotting functions.

# demo(graphics)

Let’s delve into various aspects of graphics in R.

Color Management

Managing color spaces in R is essential for creating visually appealing graphics. Colors can be defined in three different ways: by name, by hexadecimal values, or by RGB values. You can explore a wide range of colors and conversions between these systems on this website.

For this tutorial, I define a palette of high-contrast colors in the following vector:

colors37 = c("#466791","#60bf37","#953ada","#4fbe6c","#ce49d3","#a7b43d","#5a51dc","#d49f36","#552095","#507f2d","#db37aa","#84b67c","#a06fda","#df462a","#5b83db","#c76c2d","#4f49a3","#82702d","#dd6bbb","#334c22","#d83979","#55baad","#dc4555","#62aad3","#8c3025","#417d61","#862977","#bba672","#403367","#da8a6d","#a79cd4","#71482c","#c689d0","#6b2940","#d593a7","#895c8b","#bd5975")

The following code snippet makes the contrast in the palette clearly visible:

# Example of hexadecimal format
# print(head(colors37))
# Example of RGB 
# rgb(red=1, green=0.05, blue=0.02, alpha=.2)
# Examples colors by name
# head(colors())

# Plot using a color space by name
plot(
  # Values on the x-axis
  x = 2:10,
  # Values on the y-axis
  y = 9:1,
  # Shape of the point
  pch = 19,
  # Size of the point
  cex = 2,
  # Color by name
  col = "dark red",
  # Axis labels
  xlab = "",
  ylab = "",
  axes = FALSE,
  # Limits for x and y
  xlim = c(2, 10.05),
  ylim = c(0, 11)
)

# Try running these loops again but change the pch value to get different shapes
for (i in 2:9) {
  # Plot using the RGB color space (arguments are values between [0,1]) 
  points(x = i:10, y = 10:i, pch = 19, cex = 2, col = rgb(runif(1), runif(1), runif(1)))
  # Plot using a vector of hexadecimal values
  points(x = 2:(11 - i), y = (10 - i):1, pch = 19, cex = 2, col = sample(colors37, 1))
}

# Draw a box
box()

In this section, we explore different ways to define and use colors in your plots, including by name, RGB values, and hexadecimal values.
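
The three color systems can be converted into one another with functions from grDevices, which ships with R. The following is a brief sketch; the object names are illustrative.

```r
# Name -> RGB components (0-255)
col2rgb("darkred")

# RGB components -> hexadecimal string
hex <- rgb(255, 0, 0, maxColorValue = 255)    # "#FF0000"

# Hexadecimal -> RGB components
comp <- col2rgb("#FF0000")

# Add transparency to an existing color; the result is a
# hexadecimal string with two extra digits for the alpha channel
faded <- adjustcolor("#466791", alpha.f = 0.5)
```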

Histograms

Histograms are a fundamental tool for visualizing the distribution (spread) of data around central values. To plot a single variable, we can call hist(...) on the vector that contains the data, for instance hist(attitude[, 1L]).

### Plotting a simple histogram ####
hist(attitude[, 1L])

However, we are typically interested in visualizing a whole set of variables in a data frame, so a code snippet that plots a group of variables in a grid is more useful. The following R code sets up a 3x3 plotting layout using the par() function, then creates a histogram for each variable in the attitude data frame and arranges them within that layout.

Here’s a step-by-step breakdown:

  • par(mfrow = c(3, 3)): This sets up a plotting layout with 3 rows and 3 columns, creating a 3x3 grid for plotting.

  • var.names <- colnames(attitude): Retrieves the column names (variable names) of the attitude dataframe.

  • invisible(lapply(seq_along(var.names), function(x) {...})): Iterates over each variable in attitude using lapply() and generates histograms.

  • hist(...): Generates a histogram for each variable. The hist() function takes parameters such as the data to be plotted (attitude[, x] for each variable), the main title (var.names[x], which is the variable name), and the color of the bars (col = sample(colors37, 1)).

  • The invisible() function suppresses the printed return value of lapply() (a list of histogram objects); the plots themselves are still drawn.

### Using lapply and histograms ###

# Set up a 3 x 3 layout

par(mfrow = c(3, 3))

# render the histograms 
var.names <- colnames(attitude)
invisible(lapply(seq_along(var.names), function(x) {
  hist(
    attitude[, x],
    main = var.names[x],
    col = sample(colors37, 1),
    xlab = ""
  )
}))
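
Since par(mfrow = ...) changes the layout for all later plots on the device, it is often convenient to restore the previous settings afterwards. A minimal sketch, assuming a default single-plot layout beforehand:

```r
# par() returns the previous settings, so they can be restored later
op <- par(mfrow = c(1, 2))

hist(attitude$rating, main = "rating", xlab = "")
hist(attitude$complaints, main = "complaints", xlab = "")

# Restore the layout that was active before
par(op)
```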

Box Plots

Box plots, also known as box-and-whisker plots, are a valuable tool for visualizing the distribution and spread of data. They provide a concise summary of a dataset’s central tendency, variability, and potential outliers. Unlike some other types of plots, box plots focus on displaying the overall distribution of data rather than showing individual data points.

They are particularly useful because they show the following key statistics:

  • The median (the middle value of the dataset).
  • The interquartile range (the range between the first quartile or Q1 and the third quartile or Q3), which contains the central 50% of the data.
  • The whiskers, which extend to the most extreme values that lie within 1.5 times the interquartile range of the box (the default in boxplot()).
  • Outliers, represented as individual points outside the “whiskers” of the plot.

As in the previous chunk, I combine lapply with the box plot function (boxplot(...)) to efficiently display a plot for each variable in the attitude dataset.

par(mfrow = c(3, 3))
var.names <- colnames(attitude)
invisible(lapply(seq_along(var.names), function(x) {
  boxplot(
    attitude[, x],
    main = var.names[x],
    xlab = ""
  )
}))
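
The numbers behind a box plot can be inspected without drawing anything by calling boxplot.stats(). The sketch below uses a small made-up vector with one obvious outlier.

```r
x <- c(1, 2, 3, 4, 5, 100)

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats    # 1.0 2.0 3.5 5.0 5.0

# Points beyond 1.5 times the box length from the hinges are flagged
bs$out      # 100
```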

Scatter Plots

Scatter plots are a basic form of data visualization that help us see the relationship between two continuous variables. They differ from other plots because they allow us to examine how two variables interact, specifically whether there is a linear or nonlinear relationship between them.

The importance of scatter plots lies in their ability to reveal patterns, trends, clusters and outliers in the data. By arranging data points as individual points on a two-dimensional plane, we can visually identify relationships, associations, or the lack thereof. Scatter plots are particularly useful for the following reasons.

  • Finding linear and non-linear relationships: Scatter plots help us see whether two variables are positively or negatively linearly related, or whether the relationship is not linear.

  • Identifying outliers: Outliers, data points that deviate significantly from the overall pattern, are easily detected in scatter plots.

  • Cluster analysis: Groups of data points can indicate distinct subpopulations in the data.

  • Visualizing multivariate data: Scatter plot matrices, like the one created with pairs() in this section, allow us to visualize relationships between multiple variables simultaneously, which is important in exploratory data analysis.

Using Scatter Plots in R

pairs(attitude, main = "attitude data", panel = panel.smooth): This line generates a scatter plot matrix (a grid of scatter plots) for all the variables in the “attitude” dataset. The panel = panel.smooth argument adds a smoothed (lowess) curve to each scatter plot to help visualize trends.

  # Plot variables: Useful to detect linear or non-linear patterns.
  pairs(attitude, main = "attitude data", panel = panel.smooth)

plot(attitude$rating, attitude$complaints): This line creates a single scatter plot between the “rating” and “complaints” variables, providing a detailed view of the relationship between these two specific variables.

  # Single plot
  plot(attitude$rating, attitude$complaints)

abline(lm(rating ~ complaints, data = attitude), col = 'red'): Here, a regression line is added to the scatter plot created in the previous step. This line represents the best-fit linear relationship between “rating” and “complaints” using a linear regression model. The line is colored red for visibility.

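  # Plot the predictor (complaints) on the x-axis and the outcome (rating)
  # on the y-axis so the axes match the model rating ~ complaints
  plot(attitude$complaints, attitude$rating)
  # Draw a regression line
  abline(lm(rating ~ complaints, data = attitude), col = 'red')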
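
The fitted line can also be examined numerically. This sketch refits the same model and extracts its coefficients and predictions; the object names (fit, new.obs, pred) and the complaints scores of 50 and 80 are illustrative.

```r
data(attitude)
fit <- lm(rating ~ complaints, data = attitude)

# Intercept and slope of the fitted line
coef(fit)

# Predicted rating for hypothetical complaints scores of 50 and 80
new.obs <- data.frame(complaints = c(50, 80))
pred <- predict(fit, newdata = new.obs)

# For a single predictor, R-squared equals the squared correlation
r2 <- summary(fit)$r.squared
all.equal(r2, cor(attitude$rating, attitude$complaints)^2)   # TRUE
```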

Multiple Regression Model

Imagine you have a yummy meal, and you want to know what makes it taste so good. Is it the color, the smell, or maybe the shape? Multiple regression helps us figure out which of these things, or variables, matter most in making the meal delicious. It’s like sniffing out the best part of a recipe! The idea of regression is that a set of variables affects or predicts the behavior of an outcome variable. These explanatory variables are called determinants of the dependent variable precisely because of their power to predict the outcome.

  • Simple regression models have only one determinant, but they are rarely sufficient in practice: an outcome seldom has a single predictor.
  # Single Regression
m0 <- lm(rating ~ advance, data = attitude)
summary(m0)
## 
## Call:
## lm(formula = rating ~ advance, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.7465  -4.8749   0.5975   7.4232  18.1526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.7558     9.7428   5.825 2.93e-06 ***
## advance       0.1835     0.2209   0.831    0.413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.24 on 28 degrees of freedom
## Multiple R-squared:  0.02405,    Adjusted R-squared:  -0.0108 
## F-statistic:  0.69 on 1 and 28 DF,  p-value: 0.4132
  • Multiple regression models have two or more explanatory variables and are the most frequently used.
m1 <- lm(rating ~ ., data = attitude)
summary(m1)
## 
## Call:
## lm(formula = rating ~ ., data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9418  -4.3555   0.3158   5.5425  11.5990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.78708   11.58926   0.931 0.361634    
## complaints   0.61319    0.16098   3.809 0.000903 ***
## privileges  -0.07305    0.13572  -0.538 0.595594    
## learning     0.32033    0.16852   1.901 0.069925 .  
## raises       0.08173    0.22148   0.369 0.715480    
## critical     0.03838    0.14700   0.261 0.796334    
## advance     -0.21706    0.17821  -1.218 0.235577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.068 on 23 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.6628 
## F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05
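
One way to check whether the extra predictors in m1 add explanatory power over m0 is a nested-model F-test with anova(). The sketch below is a minimal illustration that refits both models on the built-in attitude data; the object names cmp and adj_r2 are my own.

```r
# Refit both models on the built-in attitude data
m0 <- lm(rating ~ advance, data = attitude)
m1 <- lm(rating ~ ., data = attitude)

# m0 is nested in m1, so an F-test can compare them
cmp <- anova(m0, m1)
print(cmp)

# Adjusted R-squared penalises extra predictors, so it is comparable across models
adj_r2 <- c(m0 = summary(m0)$adj.r.squared,
            m1 = summary(m1)$adj.r.squared)
print(round(adj_r2, 4))
```

A small p-value in the second row of the anova table suggests that the additional predictors jointly improve the fit.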

Section 2: An introduction to Network Analysis using igraph

Install packages

Before you start this section I recommend that you install the following packages:

  • igraph: A package for network analysis and visualization (most important).
  • tnet: A package for analyzing weighted, two-mode, and longitudinal networks.
  • data.table: A package for data manipulation and analysis.
  • qgraph: A package for creating and analyzing graphical models (e.g., network models) that we use only to improve the visualization of networks.
  • knitr: A package for dynamic report generation in R.

# packages
pks <- c('knitr', 'igraph', 'tnet', 'data.table', 'qgraph')
 
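# Load the packages, installing any that are missing
to.install <- pks[!unlist(lapply(pks, require, character.only = TRUE))]
if (length(to.install) != 0) {
  install.packages(to.install, dependencies = TRUE)
  # Load the packages that were just installed
  invisible(lapply(to.install, library, character.only = TRUE))
}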

Generate Graphs

To create networks, we have the option of utilizing both R base functions and functions within the igraph package.

Using Adjacency Matrices

One approach involves generating a network using an adjacency matrix, where the rows and columns of the matrix correspond to vertices, and the values in the matrix indicate connections between these vertices.

#### Graph from a Matrix ####
n <- 5
g <- matrix(0, ncol = n, nrow = n)
val <- round(runif(sum(upper.tri(g)), min = 0, max = 1))
g[upper.tri(g)] <- val
g <- t(g) + g

# Create a graph object
g1 <- graph_from_adjacency_matrix(g, mode = "undirected")
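
The matrix construction relies on two properties of an undirected adjacency matrix: it is symmetric, and its row sums give the degree of each vertex. A minimal base R sketch of those checks follows; the names adj, deg and n_edges and the seed are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 5

# Fill the upper triangle with random 0/1 entries, then mirror it
adj <- matrix(0, nrow = n, ncol = n)
adj[upper.tri(adj)] <- rbinom(sum(upper.tri(adj)), size = 1, prob = 0.5)
adj <- adj + t(adj)

# Undirected graphs have symmetric matrices with a zero diagonal (no loops)
isSymmetric(adj)
all(diag(adj) == 0)

# The degree of each vertex is its row sum; every edge is counted twice
deg <- rowSums(adj)
n_edges <- sum(adj[upper.tri(adj)])
sum(deg) == 2 * n_edges
```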

Using Edge lists

  • The second most common way to generate a network is from an edge list or a set of pairs that define the connection between two vertices.
#### Graph from Edgelist ####
g1 <- graph_from_edgelist(t(combn(1:n,2)))
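
An edge list and an adjacency matrix carry the same information. As a rough sketch in base R (the object names el and adj are illustrative), the pairs produced by combn can be turned into a matrix by using them as index pairs:

```r
n <- 5

# Every unordered pair of vertices: the edge list of a complete graph
el <- t(combn(1:n, 2))

# Use each row (i, j) of the edge list as a matrix index
adj <- matrix(0, nrow = n, ncol = n)
adj[el] <- 1             # mark i -> j
adj[el[, 2:1]] <- 1      # mark j -> i, to make the matrix symmetric

nrow(el)                 # number of edges
adj
```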

Using formulas

  • The third way to create a graph is by using specific formulas with the graph_from_literal function. This function enables us to create networks based on formulaic representations. Essentially, we specify the desired network structure using a compact formula notation within this function.

This is the general notation of the function:

  • -+: Represents a directed edge between two vertices. For example, A -+ B indicates a directed edge from vertex A to vertex B.

  • --: Represents an undirected edge between two vertices. For example, A -- B indicates an undirected edge between vertex A and vertex B.

  • ++: Represents a directed edge with an arrowhead at both ends, implying a bidirectional connection between two vertices. For example, A ++ B signifies bidirectional directed edges between vertex A and vertex B.

  • :: Represents a vertex set. An edge operator next to a set connects every vertex in the set. For example, A - B:C creates the edges A - B and A - C.

These notations allow you to define various types of connections and structures within your network using concise and expressive formulas.

#### Graph from Formula ####

#Undirected
par(mfrow = c(1, 4))
g1 <- graph_from_literal( A-B-C )

#Directed
g2 <- graph_from_literal( A -+ B ++ C )

#Undirected grouping
g3 <- graph_from_literal( A-B:C )

#Directed grouping
g4 <- graph_from_literal( A-+B:C )

#Plot the graphs
invisible(lapply(list(g1, g2, g3, g4), plot, vertex.size = 25, edge.arrow.size = .5))

Using igraph-functions

The igraph package provides several functions to generate networks using various algorithms and models. Here are some of the commonly used functions for network generation:

Erdős-Rényi Model (erdos.renyi.game): This function generates random networks following the Erdős-Rényi model. In this model, you specify the number of vertices (n) and the probability (p) of forming an edge between any pair of vertices. The default type is set to gnp for the probability model. This model is useful for creating networks where edges exist between pairs of vertices independently with a fixed probability.

Watts-Strogatz Model (watts.strogatz.game): The Watts-Strogatz model creates small-world networks with a combination of regularity and randomness. It starts with a regular lattice where each vertex is connected to the vertices within its (nei) nearest steps; in one dimension this is a ring. Then, edges are rewired with probability p to introduce randomness. You can specify the dimension of the lattice (dim), the number of vertices along each dimension (size), the neighborhood within which vertices are connected (nei), and the rewiring probability (p) as parameters. This model helps generate networks that exhibit small-world properties.

Barabási-Albert Model (barabasi.game): This function generates networks using the Barabási-Albert model. By default, it creates a directed network with n vertices (set directed = FALSE for an undirected one) and attaches each new vertex to m existing vertices with preferential attachment. When m is not given, each new vertex connects to a single existing vertex. This model results in scale-free networks with a few highly connected nodes, which is a common property in many real-world networks.

Forest Fire Model (forest.fire.game): The forest fire model simulates the growth of networks using the forest fire algorithm. Each new vertex picks one or more existing "ambassador" vertices (ambs), links to them, and then "burns" through their neighbors, linking to the vertices it reaches. The fw.prob parameter is the forward burning probability, and bw.factor scales it to give the backward burning probability. This model is suitable for generating networks with a specified number of vertices while considering network growth dynamics.

# Set the number of vertices for all networks
n <- 25

# Generate networks using different models
g1 <- erdos.renyi.game(n, p = 0.2)
# size is the number of vertices in the ring; nei is the neighborhood size
g2 <- watts.strogatz.game(dim = 1, size = n, nei = 4, p = 0.2)
g3 <- barabasi.game(n, m = 1)
g4 <- forest.fire.game(n, fw.prob = 0.2, bw.factor = 2)
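
To make the G(n, p) idea concrete, here is a base R sketch of what a generator of this kind does: each of the n(n-1)/2 possible edges is drawn independently with probability p. This illustrates the model and is not igraph's implementation; the function name gnp_matrix is hypothetical.

```r
# Hypothetical helper: adjacency matrix of a G(n, p) random graph
gnp_matrix <- function(n, p) {
  adj <- matrix(0, nrow = n, ncol = n)
  # One independent Bernoulli(p) draw per possible edge
  adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, size = 1, prob = p)
  adj + t(adj)
}

set.seed(123)  # arbitrary seed for reproducibility
adj <- gnp_matrix(25, 0.2)

# Observed number of edges versus the expected value p * n * (n - 1) / 2
observed <- sum(adj) / 2
expected <- 0.2 * 25 * 24 / 2
c(observed = observed, expected = expected)
```

The observed count varies from draw to draw around the expected value, which is the behaviour the probability model describes.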

This is the full list of functions available in igraph to generate networks:

games <- grep("^.*game", ls("package:igraph"), value = TRUE)[-1]
games
##  [1] "aging.barabasi.game"         "aging.prefatt.game"         
##  [3] "asymmetric.preference.game"  "ba.game"                    
##  [5] "barabasi.game"               "bipartite.random.game"      
##  [7] "callaway.traits.game"        "cited.type.game"            
##  [9] "citing.cited.type.game"      "degree.sequence.game"       
## [11] "erdos.renyi.game"            "establishment.game"         
## [13] "forest.fire.game"            "grg.game"                   
## [15] "growing.random.game"         "hrg.game"                   
## [17] "interconnected.islands.game" "k.regular.game"             
## [19] "lastcit.game"                "preference.game"            
## [21] "random.graph.game"           "sbm.game"                   
## [23] "static.fitness.game"         "static.power.law.game"      
## [25] "watts.strogatz.game"

Visualizing Networks with igraph

To create insightful network visualizations using the igraph package in R, we’ll begin with plotting a simple network. Later on, we’ll explore customization options and demonstrate how to visualize multiple networks side by side.

Plotting a Simple Network

In this initial plot, we have our network displayed.

plot(g1)

However, to improve the visualization, we can pass additional arguments to the plot function:

  • vertex.size: This attribute allows you to adjust the size of the nodes (vertices) in your network.
  • edge.arrow.size: If your network contains directed edges, you can modify the arrow size using this attribute.
  • vertex.color: Sets the color of nodes (here, light blue).
  • edge.color: Defines the color of edges (here, gray).
  • vertex.label: Sets the node labels; setting it to NA removes them for a cleaner visualization.
  • layout: Specifies the layout algorithm; we used the Fruchterman-Reingold layout here.

Customizing network attributes:

# Visualizing multiple networks in a grid
par(mfrow = c(2, 2))
networks <- list(g1, g2, g3, g4)
invisible(lapply(networks, plot, 
                vertex.size = 25, 
                edge.arrow.size = 0.5,
                vertex.color = "lightblue", 
                edge.color = "gray",
                vertex.label = NA,
                layout = layout.fruchterman.reingold))

Layouts of igraph

The igraph package provides different algorithms, called layouts, which help us visualize and highlight network patterns, degree distributions, and the spatial arrangement of vertices within a network.

# Generate a graph
n <- 15
g1 <- barabasi.game(n, directed = F)

# Explore the complete list of layouts.
# layouts <- grep("^layout_", ls("package:igraph"), value = TRUE)[-1]
# layouts <- layouts[!grepl("bipartite|sugiyama", layouts)]
# dput(layouts)

layouts <- c("layout_as_star", "layout_as_tree", "layout_components", "layout_in_circle", 
"layout_nicely", "layout_on_grid", "layout_on_sphere", "layout_randomly", 
"layout_with_dh", "layout_with_drl", "layout_with_fr", "layout_with_gem", 
"layout_with_graphopt", "layout_with_kk", "layout_with_lgl", 
"layout_with_mds")

par(mfrow = c(2, 2))
invisible(lapply(layouts, function(x) {
  plot(g1,
       vertex.size = 30,
       layout = get(x),
       xlab = x)
}))

Setting shape of vertices

Vertex shapes in a network graph represent the graphical symbols used to depict individual vertices or nodes. They are a visual attribute that allows you to distinguish between nodes based on specific characteristics or groupings. Vertex shapes are useful in network visualization as they help convey additional information beyond just the connections between nodes.

To illustrate the usefulness of vertex shapes, we are creating an example where we visualize different vertex attributes. In this particular case, we’re generating a random variable called Age from random draws of a normal distribution, with a mean of 30 and a standard deviation of 5, and dividing the vertices into three distinct groups based on quantiles of this variable. Each group will be assigned a different vertex shape, making it clear which nodes belong to which category. This approach enhances the interpretability of the network by allowing you to visually identify nodes with similar attributes or characteristics.

## All shape forms
shapes <- c(
  'circle',
  'square',
  'csquare',
  'rectangle',
  'crectangle',
  'vrectangle',
  'pie',
  'raster',
  'sphere'
)
V(g1)$Age <- rnorm(n, 30, 5)
q <- quantile(V(g1)$Age, c(0, .33, .66, 1))
ind <- cut(V(g1)$Age, q, include.lowest = T, labels = F)
shape <- ifelse(ind==1, 'csquare', ifelse(ind==2, 'circle', 'sphere'))

plot(
  g1,
  vertex.size = 15,
  edge.arrow.size = .3,
  vertex.shape = shape,
  layout = layout_nicely
)
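
The grouping step can be checked on its own. The sketch below, which uses an arbitrary seed and illustrative object names, shows how quantile and cut split a numeric attribute into three groups, and how indexing a vector of shape names can stand in for the nested ifelse:

```r
set.seed(42)  # arbitrary seed for reproducibility
n <- 15
age <- rnorm(n, mean = 30, sd = 5)

# Break points at the minimum, the 33% and 66% quantiles, and the maximum
breaks <- quantile(age, c(0, .33, .66, 1))

# include.lowest = TRUE keeps the minimum value inside the first interval
group <- cut(age, breaks, include.lowest = TRUE, labels = FALSE)

# Map group 1, 2, 3 to a shape by position
shape <- c("csquare", "circle", "sphere")[group]
table(group, shape)
```

Without include.lowest = TRUE, the vertex with the smallest age would fall outside every interval and receive NA.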

Setting colours of vertices and adding legends to plots

Generate a vector of attributes by sampling with replacement from a set Gender = {female, male}, and set a new attribute to the graph called gender. Also, add legends using the legend function and pass the desired arguments.

V(g1)$gender <- sample(c("male", "female"), n, replace = T)

plot(
  g1,
  vertex.size = 15,
  vertex.color = ifelse(V(g1)$gender == "male", "light blue", "pink"),
  edge.arrow.size = .3,
  vertex.shape = shape,
  layout = layout_nicely
)

legend(
  # Position
  x = -1.5,
  y = -1.1,
  # Legends
  c("male", "female"),
  # Mark type (circle)
  pch = 21,
  col = 1,
  pt.bg = c("light blue", "pink"),
  pt.cex = 2,
  cex = .8,
  bty = "n",
  ncol = 1
)

Setting colours to groups of vertices

We can emphasize groups of vertices in the graph. Find the vertex with the highest degree centrality and mark the adjacent vertices in a group.

class <- adjacent_vertices(g1, which.max(degree(g1)), mode = c("all"))
plot(
  g1,
  vertex.size = 15,
  vertex.color = ifelse(V(g1)$gender == "male", "light blue", "pink"),
  edge.arrow.size = .3,
  vertex.shape = shape,
  layout = layout_nicely,
  mark.groups = class,
  mark.col = rainbow(length(class))
)

Set colours and thickness to edges

g1 <- graph.data.frame(data.frame(
  from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F'),
  to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E', 'F', 'G')
), directed = F)

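# Find the triangles (3-cliques) and collect the edges of each one
cl <- cliques(g1, min = 3, max = 3)
c.ed <- lapply(cl, function(x)
  E(g1, path = c(x, x[1])))
# Plot: edge width follows edge betweenness; triangle edges are drawn in red
plot(g1,
  layout = layout.star,
  edge.width = edge.betweenness(g1),
  edge.color = ifelse(E(g1) %in% unlist(c.ed), "red", "gray"))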

Subgraphs

Find the vertices adjacent to the vertex A and then plot a subgraph of the neighborhood.

g1 <- graph.data.frame(data.frame(
  from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F'),
  to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E', 'F', 'G')
), directed = F)

# Find the vertices adjacent to A
v <- 'A'
neig.a <- adjacent_vertices(g1, v, mode = c("all"))

# Subgraph the neighborhood of A
g2 <- induced.subgraph(g1, c(v, names(neig.a[[1]])))

# Plot the Graphs
par(mfrow = c(1, 2))
plot(g1)
plot(g2)
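
An induced subgraph keeps the chosen vertices and every edge among them. A base R sketch of the same operation on the adjacency matrix, with illustrative object names, looks like this:

```r
from <- c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F')
to   <- c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E', 'F', 'G')

# Symmetric adjacency matrix with vertex names as dimnames
v <- sort(unique(c(from, to)))
adj <- matrix(0, length(v), length(v), dimnames = list(v, v))
adj[cbind(from, to)] <- 1
adj[cbind(to, from)] <- 1

# Neighbors of A are the columns with a 1 in row A
neig_a <- names(which(adj["A", ] == 1))

# Induced subgraph: subset rows and columns to A and its neighbors
keep <- c("A", neig_a)
sub <- adj[keep, keep]
sub
```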

Network Statistics

Local Clustering

Local clustering of a vertex $v$ is the ratio of the number of 3-cliques, or triangles, that include $v$ to the number of connected triplets centered on $v$, that is, triplets in which two edges are incident to $v$. For instance, for vertex $A$, the number of triangles is the cardinality of the set $\Delta_{A}=\{(A,B,C), (A,C,D), (A,D,E), (A,B,E)\}$. Similarly, the number of connected triplets is the cardinality of the set $T=\{(A,B,C), (A,C,D), (A,D,E), (A,B,E), (C,A,E), (B,A,D)\}$. The local clustering coefficient is $C_{A}=\frac{|\Delta_{A}|}{|T|}=\frac{4}{6}=\frac{2}{3}$. Local clustering is undefined (NaN) for vertices with fewer than two neighbors, and it is zero for the remaining vertices of triangle-free topologies such as stars, trees and lattices.

 g <-
    graph.data.frame(data.frame(
    from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'),
    to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')
    ), directed = F)

    #Graph
    plot(g, layout = layout.star)

    #Transitivity of A
    transitivity(g, v = "A", "local")
## [1] 0.6666667
    transitivity(graph.lattice(5), "local")
## [1] NaN   0   0   0 NaN
    transitivity(graph.star(5, mode = "undirected"), "local")
## [1]   0 NaN NaN NaN NaN
    transitivity(graph.tree(5, mode = "undirected"), "local")
## [1]   0   0 NaN NaN NaN
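
The ratio can also be computed by hand from the adjacency matrix. In the base R sketch below (object names are illustrative), the diagonal of the cubed adjacency matrix counts closed walks of length three, which is twice the number of triangles through each vertex, and the number of triplets centered on a vertex of degree k is k(k-1)/2:

```r
from <- c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D')
to   <- c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')

# Symmetric adjacency matrix with vertex names as dimnames
v <- sort(unique(c(from, to)))
adj <- matrix(0, length(v), length(v), dimnames = list(v, v))
adj[cbind(from, to)] <- 1
adj[cbind(to, from)] <- 1

# Triangles through each vertex: closed walks of length 3, each counted twice
triangles <- diag(adj %*% adj %*% adj) / 2

# Connected triplets centered on each vertex: pairs of neighbors
k <- rowSums(adj)
triplets <- k * (k - 1) / 2

local_cc <- triangles / triplets
local_cc
```

For vertex A this gives 4 triangles over 6 triplets, matching the value returned by transitivity.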

Degree and Strength

Add two new edges to the graph, from $A$ to $B$ and from $A$ to $A$, and plot the graph. Compare the degree and strength measurements for vertex $A$: the vertices adjacent to $A$ are $\{B,C,D,E\}$, yet both measurements have a value of $7$, because the duplicate edge adds one and the loop adds two. Simplify the graph by removing multiple edges and loops, and compute the degree centrality again. Finally, use the count_multiple function to compute weights for each edge in the graph and calculate strength centrality one more time.

    g <- graph.data.frame(data.frame(
      from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'),
      to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')
      ), directed = F)

    # Add edge
    g <- add_edges(g, c('A', 'B', 'A', 'A'))

    # Plot the graph
    plot(g)

    # Degree vs Strength
    degree(g, V(g)$name == 'A', mode = 'all')
## A 
## 7
    strength(g, V(g)$name == 'A', mode = 'all')
## A 
## 7
    # Simplify Degree
    degree(simplify(g, remove.multiple = T, remove.loops = T),
    V(g)$name == 'A',
    mode = 'all')
## A 
## 4
    # Strength with Weights
    E(g)$weight <- count_multiple(g)
    g <- simplify(g)
    strength(g, V(g)$name == 'A', mode = 'all')
## A 
## 7

Eigenvector Centrality

The intuition of eigenvector centrality is to capture the importance of the neighborhood of each vertex. Vertices that are connected to adjacent vertices with higher centrality score higher in this measurement. The interest lies in finding a vector that represents a ranking of relative importance for each vertex. This amounts to solving the eigenvalue problem:

\[Aw = \lambda w\]

Suppose that we assign equal weights $w_1$ to each vertex, compute $Aw_1$, which is similar to computing a weighted degree, and rescale the result by its largest entry $\lambda$ to obtain $w_2$. Then repeat the step with $w_2$ and iterate until the difference between $w_{n}$ and $w_{n+1}$ gets close to zero. Run the eigenvector centrality algorithm below and compare the results with the eigen_centrality function.

# graph: 
    g <- graph.data.frame(data.frame(
      from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'),
      to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')
      ), directed = F)

# Eigenvector centrality algorithm (power iteration)
eigenvector.centrality <- function(g, t = 7) {
  A <- get.adjacency(g)
  n <- nrow(A)
  # Initial scaling constant
  # Alternative: w <- max(A %*% rep(1, n))
  w <- n
  # Create a vector of equal starting weights
  x1 <- rep(1/w, n)
  # Create a vector of zeros (for the initial iteration)
  x0 <- rep(0, n)
  # Precision of the computation
  pre <- 1/10^t
  # Iteration counter
  iter <- 0
  while (sum(abs(x0 - x1)) > pre) {
    # Store the current weights for comparison in the next iteration
    x0 <- x1
    # Compute a weighted degree
    x1 <- as.vector(A %*% x1)
    # Get the biggest weight
    w <- x1[which.max(abs(x1))]
    # Rescale to get a new vector of weights
    x1 <- x1 / w
    # Count the iterations
    iter <- iter + 1
  }
  return(list(vector = x1, value = w, iter = iter))
}

#Compute eigenvector centrality scores
eigen_centrality(g)$vector
##        A        B        C        D        E 
## 1.000000 0.809017 0.809017 0.809017 0.809017
eigenvector.centrality(g, 7)$vector
## [1] 1.000000 0.809017 0.809017 0.809017 0.809017
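
As a cross-check that does not depend on the iteration, the same scores can be read off the leading eigenvector returned by base R's eigen(). This is a minimal sketch with illustrative object names; the vector is rescaled so that its largest entry equals 1, as in the outputs shown for eigen_centrality.

```r
from <- c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D')
to   <- c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')

# Symmetric adjacency matrix with vertex names as dimnames
v <- sort(unique(c(from, to)))
adj <- matrix(0, length(v), length(v), dimnames = list(v, v))
adj[cbind(from, to)] <- 1
adj[cbind(to, from)] <- 1

# Leading eigenpair of the symmetric adjacency matrix
e <- eigen(adj, symmetric = TRUE)
lambda <- e$values[1]
vec <- e$vectors[, 1]

# Rescale so that the largest entry equals 1 (this also fixes the sign)
centrality <- vec / vec[which.max(abs(vec))]
names(centrality) <- rownames(adj)
round(centrality, 6)
```

For this graph the leading eigenvalue is 1 + sqrt(5), so the four outer vertices score (1 + sqrt(5)) / 4, or about 0.809017.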

Weighted Measurements of centrality

The tnet package has implementations of weighted versions of degree, betweenness and closeness. Let's generate a graph of $n$ vertices and compare the difference between the weighted and unweighted centrality measurements.

  n <- 20
    # Generate an undirected weighted graph
    w <- matrix(0L, nrow = n, ncol = n)
    
    # Skewed distribution
    val <- rnbinom(sum(upper.tri(w)), prob=1/5, size = 1)
    w[upper.tri(w)] <- val
    w <- w + t(w)
    g_w <- as.tnet(w, type = 'weighted one-mode tnet')
    g_w.1 <- graph_from_adjacency_matrix(w)
    E(g_w.1)$weight <- count.multiple(g_w.1)

    # Generate an undirected unweighted graph
    uw <- matrix(0L, nrow = n, ncol = n)
    uw[w > 0] <- 1
    g_uw <- graph_from_adjacency_matrix(uw, mode = 'undirected')
    # Strength
    st <- strength(g_w.1)
    d <- degree(g_w.1)
    dw <- degree_w(g_w)
      
    # Betweenness
    bu <- betweenness(g_uw)
    bw <- betweenness_w(g_w)
    
    # Closeness
    cu <- closeness(g_uw)
    cw <- closeness_w(g_w)
    out <-
    data.frame(
    vertex = 1:n,
    degree = d,
    w.degree = dw[,3],
    strength = st,
    betweenness = bu,
    w.betweenness = bw[, 2],
    closeness = cu,
    w.closeness = cw[, 3]
    )
    
   kable(out, format = "markdown")
| vertex | degree | w.degree | strength | betweenness | w.betweenness | closeness | w.closeness |
|-------:|-------:|---------:|---------:|------------:|--------------:|----------:|------------:|
| 1 | 136 | 68 | 1360 | 2.104459 | 7 | 0.0400000 | 0.0027701 |
| 2 | 206 | 103 | 3390 | 2.850419 | 29 | 0.0400000 | 0.0033045 |
| 3 | 142 | 71 | 1310 | 3.630675 | 6 | 0.0434783 | 0.0024859 |
| 4 | 132 | 66 | 1664 | 1.697183 | 7 | 0.0384615 | 0.0029117 |
| 5 | 144 | 72 | 1240 | 2.774312 | 9 | 0.0434783 | 0.0027380 |
| 6 | 84 | 42 | 496 | 2.129892 | 0 | 0.0384615 | 0.0022731 |
| 7 | 104 | 52 | 476 | 4.233103 | 1 | 0.0454545 | 0.0020333 |
| 8 | 148 | 74 | 1496 | 1.338850 | 13 | 0.0370370 | 0.0027492 |
| 9 | 130 | 65 | 762 | 2.880625 | 1 | 0.0434783 | 0.0023273 |
| 10 | 180 | 90 | 3248 | 3.568456 | 18 | 0.0434783 | 0.0031294 |
| 11 | 194 | 97 | 2246 | 3.685820 | 21 | 0.0454545 | 0.0032645 |
| 12 | 116 | 58 | 1212 | 2.174228 | 3 | 0.0384615 | 0.0022673 |
| 13 | 86 | 43 | 590 | 1.078691 | 0 | 0.0370370 | 0.0024054 |
| 14 | 144 | 72 | 988 | 4.050985 | 8 | 0.0434783 | 0.0028601 |
| 15 | 88 | 44 | 404 | 1.909510 | 0 | 0.0384615 | 0.0020143 |
| 16 | 126 | 63 | 822 | 4.247017 | 1 | 0.0454545 | 0.0025134 |
| 17 | 162 | 81 | 1314 | 1.615571 | 18 | 0.0384615 | 0.0033866 |
| 18 | 178 | 89 | 2278 | 2.254387 | 16 | 0.0400000 | 0.0032046 |
| 19 | 104 | 52 | 644 | 1.613647 | 5 | 0.0384615 | 0.0025963 |
| 20 | 196 | 98 | 2500 | 4.162168 | 32 | 0.0454545 | 0.0032223 |

Structural Holes

Structural holes are separations between groups, observed as discontinuities in the network structure. The absence of structural holes signals saturation in the capacity of individuals (vertices) to create novel connections outside their group. Saturation occurs when individuals reach a limit on the number of connections they can create and maintain. When an individual's resources are concentrated in a single group, structural holes are absent or scarce. The scarcity of holes represents a constraint on collaborating outside a single research team, which leads to redundancy of information and an inability to capitalize on novel ideas from different research teams.

g <- graph.data.frame(data.frame(
  from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F'),
  to = c('B', 'E', 'F', 'G', 'D', 'G', 'G', 'G', 'G', 'G')
), directed = F)


plot(g)

A <- as.matrix(get.adjacency(g))
kable(A, format = "markdown")
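|   | A | B | C | D | E | F | G |
|:--|--:|--:|--:|--:|--:|--:|--:|
| A | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| B | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| C | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| D | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| E | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| F | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| G | 1 | 1 | 1 | 1 | 1 | 1 | 0 |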

To calculate the constraint on bridging structural holes, the first step is to compute $p_{ij}$, the proportion of individual $i$'s resources allocated to the connection with $j$.

\[p_{ij} = z_{ij} / z_{iq}\]
# Degree for undirected graphs (sum of resources spent in each connection)
d <- (A * upper.tri(A)) %*% matrix(1, nrow = nrow(A), ncol = 1)

# Matrix of degree
D <- matrix(rep(d, ncol(A)), nrow = nrow(A), ncol = ncol(A))

# Matrix of time and energy invested in others: z_iq = d - z_ij
z_iq <- (D * upper.tri(D)) - (A * upper.tri(A))

# Matrix of the proportion of i's time and energy allocated to j's connections
P <- (A * upper.tri(A))/z_iq
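
For comparison, Burt's constraint can also be computed directly from the full adjacency matrix. The sketch below assumes the common row-normalised definition for a binary, symmetric matrix with no isolated vertices, $p_{ij} = a_{ij} / \sum_q a_{iq}$, and sums only over the contacts of each vertex. This definition and the function name burt_constraint are my assumptions; they are not taken from the upper-triangle computation above.

```r
# Hypothetical helper: Burt's constraint from a binary symmetric adjacency matrix
burt_constraint <- function(A) {
  # p_ij: share of i's connections that go to j
  P <- A / rowSums(A)
  # Indirect investment: sum over q of p_iq * p_qj
  # (the zero diagonal drops the terms q = i and q = j)
  indirect <- P %*% P
  # Square the direct plus indirect share and add up over the contacts of i
  rowSums(((P + indirect)^2) * (A > 0))
}

# Complete graph on 5 vertices
K5 <- matrix(1, nrow = 5, ncol = 5)
diag(K5) <- 0
burt_constraint(K5)
```

In the complete graph every vertex has direct share 1/4 and indirect share 3/16 with each contact, so the constraint is 4 × (7/16)^2 = 0.765625 for every vertex.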

Redundancy of Centrality in Complete graphs

There are some cases in which the measurements of centrality do not provide relevant information. For instance, when the structure of an undirected network approaches a complete graph, in which each pair of distinct vertices is connected by a unique edge, $\forall {i \neq j}:E(v_i,v_j)=1$, every vertex receives the same score. Consider the following example, where I generate a fully connected graph:

## Creating Adjacency Matrix
A <- matrix(rep(1,25), ncol = 5, nrow = 5) 
diag(A) <- 0
G <- graph_from_adjacency_matrix(A, mode = "undirected")

## Creating the table of centrality measures
cols <- data.frame(
    degree(G),
    closeness(G),
    constraint(G),
    transitivity(G, type = 'local'),
    eigen_centrality(G)$vector,
    betweenness(G)
  )


colnames(cols) <- gsub("\\.G\\..*", "", colnames(cols))

kable(cols,  format = "markdown")
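| degree | closeness | constraint | transitivity | eigen_centrality | betweenness |
|-------:|----------:|-----------:|-------------:|-----------------:|------------:|
| 4 | 0.25 | 0.765625 | 1 | 1 | 0 |
| 4 | 0.25 | 0.765625 | 1 | 1 | 0 |
| 4 | 0.25 | 0.765625 | 1 | 1 | 0 |
| 4 | 0.25 | 0.765625 | 1 | 1 | 0 |
| 4 | 0.25 | 0.765625 | 1 | 1 | 0 |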

To see the redundancy of network measurements more clearly, I have created this snippet of code with a simulation. The code calculates the different network statistics while keeping the number of vertices constant and increasing the number of connections until the network is fully connected. The number of connections follows the sequence conn <- c(seq(from=points, to=triag.matrix, by=round(triag.matrix/points)), triag.matrix). Using a loop, we iterate over this sequence and randomly sample which connections to create at each step: sample(triag.matrix, conn[i]).

#### Compute the average for each network centrality ####
n <- 100
triag.matrix <- ((n^2)-n)/2
points <-100
conn <- c(seq(from=points, to=triag.matrix, by=round(triag.matrix/points)), triag.matrix)
i <- 20
out.list <- list()
for(i in seq_along(conn)){
g <- matrix(0, ncol = n, nrow = n)  
val <- sample(triag.matrix, conn[i])
g[upper.tri(g)][val]<- 1
g <- t(g)+g
g <- graph_from_adjacency_matrix(g, mode = 'undirected')
t <- transitivity(g, type = 'localundirected')
t[is.na(t)]<- 0
out.list[[i]] <-
  data.frame(
    degree = mean(degree(g)),
    closeness = mean(closeness(g)),
    betweenness = mean(betweenness(g)),
    transitivity = mean(t),
    eigen.cent = mean(eigen_centrality(g)$vector),
    struc.holes = mean(constraint(g))
  )
}
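
The sampling step inside the loop can be examined in isolation. The sketch below uses a smaller graph, an arbitrary seed and an arbitrary number of connections k (all illustrative choices) to show that picking k positions in the upper triangle produces a graph with exactly k edges:

```r
n <- 10
triag.matrix <- (n^2 - n) / 2   # number of possible edges

k <- 12                          # number of connections to create (arbitrary)
set.seed(7)                      # arbitrary seed for reproducibility

g <- matrix(0, ncol = n, nrow = n)
# Pick k distinct positions among the upper-triangle cells and switch them on
val <- sample(triag.matrix, k)
g[upper.tri(g)][val] <- 1
g <- t(g) + g

# The graph has exactly k edges, so its density is k / triag.matrix
n_edges <- sum(g) / 2
density <- n_edges / triag.matrix
c(edges = n_edges, density = density)
```

When k equals triag.matrix, every cell is switched on and the density reaches 1, which is the fully connected end point of the simulation.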

Now that we have calculated the network statistics, the goal of the next snippet is to plot them. Each plot shows how a specific network statistic changes as the number of connections in the network increases. The vertical axis represents the value of the statistic. The horizontal axis indexes the simulation steps, in which the number of connections increases gradually until the graph is fully connected.

As it becomes clear, when the connectivity level of the graph increases, the network centrality measurements become more and more similar. The results of this simulation suggest that network centrality measurements become redundant as a graph approaches a fully connected network. This is because all nodes in a fully connected network have the same fundamental network structure.

out.data <- rbindlist(out.list)
rm(out.list)
par(mfrow = c(2, 3))

x <- 1:nrow(out.data)
for(j in 1:ncol(out.data)){
y <- out.data[[j]]
y[is.na(y)] <- 0
st <- sqrt(var(y))
plot (x, y, ylim=c(0,max(y)), main = colnames(out.data)[j] )
segments(x,y-st,x,y+st)
epsilon <- 0.02
segments(x-epsilon,y-st,x+epsilon,y-st)
segments(x-epsilon,y+st,x+epsilon,y+st)

}

Network analysis with Data

In this code snippet, you will learn how to perform a basic network analysis on a real-world dataset. We start by downloading a network dataset, specifically the “ca-netscience” dataset, from an online source. This dataset represents a co-authorship network in the field of network science.

The code proceeds to unzip and load the dataset into R. We then construct a graph from the dataset using the igraph package, representing the relationships between authors. And finally, we calculate various network metrics such as degree centrality, closeness centrality, betweenness centrality and other network statistics.

I encourage you to explore similar datasets on the Stanford Large Network Dataset Collection and the NetworkRepository; they have a wide range of sample empirical data for your analyses.

# Download network data
 # setwd('C:/r_tutorial') 
# Data from: https://arxiv.org/abs/physics/0605087    
 download.file('http://nrvis.com/download/data/ca/ca-netscience.zip',
                  destfile = 'ca-netscience.zip')
 unzip('ca-netscience.zip')
 g <- read.csv('ca-netscience.mtx', sep = " ", header = F, skip = 2)
 g <- graph_from_data_frame(g, directed = F)
 E(g)$weight <- count.multiple(g)

 # Perform a basic network analysis (degree, closeness, betweenness, 
 # transitivity, eigenvector.centrality)
 # Store the results in a data.frame for analysis, add a column of ids. 
 
 na <- data.frame(
    id = V(g)$name,
    closeness = closeness(g, mode = "all", normalized = F),
    degree = degree(g, mode = "all", normalized = F, loops = F),
    strength= strength(g),
    betweenness = betweenness(g, directed = F, normalized = F),
    struc_hole = constraint(g),
    transitivity = transitivity(g, "localundirected"),
    eigen_centrality= eigen_centrality(g, scale = F)$vector)
 
kable(na[1:10,],  format = "markdown")
|     | id  | closeness | degree | strength | betweenness | struc_hole | transitivity | eigen_centrality |
|:----|:----|----------:|-------:|---------:|------------:|-----------:|-------------:|-----------------:|
| 2   | 2   | 0.0004627 | 2  | 2  | 0.000     | 0.8650000 | 1.0000000 | 0.0148383 |
| 3   | 3   | 0.0004627 | 2  | 2  | 0.000     | 0.8650000 | 1.0000000 | 0.0148383 |
| 4   | 4   | 0.0005643 | 34 | 34 | 10834.473 | 0.0955073 | 0.1336898 | 0.4142993 |
| 5   | 5   | 0.0006075 | 27 | 27 | 17858.003 | 0.1071098 | 0.1823362 | 0.3562072 |
| 16  | 16  | 0.0005211 | 21 | 21 | 1131.347  | 0.1690397 | 0.2761905 | 0.3464503 |
| 44  | 44  | 0.0005882 | 4  | 4  | 12460.579 | 0.2941086 | 0.5000000 | 0.0885522 |
| 113 | 113 | 0.0005033 | 15 | 15 | 4601.692  | 0.1901735 | 0.2571429 | 0.0222461 |
| 131 | 131 | 0.0004760 | 12 | 12 | 3238.339  | 0.2118647 | 0.3333333 | 0.0213000 |
| 250 | 250 | 0.0005089 | 6  | 6  | 13.200    | 0.3374061 | 0.8666667 | 0.1470503 |
| 259 | 259 | 0.0004735 | 3  | 3  | 0.000     | 0.4537654 | 1.0000000 | 0.0176052 |

Community Detection

Lastly, I would like to show you how to perform a basic community detection analysis. Community detection in network analysis is the process of identifying groups or communities of nodes within a network that are more densely connected to each other than to nodes outside their community. These communities represent subsets of nodes which may possess similar characteristics, functions, or roles within the network.

Here’s a brief explanation of some community detection methods available in igraph:

edge.betweenness.community: This method identifies communities based on edge betweenness centrality. It removes edges with the highest betweenness values iteratively, eventually breaking the network into communities.

fastgreedy.community: It uses a greedy optimization approach to find hierarchical communities by optimizing modularity, a measure of community structure quality.

infomap.community: This method employs the Infomap algorithm, which treats the network as a flow of information and detects communities by minimizing the expected description length of the information flow.

label.propagation.community: Nodes are assigned labels, and communities form based on the propagation of these labels through the network. It’s a simple and fast method.

leading.eigenvector.community: This approach uses spectral graph theory and the leading eigenvector of the network's modularity matrix to detect communities.

multilevel.community: It’s a multilevel algorithm that optimizes modularity by moving nodes between communities, iteratively improving the community structure.

spinglass.community: This method is based on spin glass models from statistical physics and uses simulated annealing to search for the community assignment that minimizes the energy (Hamiltonian) of the model.

walktrap.community: It uses random walks to find communities by detecting nodes that are more likely to be visited together during random walks on the network.
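
Several of the methods above optimise modularity, which compares the number of edges inside communities with the number expected if edges were placed at random given the vertex degrees. As a rough base R sketch, on a toy graph of my own and with a hypothetical helper named modularity_score, the quantity can be computed from the adjacency matrix and a membership vector:

```r
# Two triangles (1-2-3 and 4-5-6) joined by the bridge 3-4
el <- rbind(c(1, 2), c(1, 3), c(2, 3),
            c(4, 5), c(4, 6), c(5, 6),
            c(3, 4))
adj <- matrix(0, 6, 6)
adj[el] <- 1
adj[el[, 2:1]] <- 1

# Hypothetical helper: modularity of a partition given as a membership vector
modularity_score <- function(adj, membership) {
  k <- rowSums(adj)                 # vertex degrees
  m2 <- sum(adj)                    # twice the number of edges
  B <- adj - outer(k, k) / m2       # modularity matrix
  same <- outer(membership, membership, "==")
  sum(B[same]) / m2
}

q_split <- modularity_score(adj, c(1, 1, 1, 2, 2, 2))  # one community per triangle
q_single <- modularity_score(adj, rep(1, 6))           # everything in one community
c(split = q_split, single = q_single)
```

Splitting the graph at the bridge gives a modularity of 5/14, while putting every vertex in one community gives zero, which is why modularity-based methods prefer the split.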

For this example I will use the same data as in the previous snippet.

comms <- c("edge.betweenness.community", "fastgreedy.community", "infomap.community",
"label.propagation.community", "leading.eigenvector.community",
"multilevel.community", "spinglass.community",
"walktrap.community")

V(g)$frame.color <- "white"
V(g)$label <- ""
E(g)$arrow.mode <- 0

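plot.comm <- function(comm){
# Community membership of each vertex under the chosen method
memb <- get(comm)(g)$membership
# colors37 is not defined in this guide; as a stand-in (an assumption),
# draw one rainbow colour per community
V(g)$color <- rainbow(max(memb))[memb]
l <- qgraph.layout.fruchtermanreingold(get.edgelist(g, names = F), vcount=vcount(g),
      area=8*(vcount(g)^2),repulse.rad=(vcount(g)^3.1))
plot(g,layout=l,vertex.size=5, main= comm)
}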

invisible(lapply(comms, plot.comm))