Dynamic network of collaboration in Machine Learning using R and Python.
Introduction
In this blog entry, I will use data from Web of Science to draw a
network of collaboration in the field of Machine Learning. I am going to
disentangle the core universities that have published the top highly
cited 2878
articles about Machine Learning in Web of Science. These
are the most important scientific contributions to the field downloaded
in October 2022.
I use data of Web of Science which is the most widely used database of research publications and citations. Most universities have a license to use this database for research purposes. My query is simple, I use the multidisciplinary Web of Science Core collection searching on the publication’s Tittle, Abstract or Keywords the word “Machine Learning”. Later I filter subsetting only the highly cited publications in the field.
Libraries
library(data.table)
library(ggplot2)
library(svglite)
## Warning: package 'svglite' was built under R version 4.2.1
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
Data
csvs <- dir(pattern = "savedrecs.*csv$")
csvs <- lapply(csvs, fread)
csvs <- rbindlist(csvs)
dim(csvs)
## [1] 2878 72
General Approach
I will use R
and Regex
(regular expressions) to clean the address
field of the publication to extract the affiliations of the authors.
Then I will use the d3graph
library of Python
to produce a dynamic
network of university collaboration.
Use Regex to clean the data
Now, that we have put together the files, it’s time to extract the
data from the Addresses
field. Looking closely at this column:
csvs$Addresses[1:3]
## [1] "[Muehlematter, Urs J.] Univ Zurich, Univ Hosp Zurich, Inst Diagnost & Intervent Radiol, Zurich, Switzerland; [Daniore, Paola; Vokinger, Kerstin N.] Univ Zurich, Inst Law, CH-8001 Zurich, Switzerland"
## [2] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China; [Zou, Quan] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"
## [3] "[Raissi, Maziar; Karniadakis, George Em] Brown Univ, Div Appl Math, Providence, RI 02912 USA"
It is clear that this string has a pattern in which the authors are
surrounded by square brackets, for instance [Muehlematter, Urs J.]
,
and immediately after the record reports the university
Univ Zurich
. If the article is published by two or more different
universities the field will be separated by a ;
semicolon.
# First we aim for separating authors:
samp <- unlist(strsplit(csvs$Addresses[2], "; \\[" ))
samp
## [1] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China"
## [2] "Zou, Quan] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"
# Then we aim to extract the universities
samp <- gsub(".*\\] ", "", samp)
samp
## [1] "Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China"
## [2] "Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"
# We have to clean everything after the comma
samp <- gsub(",.*", "", samp)
samp
## [1] "Hunan Univ" "Univ Elect Sci & Technol China"
Build the dataframe publication-university
Perfect, now we apply this to the whole dataset. We have to give the universities a unique identifier if they collaborate in the same publication.
ml_data <- list()
i <- 1L
for (i in 1L:nrow(csvs)) {
temp <- unlist(strsplit(csvs$Addresses[i], "; \\[" ))
temp <- gsub(".*\\] ", "", temp)
temp <- gsub(",.*", "", temp)
if(length(temp)>0){
ml_data[[i]] <- data.frame(id=i, univ=temp)
}
}
ml_data <- rbindlist(ml_data)
# We have 13074 in total
dim(ml_data)
## [1] 13074 2
# Subset only to the top 50 universities
ml_data <- ml_data[univ %in% names(sort(table(ml_data$univ), decreasing = T)[1:50]), ]
The edgelist
The edgelist is the key input that we need to plot the network. However,
we have to perform additional data manipulation before we have a list of
pairs of universities. The data so far contains pairs of
c(publication, university)
, however, what we need is a list
that contains pairs of universities when they work together in a project
c(university, university)
. The article by RPubs (2022), describes more
about this type of conversion from the theoretical angle. From the data
science perspective, I show here Gonzalez-Sauri (2022) several ways to
perform this transformation.
ml_data
## id univ
## 1: 2 Univ Elect Sci & Technol China
## 2: 4 MIT
## 3: 4 Northwestern Univ
## 4: 7 Chinese Acad Sci
## 5: 8 Univ Elect Sci & Technol China
## ---
## 2661: 2873 Tianjin Univ
## 2662: 2873 Natl Univ Singapore
## 2663: 2875 Wuhan Univ
## 2664: 2875 Univ Cambridge
## 2665: 2877 Univ Calif Berkeley
edge_lst <- merge(ml_data, ml_data, by = "id", allow.cartesian = TRUE)
edge_lst <- edge_lst[edge_lst$univ.x != edge_lst$univ.y, -1L]
dim(edge_lst)
## [1] 6594 2
I want to differentiate the strength of the link or edge, so, I will calculate the betweenness centrality at the level of the edge. I will append this to the dataset and then export it to a csv-file.
g1 <- igraph::graph_from_data_frame(edge_lst, directed = F)
edge_lst[, weight:= edge.betweenness(g1)]
setnames(edge_lst, c("source", "target", "weight"))
fwrite(edge_lst, "edge_lst.csv")
Top Universities in Machine Learning
Just for curiosity lets look at the top 20 universities working in the field of Machine Learning.
top_ml <- sort(table(ml_data$univ), decreasing = T)[1:20]
top_ml <- data.frame(top_ml)
colnames(top_ml) <- c("univ", "pubs")
p <- ggplot(top_ml, aes(x = univ, y = pubs, fill = pubs)) +
geom_bar(stat = "identity") + theme_minimal() + theme(axis.text.x = element_text(
angle = 45,
#vjust = 0.5,
hjust = 1
))
# save the picture
ggsave(file="top_ml_univ.svg", plot=p, width=16, height=10)
p
Python Dynamic Network
For the network, I will use the d3graph
library created by Taskesen
(2022).
import pandas as pd
from d3graph import d3graph, vec2adjmat
# Import data
df = pd.read_csv("./blog/edge_lst.csv")
# Show the input data
print(df)
# Create an adjaceny matrix
adjmat = vec2adjmat(source=df["source"].tolist(), target=df["target"].to_list())
# Initialize
d3 = d3graph()
# Build force-directed graph with default settings
d3.graph(adjmat)
# Show graph
d3.show()
Static Networks
The results are quite nice. First I would like to show the results filtering universities that have 20 edges (co-authored publications) or more.
Then we have the network when we filter only universities with more than 45 connections.
Dynamic Network
Finally, we have the main dynamic network that we can use to display several thresholds of network connections.
References
Gonzalez-Sauri. 2022. “What its the most efficient method to create an edgelist/adjacency matrix from two sets of IDs?” Stack Overflow. https://stackoverflow.com/questions/42764954/what-its-the-most-efficient-method-to-create-an-edgelist-adjacency-matrix-from-t.
RPubs. 2022. “RPubs - Bipartite/Two-Mode Networks in igraph.” https://rpubs.com/pjmurphy/317838.
Taskesen, Erdogan. 2022. “d3graph.” GitHub. https://github.com/erdogant/d3graph.