Dynamic network of collaboration in Machine Learning using R and Python.

Introduction

In this blog entry, I use data from Web of Science to draw a network of collaboration in the field of Machine Learning. The goal is to identify the core universities behind the 2878 most highly cited articles about Machine Learning in Web of Science. The records were downloaded in October 2022 and cover the most highly cited scientific contributions to the field.

The data come from Web of Science, one of the most widely used databases of research publications and citations. Most universities have a license to use this database for research purposes. My query is simple: I search the multidisciplinary Web of Science Core Collection for the term “Machine Learning” in each publication’s Title, Abstract or Keywords. I then keep only the highly cited publications in the field.

Libraries

library(data.table)
library(ggplot2)
library(svglite)

## Warning: package 'svglite' was built under R version 4.2.1

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

Data

csvs <- dir(pattern = "savedrecs.*csv$")
csvs <- lapply(csvs, fread)
csvs <- rbindlist(csvs)
dim(csvs)

## [1] 2878   72

General Approach

I will use R and regular expressions (regex) to clean the address field of each publication and extract the affiliations of the authors. Then I will use the d3graph library for Python to produce a dynamic network of university collaboration.

Use Regex to clean the data

Now that we have put the files together, it’s time to extract the data from the Addresses field. Let’s look closely at this column:

csvs$Addresses[1:3]

## [1] "[Muehlematter, Urs J.] Univ Zurich, Univ Hosp Zurich, Inst Diagnost & Intervent Radiol, Zurich, Switzerland; [Daniore, Paola; Vokinger, Kerstin N.] Univ Zurich, Inst Law, CH-8001 Zurich, Switzerland"                                      
## [2] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China; [Zou, Quan] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"
## [3] "[Raissi, Maziar; Karniadakis, George Em] Brown Univ, Div Appl Math, Providence, RI 02912 USA"

This string follows a clear pattern: each group of authors is surrounded by square brackets, for instance [Muehlematter, Urs J.], and the university, here Univ Zurich, comes immediately after the closing bracket. If the article has authors from two or more affiliations, the entries are separated by a semicolon (;). Note that semicolons also separate author names inside the brackets, so the split below looks for a semicolon followed by an opening bracket.

# First we aim for separating authors:
samp <- unlist(strsplit(csvs$Addresses[2], "; \\[" ))
samp

## [1] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China"
## [2] "Zou, Quan] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"

# Then we aim to extract the universities
samp <- gsub(".*\\] ", "",  samp)
samp

## [1] "Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China"                 
## [2] "Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"

# We have to clean everything after the comma
samp <- gsub(",.*", "",  samp)
samp

## [1] "Hunan Univ"                     "Univ Elect Sci & Technol China"
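For readers who prefer Python, the same three steps can be sketched with the standard `re` module. This is only an illustration of the pattern above, not part of the original pipeline; the function name `extract_universities` is hypothetical.

```python
import re


def extract_universities(address: str) -> list[str]:
    """Return the first affiliation token of each author group in a WoS address string."""
    if not address:
        return []
    # Step 1: split on "; [" so that semicolons inside the author brackets are ignored
    parts = address.split("; [")
    # Step 2: drop everything up to and including the closing bracket and the space after it
    parts = [re.sub(r".*\] ", "", part) for part in parts]
    # Step 3: keep only the text before the first comma (the university name)
    return [re.sub(r",.*", "", part) for part in parts]


address = (
    "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, "
    "Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China; "
    "[Zou, Quan] Univ Elect Sci & Technol China, "
    "Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"
)
print(extract_universities(address))
```

Applied to the first record shown earlier, this approach returns Univ Zurich twice, because two author groups share the same university. The same holds for the R steps, so a university can appear more than once per publication.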

Build the dataframe publication-university

Perfect, now we apply this to the whole dataset. Each publication gets a numeric identifier, so universities that collaborate on the same publication share the same id.

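ml_data <- list()
for (i in seq_len(nrow(csvs))) {
  # Split the address field into one chunk per author group
  temp <- unlist(strsplit(csvs$Addresses[i], "; \\["))
  # Drop the bracketed author names, then everything after the first comma
  temp <- gsub(".*\\] ", "", temp)
  temp <- gsub(",.*", "", temp)
  # Skip records where the split returns nothing (e.g. an empty address field)
  if (length(temp) > 0) {
    ml_data[[i]] <- data.frame(id = i, univ = temp)
  }
}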

ml_data <- rbindlist(ml_data)

# We have 13074 publication-university rows in total
dim(ml_data) 

## [1] 13074     2

# Subset only to the top 50 universities
ml_data <- ml_data[univ %in% names(sort(table(ml_data$univ), decreasing = T)[1:50]),  ] 
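The loop and the top-50 filter can also be sketched in plain Python. This is a rough equivalent under stated assumptions: the helper names are hypothetical, the input is a simple list of address strings, and ties at the cutoff may be resolved differently than by `sort(table(...))` in R.

```python
import re
from collections import Counter


def publication_university_pairs(addresses):
    """Build (publication id, university) pairs, numbering publications from 1."""
    pairs = []
    for pub_id, address in enumerate(addresses, start=1):
        if not address:
            continue  # no address, no rows for this publication
        for part in address.split("; ["):
            univ = re.sub(r",.*", "", re.sub(r".*\] ", "", part))
            pairs.append((pub_id, univ))
    return pairs


def keep_top(pairs, n):
    """Keep only the rows whose university is among the n most frequent ones."""
    counts = Counter(univ for _, univ in pairs)
    top = {univ for univ, _ in counts.most_common(n)}
    return [(pub_id, univ) for pub_id, univ in pairs if univ in top]


addresses = [
    "[A, B.] Univ Zurich, Inst Law, Zurich, Switzerland; [C, D.] Univ Zurich, Zurich, Switzerland",
    "[E, F.] Hunan Univ, Changsha, Peoples R China; [G, H.] Univ Zurich, Zurich, Switzerland",
    "",
    "[I, J.] Brown Univ, Providence, RI 02912 USA",
]
pairs = publication_university_pairs(addresses)
print(keep_top(pairs, 1))
```

The address strings in this example are made up for illustration and do not come from the dataset.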

The edgelist

The edgelist is the key input that we need to plot the network. However, we have to perform additional data manipulation before we have a list of pairs of universities. The data so far contain pairs of c(publication, university), but what we need is a list of pairs of universities that worked together on a publication, c(university, university). The article on RPubs (2022) describes this type of conversion from a theoretical angle. From the data science perspective, I show several ways to perform this transformation in Gonzalez-Sauri (2022).

ml_data

##         id                           univ
##    1:    2 Univ Elect Sci & Technol China
##    2:    4                            MIT
##    3:    4              Northwestern Univ
##    4:    7               Chinese Acad Sci
##    5:    8 Univ Elect Sci & Technol China
##   ---                                    
## 2661: 2873                   Tianjin Univ
## 2662: 2873            Natl Univ Singapore
## 2663: 2875                     Wuhan Univ
## 2664: 2875                 Univ Cambridge
## 2665: 2877            Univ Calif Berkeley

edge_lst <- merge(ml_data, ml_data, by = "id", allow.cartesian = TRUE)
edge_lst <- edge_lst[edge_lst$univ.x != edge_lst$univ.y, -1L]
dim(edge_lst)

## [1] 6594    2
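The self-merge on id can be sketched in Python with `itertools.product`. This illustrates the same idea under the assumption that the input is a list of (id, university) tuples; the function name `edge_list` is hypothetical. As in the R version, every collaboration appears in both directions, and pairs of a university with itself are dropped.

```python
from collections import defaultdict
from itertools import product


def edge_list(pairs):
    """Turn (publication id, university) pairs into (university, university) pairs."""
    by_publication = defaultdict(list)
    for pub_id, univ in pairs:
        by_publication[pub_id].append(univ)
    edges = []
    for univs in by_publication.values():
        # Cartesian product of the publication's universities with themselves,
        # keeping only pairs of two different universities
        edges.extend((a, b) for a, b in product(univs, univs) if a != b)
    return edges


pairs = [(4, "MIT"), (4, "Northwestern Univ"), (7, "Chinese Acad Sci")]
print(edge_list(pairs))
```

A publication with a single university, like id 7 here, contributes no edges.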

I want to differentiate the strength of each link (edge), so I will calculate betweenness centrality at the edge level. I will append it to the dataset as a weight column and then export the result to a CSV file.

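# Build an undirected graph; each row of edge_lst becomes one edge
g1 <- igraph::graph_from_data_frame(edge_lst, directed = FALSE)
# edge_betweenness() is the current name of the deprecated edge.betweenness()
edge_lst[, weight := edge_betweenness(g1)]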
setnames(edge_lst, c("source", "target", "weight")) 
fwrite(edge_lst, "edge_lst.csv")
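To make the weight more concrete: the betweenness of an edge counts how many shortest paths between pairs of nodes pass through that edge. Below is a minimal pure-Python sketch of this idea for a small unweighted, undirected graph, following the breadth-first accumulation scheme of Brandes' algorithm. It is an illustration, not igraph's implementation. It also assumes a simple graph, so duplicate pairs are collapsed, whereas the R code above keeps repeated pairs as parallel edges. Values therefore need not match the igraph output.

```python
from collections import defaultdict, deque


def edge_betweenness(edges):
    """Edge betweenness for an unweighted, undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    score = defaultdict(float)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing path counts among edges
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share

    # Each unordered pair of endpoints was counted from both sides
    return {edge: value / 2.0 for edge, value in score.items()}


# A chain A - B - C: both edges carry two shortest paths
result = edge_betweenness([("A", "B"), ("B", "C")])
print(result)
```

In the chain example, the edge A-B lies on the shortest paths A to B and A to C, so its betweenness is 2, and the same holds for B-C.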

Top Universities in Machine Learning

Just out of curiosity, let’s look at the top 20 universities working in the field of Machine Learning.

top_ml <- sort(table(ml_data$univ), decreasing = T)[1:20]
top_ml <- data.frame(top_ml)
colnames(top_ml) <- c("univ", "pubs")

# Bar chart of publications per university, with rotated axis labels
p <- ggplot(top_ml, aes(x = univ, y = pubs, fill = pubs)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# save the picture
ggsave(filename = "top_ml_univ.svg", plot = p, width = 16, height = 10)

p

Python Dynamic Network

For the network, I will use the d3graph library created by Taskesen (2022).

import pandas as pd
from d3graph import d3graph, vec2adjmat

# Import data
df = pd.read_csv("./blog/edge_lst.csv")

# Show the input data
print(df)

# Create an adjacency matrix
adjmat = vec2adjmat(source=df["source"].tolist(), target=df["target"].tolist())

# Initialize
d3 = d3graph()

# Build force-directed graph with default settings
d3.graph(adjmat)

# Show graph
d3.show()
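The adjacency matrix is the structure that the graph is built from: a square table with one row and one column per university, where each cell records how the row university is linked to the column university. The sketch below builds such a matrix in plain Python by counting how often each pair occurs. It only illustrates the data structure and is not how d3graph implements `vec2adjmat`; the function name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(edges):
    """Count how often each (source, target) pair occurs, as a nested list."""
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] += 1
    return nodes, matrix


edges = [
    ("MIT", "Northwestern Univ"),
    ("Northwestern Univ", "MIT"),
    ("MIT", "Tianjin Univ"),
    ("Tianjin Univ", "MIT"),
]
nodes, matrix = adjacency_matrix(edges)
print(nodes)
for row in matrix:
    print(row)
```

Because the edge list in this post contains every collaboration in both directions, a matrix built this way is symmetric.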

Static Networks

The results are quite nice. First, here is the network restricted to universities that have 20 or more edges (co-authored publications).

Next is the network restricted to universities with more than 45 connections.

Dynamic Network

Finally, we have the main dynamic network that we can use to display several thresholds of network connections.

References

Gonzalez-Sauri. 2022. “What its the most efficient method to create an edgelist/adjacency matrix from two sets of IDs?” Stack Overflow. https://stackoverflow.com/questions/42764954/what-its-the-most-efficient-method-to-create-an-edgelist-adjacency-matrix-from-t.

RPubs. 2022. “RPubs - Bipartite/Two-Mode Networks in igraph.” https://rpubs.com/pjmurphy/317838.

Taskesen, Erdogan. 2022. “d3graph.” GitHub. https://github.com/erdogant/d3graph.
