TITLE: Analysing BibTeX files in R
DATE: 2019-09-12
AUTHOR: John L. Godlee
====================================================================

I have a master BibTeX file called lib.bib, which contains bibliographic information on every paper I've read and pairs with a directory of those papers' .pdf files. I thought it would be fun to see if there were patterns in my reading which I could find by analysing lib.bib in R.

I have a bash script which extracts bibliographic information from each BibTeX entry and stores it in text files:

    #!/bin/bash

    # Extract year of publication
    grep -E "year = [0-9]{4}" ~/google_drive/lib.bib |
        grep -oE "[0-9]{4}" > years.txt

    # Extract all authors per paper, clean
    # Keep only letters, spaces, hyphens and backslashes, then drop the
    # backslashes and turn " and " separators into commas
    grep -E "author = {" ~/google_drive/lib.bib |
        sed 's/.*= {\([^]]*\)},.*/\1/g' |
        sed 's/[^A-Za-z \-]//g' |
        sed 's/\\//g' |
        sed 's/ and /,/g' > authors.txt

    # Extract journal
    grep -E "journal = {|publisher = {|url = |institution = {|organization = {|school = {" ~/google_drive/lib.bib |
        sed 's/.*{\([^]]*\)}.*/\1/g' > journals.txt

    Rscript analysis.R

It makes three files: one containing the year of publication, one containing the authors of each publication, and one containing the name of the journal or publisher. Extracting author names was the most difficult part because names are not always formatted the same way. Some surnames contain a braced particle, {van der} Putten for example, where the initial of the surname is P rather than v.

One interesting trick I found was using sed to extract the text between the first occurrence of one character and the last occurrence of another, ignoring any repeats of those characters in between.
I used this to extract author names between { } despite some authors having {van der} in their surname:

    sed 's/.*= {\([^]]*\)},.*/\1/g'

Then the bash script calls an R script:

    # Packages
    library(dplyr)
    library(ggplot2)
    library(igraph)
    library(ggnetwork)

    # Load data
    years <- readLines("years.txt")
    journals <- readLines("journals.txt")
    authors <- readLines("authors.txt")

    # Clean
    authors_list <- strsplit(x = authors, split = ",")

    papers <- data.frame(years = as.numeric(years), journals)
    papers$authors <- authors_list
    papers$num_authors <- sapply(authors_list, length)

papers$authors is a list column, where each row holds a vector of author names for one paper.

The first plot shows the relationship between year of publication and number of authors:

    # Plot year of publication against number of authors
    # (0 is left out of the breaks because log(0) is undefined)
    year_author_correl <- ggplot(papers, aes(x = years, y = num_authors)) +
        geom_point() +
        theme_classic() +
        labs(x = "Year", y = "authors (n)") +
        scale_y_continuous(trans = "log",
            breaks = c(1,2,3,4,6,8,10,20,40,60,80,100,140,180))

![Plot of year of publication and number of authors](https://johngodlee.xyz/img_full/bibtex_analysis/year_author_correl.png)

The next two plots are bar graphs of the frequency of the most common authors (first and co-authors) and the most common first authors:

    ## Get list of most common authors
    author_all <- unlist(papers$authors)

    ## Get top ten authors
    author_top_ten_df <- data.frame(sort(table(author_all), decreasing = TRUE)[1:10])
    names(author_top_ten_df) <- c("author", "freq")

    ## Plot
    author_top_ten <- ggplot(author_top_ten_df, aes(x = author, y = freq)) +
        geom_bar(stat = "identity", aes(fill = author), colour = "black") +
        theme_classic() +
        theme(legend.position = "none") +
        labs(x = "Author", y = "Frequency")

    ## Get top first authors (first() is from dplyr)
    author_common <- unlist(lapply(papers$authors, first))
    author_common_df <- data.frame(sort(table(author_common), decreasing = TRUE)[1:5])
    names(author_common_df) <- c("author", "freq")
    author_common_df_clean <- author_common_df %>%
        filter(freq > 1)

    ## Plot
    first_author_top <- ggplot(author_common_df_clean, aes(x = author, y = freq)) +
        geom_bar(stat = "identity", aes(fill = author), colour = "black") +
        theme_classic() +
        theme(legend.position = "none") +
        labs(x = "Author", y = "Frequency")

![Top ten authors in my collection](https://johngodlee.xyz/img_full/bibtex_analysis/author_top_ten.png)

![Most common first authors in my collection](https://johngodlee.xyz/img_full/bibtex_analysis/first_author_top.png)

The final plot is a network graph of shared authorship. This isn't perfect. Ideally I would like to draw ellipses around groups of authors on the same paper, to see whether groups of authors tend to publish together multiple times, but I couldn't figure out how to do it with an igraph object:

    ## Create edge list
    authors_list_df <- list()
    for (i in seq_along(papers$authors)) {
        authors_list_df[[i]] <- data.frame(author = papers$authors[[i]],
            stringsAsFactors = FALSE)
        authors_list_df[[i]]$paper_id <- rep(i, times = length(papers$authors[[i]]))
    }

    authors_df <- bind_rows(authors_list_df)

    # Self-join on paper_id to pair up co-authors. author.x < author.y
    # drops self-pairs and keeps each pair once per paper, so the
    # undirected graph has no duplicated edges
    authors_edge_df <- authors_df %>%
        inner_join(., authors_df, by = "paper_id") %>%
        filter(author.x < author.y) %>%
        count(author.x, author.y, paper_id)

    # paper_id of each edge (not used in the plot below)
    authors_vertex_meta <- authors_edge_df[,3]

    authors_edge <- authors_edge_df[,1:2] %>%
        graph_from_data_frame(., directed = FALSE)

    authors_edge_fort <- fortify(authors_edge)

    ## Plot
    author_network <- ggplot(authors_edge_fort) +
        geom_edges(aes(x = x, y = y, xend = xend, yend = yend), size = 0.5) +
        geom_point(aes(x = x, y = y), colour = "black", fill = "grey", shape = 21) +
        theme_void()

![Network of authorship](https://johngodlee.xyz/img_full/bibtex_analysis/author_network.png)
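
A rough way to check whether the same authors publish together more than once, without drawing anything, is to count co-author pairs straight from the text file. The following is only a minimal sketch and not part of the scripts above: the pair_counts function and the sample names are made up for illustration, and it assumes the one-paper-per-line, comma-separated format that the bash script writes to authors.txt.

```shell
#!/bin/bash

# Hypothetical sample in the same layout as authors.txt:
# one paper per line, authors separated by commas.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Smith J,Jones A,Brown K
Jones A,Smith J
Brown K,Green P
EOF

# pair_counts FILE
# Print every unordered pair of co-authors with the number of papers
# they share, most frequent pair first.
pair_counts() {
    awk -F',' '{
        # Emit each pair once per paper, alphabetically ordered so that
        # "A,B" and "B,A" are counted as the same pair
        for (i = 1; i < NF; i++)
            for (j = i + 1; j <= NF; j++)
                if ($i < $j) print $i "\t" $j
                else print $j "\t" $i
    }' "$1" | sort | uniq -c | sort -k1,1nr
}

pair_counts "$sample"
```

With the sample above, the only pair that appears on more than one paper is Jones A and Smith J, so it is listed first with a count of 2. Running pair_counts authors.txt on the real file should give the same kind of table for the whole collection, although it only counts pairs, not larger groups of authors.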