TITLE: Analysing BibTeX files in R
DATE: 2019-09-12
AUTHOR: John L. Godlee
====================================================================


I have a master BibTeX file called lib.bib, which contains 
bibliographic information on every paper I've read, which pairs 
with a directory of those papers' .pdf files. I thought it would be 
fun to see if there were patterns in my reading which I could find 
by analysing lib.bib in R.

I have a bash script which extracts bibliographic information from 
each BibTeX entry and stores it as a text file:

    #!/bin/bash

    # Extract year of publication
    cat ~/google_drive/lib.bib | grep -E "year = [0-9]{4}" | grep 
-oE "[0-9]{4}" > years.txt

    # Extract all authors per paper, clean
    cat ~/google_drive/lib.bib | grep -E "author = {" | sed 's/.*= 
{\([^]]*\)},.*/\1/g' | sed 's/[^A-z \-]//g' | sed 's/\\//g' | sed 
's/ and /,/g' > authors.txt

    # Extract journal
    cat ~/google_drive/lib.bib | grep -E "journal = {|publisher = 
{|url = |institution = {|organization = {|school = {" | sed 
's/.*{\([^]]*\)}.*/\1/g' > journals.txt

    Rscript analysis.R

It makes three files, one containing the year of publication, one 
containing the authors for each publication, and one containing the 
publication name.

Extracting author names was the most difficult because names are 
not always formatted the same, especially those names which contain 
{van der} Putten for example, where the actual initial of the 
surname is not v but P in the example above. One interesting trick 
I found was using sed to extract text between the first occurrence 
of one character, and the last occurrence of another character, 
ignoring repeats of those characters. I used this to extract author 
names between { } despite some authors having {van der} in their 
surname:

    sed 's/.*= {\([^]]*\)},.*/\1/g'

Then the bash script calls an R script:

    # Packages
    library(dplyr)
    library(ggplot2)
    library(igraph)
    library(ggnetwork)

    # Load data
    years <- readLines("years.txt")
    journals <- readLines("journals.txt")
    authors <- readLines("authors.txt")

    # Clean
    authors_list <- strsplit(x = authors, split = ",")

    papers <- data.frame(years = as.numeric(years), journals)

    papers$authors <- authors_list

    papers$num_authors <- sapply(authors_list, length)

papers$authors actually contains a list where each row is a vector 
of author names for a paper

The first plot draws a correlation between year of publication and 
number of authors:

    # Plot correlation between year of publication and number of 
authors
    year_author_correl <- ggplot(papers, aes(x = years, y = 
num_authors)) +
      geom_point() +
      theme_classic() +
      labs(x = "Year", y = "authors (n)") +
      scale_y_continuous(trans = 'log', breaks = 
c(0,1,2,3,4,6,8,10,20,40,60,80,100,140,180))

  ![Plot of year of publication and number of 
authors](https://johngodlee.xyz/img_full/bibtex_analysis/year_author
_correl.png)

The next two plots are bar graphs of the frequency of the most 
common authors (first and co-authors) and the most common first 
authors:

    ## Get list of most common authors
    author_all <- unlist(papers$authors)

    ## Get top ten authors
    author_top_ten_df <- data.frame(sort(table(author_all), 
decreasing = TRUE)[1:10])
    names(author_top_ten_df) <- c("author", "freq")

    ## Plot
    author_top_ten <- ggplot(author_top_ten_df, aes(x = author, y = 
freq)) +
      geom_bar(stat = "identity", aes(fill = author), colour = 
"black") +
      theme_classic() +
      theme(legend.position = "none") +
      labs(x = "Author", y = "Frequency")

    ## Get top first authors
    author_common <- unlist(lapply(papers$authors, first))

    author_common_df <- data.frame(sort(table(author_common), 
decreasing = TRUE)[1:5])

    names(author_common_df) <- c("author", "freq")

    author_common_df_clean <- author_common_df %>%
      filter(freq > 1)

    ## Plot
    first_author_top <- ggplot(author_common_df_clean, aes(x = 
author, y = freq)) +
      geom_bar(stat = "identity", aes(fill = author), colour = 
"black") +
      theme_classic() +
      theme(legend.position = "none") +
      labs(x = "Author", y = "Frequency")

  ![Top ten authors in my 
collection](https://johngodlee.xyz/img_full/bibtex_analysis/author_t
op_ten.png)

  ![Top ten first authors in my 
collection](https://johngodlee.xyz/img_full/bibtex_analysis/first_au
thor_top.png)

The final plot is a network graph of shared authorship. This isn't 
perfect. What I would ideally like is to draw ellipses around 
groups of authors on the same paper, to see whether groups of 
authors tend to publish together multiple times, but I couldn't 
figure out how to do it with an igraph object:

    ## Create edge list
    authors_list_df <- list()

    for(i in 1:length(papers$authors)){
      authors_list_df[[i]] <- data.frame(author = 
papers$authors[[i]])
      authors_list_df[[i]]$paper_id <- rep(i, times = 
length(papers$authors[[i]]))
    }

    authors_df <- bind_rows(authors_list_df)

    authors_edge_df <- authors_df %>%
      inner_join(., authors_df, by = "paper_id") %>%
      filter(author.x != author.y) %>%
      count(author.x, author.y, paper_id)

    authors_vertex_meta <- authors_edge_df[,3]

    authors_edge <- authors_edge_df[,1:2] %>%
      graph_from_data_frame(., directed = FALSE)

    authors_edge_fort <- fortify(authors_edge)

    ## Plot
    author_network <- ggplot(authors_edge_fort) +
      geom_edges(aes(x = x, y = y, xend = xend, yend = yend), size 
= 0.5) +
      geom_point(aes(x = x, y = y), colour = "black", fill = 
"grey", shape = 21) +
      theme_void()

  ![Network of 
authorship](https://johngodlee.xyz/img_full/bibtex_analysis/author_n
etwork.png)