(C) PLOS. This content originally appeared in PLOS Biology (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001470) and is reproduced under a Creative Commons Attribution (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/

Examining linguistic shifts between preprints and publications

David N. Nicholson (Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America); Vincent Rubinetti (Center for Health AI, University of Colorado School of Medicine)

Date: 2022-02

Abstract

Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies of preprints within bioRxiv have largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare and contrast linguistic features within bioRxiv preprints to published biomedical text as a whole, as this is an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint–peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify the journals and articles that are most linguistically similar to a bioRxiv or medRxiv preprint, as well as observe where the preprint would be positioned within a published article landscape.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: Marvin Thielk receives a salary from Elsevier Inc., where he contributes NLP expertise to health content operations. Elsevier did not restrict the results or interpretations that could be published in this manuscript.

Funding: This work was supported by grants from the Gordon and Betty Moore Foundation (GBMF4552) and the National Institutes of Health’s National Human Genome Research Institute (NHGRI) under award R01 HG010067 to CSG and the National Institutes of Health’s NHGRI under award T32 HG00046 to DNN. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: An online version of this manuscript is available under a Creative Commons Attribution License at https://greenelab.github.io/annorxiver_manuscript/. Source code for the research portions of this project is dual licensed under the BSD 3-Clause and Creative Commons Public Domain Dedication Licenses at https://github.com/greenelab/annorxiver.
The preprint similarity search website can be found at https://greenelab.github.io/preprint-similarity-search/, and code for the website is available under a BSD-2-Clause Plus Patent License at https://github.com/greenelab/preprint-similarity-search. All corresponding data for every figure in this manuscript are available at https://github.com/greenelab/annorxiver/blob/master/FIGURE_DATA_SOURCE.md. Full text access for the bioRxiv repository is available at https://www.biorxiv.org/tdm. Access to PubMed Central’s Open Access subset is available on NCBI’s FTP server at https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/. The New York Times Annotated Corpus (NYTAC) can be accessed from the Linguistic Data Consortium at https://catalog.ldc.upenn.edu/LDC2008T19, where there may be a $150 to $300 fee depending on membership status.

Introduction

The dissemination of research findings is key to science. Initially, much of this communication happened orally [1]. During the 17th century, the predominant form of communication shifted to personal letters shared from one scientist to another [1]. Scientific journals did not become a predominant mode of communication until the 19th and 20th centuries, long after the first journal was created [1–3]. Although scientific journals became the primary method of communication, they added high maintenance costs and long publication times to scientific discourse [2,3]. Some scientists’ solution to these issues has been to communicate through preprints, which are scholarly works that have yet to undergo the peer review process [4,5].

Preprints are commonly hosted on online repositories, where users have open and easy access to these works. Notable repositories include arXiv [6], bioRxiv [7], and medRxiv [8]; however, there are over 60 different repositories available [9]. The burgeoning uptake of preprints in the life sciences has been examined through research focused on metadata from the bioRxiv repository. For example, life science preprints are being posted at an increasing rate [10]. Furthermore, these preprints are being rapidly shared on social media, routinely downloaded, and cited [11]. Some preprint categories are shared on social media by both scientists and nonscientists [12]. About two-thirds to three-quarters of preprints are eventually published [13,14], and life science articles that have a corresponding preprint version are cited and discussed more often than articles without one [15–17]. Preprints take an average of 160 days to be published in the peer-reviewed literature [18], and those with multiple versions take longer to publish [18].

The rapid uptake of preprints in the life sciences also poses challenges. Preprint repositories receive a growing number of submissions [19]. Linking preprints with their published counterparts is vital to maintaining consistency in scholarly discourse, but this task is challenging to perform manually [16,20,21]. Errors and omissions in linkage result in missing links and, consequently, erroneous metadata. Furthermore, repositories based on standard publishing tools are not designed to show how the textual content of preprints is altered by the peer review process [19].
Certain scientists have expressed concern that competitors could scoop them by making results available before publication [19,22]. Preprint repositories by definition do not perform in-depth peer review, which can result in posted preprints containing inconsistent results or conclusions [17,20,23,24]; however, an analysis of preprints posted at the beginning of 2020 revealed that over 50% underwent minor changes in the abstract text as they were published, but over 70% did not change or only had simple rearrangements to panels and tables [25]. Despite a growing emphasis on using preprints to examine the publishing process within the life sciences, how these findings relate to the text of all documents in bioRxiv has yet to be examined.

Textual analysis uses linguistic, statistical, and machine learning techniques to analyze and extract information from text [26,27]. For instance, scientists have analyzed the linguistic similarities and differences of biomedical corpora [28–30]. Scientists have also provided the community with a number of tools that aid future text mining systems [31–33], as well as advice on how to train and test future text processing systems [34–36]. Here, we use textual analysis to examine the bioRxiv repository, placing a particular emphasis on understanding the extent to which full-text research can address hypotheses derived from the study of metadata alone.

To understand how preprints relate to the traditional publishing ecosystem, we examine the linguistic similarities and differences between preprints and peer-reviewed text and observe how linguistic features change during the peer review and publishing process. We hypothesize that preprints and biomedical text will appear to have similar characteristics, especially when controlling for the differential uptake of preprints across fields. Furthermore, we hypothesize that document embeddings [37,38] provide a versatile way to disentangle linguistic features along with serving as a suitable medium for improving preprint repository functionality. We test this hypothesis by producing a linguistic landscape of bioRxiv preprints, detecting preprints that change substantially during publication, and identifying journals that publish manuscripts that are linguistically similar to a target preprint. We encapsulate our findings through a web app that projects a user-selected preprint onto this landscape and suggests journals and articles that are linguistically similar. Our work reveals how linguistically similar and dissimilar preprints are to peer-reviewed text, quantifies linguistic changes that occur during the peer review process, and highlights the feasibility of document embeddings both for improving preprint repository functionality and for examining peer review’s effect on publication time.

Materials and methods

Corpora examined

Text analytics is generally comparative in nature, so we selected 3 relevant text corpora for analysis: the bioRxiv corpus, which is the target of the investigation; the PubMed Central Open Access (PMCOA) corpus, which represents the peer-reviewed biomedical literature; and the New York Times Annotated Corpus (NYTAC), which is used as a representative of general English text.

bioRxiv corpus

bioRxiv [7] is a repository for life sciences preprints. We downloaded an XML snapshot of this repository on February 3, 2020, from bioRxiv’s Amazon S3 bucket [39]. This snapshot contained the full text and image content of 98,023 preprints. Preprints on bioRxiv are versioned, and in our snapshot, 26,905 out of 98,023 preprints contained more than one version. When preprints had multiple versions, we used the latest one unless otherwise noted.
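To make the text-extraction step concrete, the sketch below pulls body text from a single preprint file, assuming the snapshot stores JATS-formatted XML; the file name and the exact tag layout are illustrative rather than details taken from the authors' pipeline.

```python
# Minimal sketch: extract body text from one JATS-formatted bioRxiv XML file.
# The <body> tag and the example file name are assumptions, not details taken
# from the snapshot's documentation.
from lxml import etree

def extract_body_text(xml_path):
    tree = etree.parse(xml_path)
    body = tree.find(".//body")  # in JATS, the main text sits under <body>
    if body is None:
        return ""
    # itertext() walks every text node, covering section titles and paragraphs.
    return " ".join(chunk.strip() for chunk in body.itertext() if chunk.strip())

# Example usage (hypothetical file name):
# text = extract_body_text("biorxiv_snapshot/10.1101_2020.02.03.000001.xml")
```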
Authors submitting preprints to bioRxiv can select one of 29 different categories and tag the type of article: a new result, a confirmatory finding, or a contradictory finding. A few preprints in this snapshot were later withdrawn from bioRxiv; when a preprint is withdrawn, its content is replaced with the reason for withdrawal. We encountered a total of 72 withdrawn preprints within our snapshot. After removing them, we were left with 97,951 preprints for our downstream analyses.

PubMed Central Open Access corpus

PubMed Central (PMC) is a digital archive maintained by the United States National Institutes of Health’s National Library of Medicine (NIH/NLM) that contains full-text biomedical and life science articles [40]. Paper availability within PMC is mainly dependent on the journal’s participation level [41]. Articles appear in PMC either as accepted author manuscripts (Green Open Access) or via open access publishing at the journal (Gold Open Access [42]). Individual journals have the option to fully participate in submitting articles to PMC, to selectively participate by sending only a few papers to PMC, to submit only papers covered by NIH’s public access policy [43], or not to participate at all; however, individual articles published with the CC BY license may be incorporated. As of September 2019, PMC had 5,725,819 articles available [44]. Of these roughly 5.7 million articles, about 3 million were open access (PMCOA) and available for text processing systems [32,45]. PMC also contains a resource that holds author manuscripts that have already passed the peer review process [46]. Since these manuscripts have already been peer reviewed, we excluded them from our analysis, as the scope of our work is focused on examining the beginning and end of a preprint’s life cycle.

We downloaded a snapshot of the PMCOA corpus on January 31, 2020. This snapshot contained many types of articles: literature reviews, book reviews, editorials, case reports, research articles, and more. We used only research articles, which align with the intended role of bioRxiv, and we refer to these articles as the PMCOA corpus.

The New York Times Annotated Corpus

The NYTAC [47] is a collection of newspaper articles from the New York Times dating from January 1, 1987 to June 19, 2007. This collection contains over 1.8 million articles, 1.5 million of which have undergone manual entity tagging by library scientists [47]. We downloaded this collection on August 3, 2020, from the Linguistic Data Consortium (see Software and data availability section) and used the entire collection as a negative control for our corpora comparison analysis.

Mapping bioRxiv preprints to their published counterparts

We used CrossRef [48] to identify bioRxiv preprints linked to a corresponding published article. We accessed CrossRef on July 7, 2020, and successfully linked 23,271 preprints to their published counterparts. Of those 23,271 preprint–published pairs, only 17,952 had a published version present within the PMCOA corpus. For analyses that involved published links, we focused only on this subset of preprint–published pairs.
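As an illustration of this linkage step, the sketch below queries CrossRef's public REST API for the published DOI associated with a preprint DOI. The route and the "is-preprint-of" relation reflect CrossRef's general works API rather than the authors' exact query, so treat this as a hedged example only.

```python
# Hedged sketch: ask CrossRef whether a preprint DOI has a published counterpart.
# The "relation" / "is-preprint-of" fields follow CrossRef's public works API;
# the authors' exact access route is not specified in the text.
import requests

def published_doi_for_preprint(preprint_doi):
    resp = requests.get(f"https://api.crossref.org/works/{preprint_doi}", timeout=30)
    resp.raise_for_status()
    relation = resp.json()["message"].get("relation", {})
    links = relation.get("is-preprint-of", [])
    return links[0]["id"] if links else None  # published DOI, or None if unlinked

# Example usage with a hypothetical bioRxiv DOI:
# published_doi_for_preprint("10.1101/2020.02.03.000001")
```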
Comparing corpora

We compared the bioRxiv, PMCOA, and NYTAC corpora to assess the similarities and differences between them. We used the NYTAC corpus as a negative control to assess how similar the 2 life sciences corpora are to each other compared with non–life sciences text. All corpora contain many words that do not carry meaning (conjunctions, prepositions, etc.) or that occur with high frequency. These words are termed stopwords and are often removed to improve text processing pipelines. Along with stopwords, all corpora contain both words and nonword entities (for instance, numbers or symbols such as ±), which we refer to together as tokens to avoid confusion. We calculated the following characteristic metrics for each corpus: the number of documents, the number of sentences, the total number of tokens, the number of stopwords, the average length of a document, the average length of a sentence, the number of negations, the number of coordinating conjunctions, the number of pronouns, and the number of past tense verbs.

spaCy is a lightweight and easy-to-use Python package designed to preprocess and filter text [49]. We used spaCy’s “en_core_web_sm” model [49] (version 2.2.3) to preprocess all corpora and filter out 326 stopwords using spaCy’s default settings. Following that cleaning process, we calculated the frequency of every token across all corpora. Because many tokens were unique to one corpus or another and were observed at low frequency, we focused on the union of the top 0.05% (approximately 100) most frequently occurring tokens within each corpus. We generated a contingency table for each token in this union and calculated the odds ratio along with the 95% confidence interval [50].

We measured corpora similarity by calculating the Kullback–Leibler (KL) divergence across all corpora along with a token enrichment analysis. KL divergence measures the extent to which 2 distributions differ from each other: a low value indicates that the 2 distributions are similar, while a high value indicates that they differ. The optimal number of tokens to use when calculating the KL divergence is unknown, so we calculated this metric over a range from the 100 most frequently occurring tokens shared between 2 corpora up to the 5,000 most frequently occurring tokens.

Constructing a document representation for life sciences text

We sought to build a language model to quantify linguistic similarities of biomedical preprints and articles. Word2vec is a suite of neural networks designed to model linguistic features of tokens based on their appearance in text. These models are trained either to predict a token based on its sentence context, called a continuous bag of words (CBOW) model, or to predict the context based on a given token, called a skipgram model [37]. Through these prediction tasks, both networks learn latent linguistic features that are helpful for downstream tasks, such as identifying similar tokens. We used gensim [51] (version 3.8.1) to train a CBOW [37] model over all the main text within each preprint in the bioRxiv corpus. Determining the best number of dimensions for token embeddings can be a nontrivial task; however, it has been shown that optimal performance lies between 100 and 1,000 dimensions [52]. We chose to train the CBOW model using 300 hidden nodes, a batch size of 10,000 tokens, and 20 training epochs. We set a fixed random seed and used gensim’s default settings for all other hyperparameters.

Once trained, every token present within the CBOW model is associated with a dense vector representing latent features captured by the network. We used these token vectors to generate a document representation for every article within the bioRxiv and PMCOA corpora. We used spaCy to lemmatize each token of each document and then averaged the vectors of every lemmatized token that was present in both the CBOW model and the individual document [38]. Any token present within the document but not in the CBOW model is ignored during this calculation.
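The sketch below illustrates both steps: training a CBOW word2vec model with gensim and averaging lemmatized token vectors into a document embedding. It assumes gensim 3.x argument names (`size`, `iter`; gensim 4 renamed these to `vector_size` and `epochs`), `preprint_sentences` is a hypothetical iterable of tokenized sentences, and the seed value shown is illustrative, since the text only states that a fixed seed was used.

```python
# Sketch of the CBOW training and document-averaging steps described above.
# `preprint_sentences` (an iterable of token lists) and the seed value are
# illustrative; hyperparameter names follow gensim 3.x.
import numpy as np
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # lemmatizer only

cbow = Word2Vec(
    sentences=preprint_sentences,
    sg=0,                # sg=0 selects CBOW rather than skipgram
    size=300,            # 300 hidden nodes / embedding dimensions
    batch_words=10_000,  # batch size of 10,000 tokens
    iter=20,             # 20 training epochs
    seed=1,              # some fixed seed (the exact value is not specified)
)

def document_embedding(text, model):
    """Average the vectors of every lemmatized token the CBOW model knows."""
    lemmas = [tok.lemma_ for tok in nlp(text)]
    vectors = [model.wv[lemma] for lemma in lemmas if lemma in model.wv]
    # Tokens absent from the CBOW vocabulary are simply skipped.
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)
```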
Visualizing and characterizing preprint representations

We sought to visualize the landscape of preprints and determine the extent to which their representation as document vectors corresponded to author-supplied document labels. We used principal component analysis (PCA) [53] to project bioRxiv document vectors into a low-dimensional space. We trained this model using scikit-learn’s [54] implementation of a randomized solver [55] with a random seed of 100, an output of 50 principal components (PCs), and default settings for all other hyperparameters. After training the model, every preprint within the bioRxiv corpus receives a score for each generated PC.

We sought to uncover concepts captured within the generated PCs and used the cosine similarity metric to examine these concepts. This metric takes 2 vectors as input and outputs a score between −1 (most dissimilar) and 1 (most similar). We used this metric to score the similarity between each generated PC and every token within our CBOW model. We report the top 100 positive and negative scoring tokens as word clouds. The size of each word corresponds to the magnitude of similarity, and color represents a positive (orange) or negative (blue) association.

Discovering unannotated preprint–publication relationships

The bioRxiv maintainers have automated procedures to link preprints to peer-reviewed versions, and many journals require authors to update preprints with a link to the published version. However, this automation is primarily based on exact matching of specific preprint attributes. If authors change the title between a preprint and the published version (for instance, [56,57]), this change will prevent bioRxiv from automatically establishing a link. Furthermore, if the authors do not report the publication to bioRxiv, the preprint and its corresponding published version are treated as distinct entities despite representing the same underlying research. We hypothesize that close proximity in the document embedding space could match preprints with their corresponding published versions. If this finding holds, we could use this embedding space to fill in links missed by existing automated processes.

We used the subset of preprint–published pairs annotated in CrossRef as described above to calculate the distribution of preprint-to-published distances. We calculated this distribution by taking the Euclidean distance between each preprint’s embedding coordinates and the coordinates of its corresponding published version. We also calculated a background distribution, which consisted of the distance between each preprint with an annotated publication and a randomly selected article from the same journal. We compared the 2 distributions to determine whether they differed, as a significant difference would indicate that this embedding method can tell preprint–published pairs apart from unrelated article pairs. After comparing the 2 distributions, we calculated distances between preprints without a published version link and PMCOA articles that were not matched to a corresponding preprint. We filtered out any potential links with distances greater than the minimum value of the background distribution, as we considered these pairs to be true negatives. Lastly, we binned the remaining pairs based on percentiles of the annotated pairs distribution at [0, 25th percentile), [25th, 50th percentile), [50th, 75th percentile), and [75th percentile, minimum background distance). We randomly sampled 50 articles from each bin and shuffled these 4 sets to produce a list of 200 potential preprint–published pairs in randomized order. We supplied these pairs to 2 coauthors, who manually determined whether each link between a preprint and a putative matched version was correct or incorrect. After the curation process, we encountered 8 disagreements between the reviewers. We supplied these pairs to a third scientist, who carefully reviewed each case and made a final decision. Using this curated set, we evaluated the extent to which distance in the embedding space revealed valid but unannotated links between preprints and their published versions.
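A hedged sketch of the distance comparison follows. The containers and helpers here (`embeddings` mapping document IDs to vectors, `annotated_pairs`, `same_journal_sample`, and `candidate_pairs`) are hypothetical stand-ins for the data structures described above.

```python
# Sketch of the observed vs. background distance comparison and the cutoff
# used to discard implausible preprint–article links. All container and
# helper names here are illustrative.
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Observed distribution: each preprint vs. its CrossRef-annotated published version.
observed = [euclidean(embeddings[pre], embeddings[pub]) for pre, pub in annotated_pairs]

# Background distribution: each preprint vs. a random article from the same journal.
background = [
    euclidean(embeddings[pre], embeddings[same_journal_sample(pub)])
    for pre, pub in annotated_pairs
]

# Candidate links farther apart than the closest background pair are treated
# as true negatives and dropped before the manual curation step.
cutoff = min(background)
plausible = [(pre, art, dist) for pre, art, dist in candidate_pairs if dist < cutoff]
```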
Measuring time duration for preprint publication process

Preprints can take varying amounts of time to be published. We sought to measure the time required for preprints to be published in the peer-reviewed literature and compared this measurement across author-selected preprint categories as well as individual preprints. First, we queried bioRxiv’s application programming interface (API) to obtain the date a preprint was posted onto bioRxiv as well as the date it was accepted for publication. We did not include preprint matches found by our paper matching approach (see Discovering unannotated preprint–publication relationships). We measured time elapsed as the difference between the date a preprint was first posted on bioRxiv and its publication date. Along with the time elapsed, we also recorded the number of different preprint versions posted onto bioRxiv.

We used these captured data to apply the Kaplan–Meier estimator [58], via the KaplanMeierFitter function from the lifelines [59] (version 0.25.6) Python package, to calculate the half-life of preprints across all preprint categories within bioRxiv. We treated preprints that had yet to be published as surviving (censored) observations. We encountered 123 cases where the preprint posting date came after the publication date, resulting in a negative time difference, as previously reported [60]. We removed these preprints from this analysis, as they are incompatible with the rules of the bioRxiv repository.

After our half-life calculation, we measured the textual difference between preprints and their corresponding published versions by calculating the Euclidean distance between their respective embedding representations. This metric can be difficult to interpret in the context of textual differences, so we sought to contextualize the meaning of a distance unit. We randomly sampled, with replacement, pairs of preprints from the Bioinformatics topic area, as this category is well represented within bioRxiv and contains a diverse set of research articles. We calculated the distance between the 2 preprints in each pair, repeated this process 1,000 times, and reported the mean. We then repeated the above procedure using every preprint within bioRxiv as a whole. These 2 means serve as normalized benchmarks to compare against, as distance units are only meaningful when compared to other distances within the same space.

Following our contextualization approach, we performed linear regression to model the relationship between preprint version count and a preprint’s time to publication. We also performed linear regression to measure the relationship between document embedding distance and a preprint’s time to publication. For this analysis, we retained preprints with negative time in our linear regression models and observed that these preprints had minimal impact on the results. We visualize our version count regression model as a violin plot and our document embedding regression model as a square bin plot.
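The survival analysis described above can be sketched as follows with lifelines. The DataFrame `preprints` and its column names (`category`, `days_to_event`, `published`) are hypothetical stand-ins for the data pulled from bioRxiv's API.

```python
# Sketch of the per-category Kaplan–Meier "half-life" estimate. `preprints` is
# an illustrative pandas DataFrame: `days_to_event` holds days from first
# posting to publication (or to the end of observation if still unpublished),
# and `published` flags whether the publication event was observed.
from lifelines import KaplanMeierFitter

half_lives = {}
for category, group in preprints.groupby("category"):
    kmf = KaplanMeierFitter()
    kmf.fit(durations=group["days_to_event"], event_observed=group["published"])
    # Median survival time: the point at which half of the preprints in this
    # category are expected to have been published.
    half_lives[category] = kmf.median_survival_time_
```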
Building classifiers to detect linguistically similar journal venues and published articles

Preprints are more likely to be published in journals that publish articles with similar content. We assessed this claim by building classifiers based on document and journal representations. First, we removed all journals that had fewer than 100 papers in the PMC corpus. We held out our preprint–published subset (see the section Mapping bioRxiv preprints to their published counterparts) and treated it as a gold standard test set. We used the remainder of the PMCOA corpus for training and initial evaluation of our models.

Training models to identify which journal publishes similar articles is challenging, as not all journals are alike. Some journals publish at most hundreds of papers per year, while others publish 10,000 or more papers per year. Furthermore, some journals focus on publishing articles within a concentrated topic area, while others cover many disparate topics. Therefore, we designed 2 approaches to account for these characteristics. Our first approach focuses on individual articles to account for a journal’s variation in publication topics; it allows topically similar papers to be retrieved independently of their respective journals. Our second approach is centered on journals to account for varying publication rates; it gives more selective or less prolific journals representation equal to that of their high-publishing counterparts.

Our article-based approach identifies the manuscripts most similar to the preprint query, and we evaluate the journals that published these identified manuscripts. We embedded each query article into the space defined by the word2vec model (see the section Constructing a document representation for life sciences text). Once embedded, we selected manuscripts close to the query via Euclidean distance in the embedding space. Once identified, we return these articles along with the journals that published them.

We constructed a journal-based approach to accompany the article-based classifier while accounting for the overrepresentation of journals with high publication rates. For this approach, we identified the most similar journals by constructing a journal representation in the same embedding space. We computed this representation by taking the average embedding of all published papers within a given journal. We then projected a query article into the same space and returned the journals closest to the query using the same distance calculation described above. Both models were constructed using the scikit-learn k-nearest neighbors implementation [54] with the number of neighbors set to 10, as this is an appropriate number for our use case. We consider a prediction to be a true positive if the correct journal appears within our reported list of neighbors, and we evaluate performance using 10-fold cross-validation on the training set along with test set evaluation.
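To make the journal-based classifier concrete, here is a hedged sketch using scikit-learn's NearestNeighbors. `train_embeddings` is a hypothetical DataFrame of document vectors with a `journal` column; the article-based variant works the same way but fits the index on individual article vectors instead of journal centroids.

```python
# Sketch of the journal-based k-nearest-neighbor classifier: one centroid per
# journal (the mean embedding of its published papers), queried with k = 10.
# `train_embeddings` (document vectors plus a `journal` column) is illustrative.
from sklearn.neighbors import NearestNeighbors

centroids = train_embeddings.groupby("journal").mean()  # one row per journal

knn = NearestNeighbors(n_neighbors=10, metric="euclidean")
knn.fit(centroids.values)

def suggest_journals(query_vector):
    """Return the 10 journals whose centroids sit closest to the query article."""
    _, idx = knn.kneighbors(query_vector.reshape(1, -1))
    return centroids.index[idx[0]].tolist()

# A prediction counts as a true positive when the journal that actually
# published the article appears somewhere in this 10-item list.
```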
Web application for discovering similar preprints and journals

We developed a web application that places any bioRxiv or medRxiv preprint into the overall document landscape and identifies topically similar papers and journals (similar to [61]). Our application attempts to download the full-text XML version of any preprint hosted on the bioRxiv or medRxiv server and uses the lxml package to extract text. If the XML version is not available, our application defaults to downloading the PDF version and uses PyMuPDF [62] to extract text from the PDF. The extracted text is fed into our CBOW model to construct a document embedding representation. We pass this representation to our journal and article classifiers to identify journals based on the 10 closest neighbors of individual papers and journal centroids. We implemented this search using the scikit-learn implementation of k-d trees. To run it more cost-effectively in a cloud computing environment with limited available memory, we sharded the index into 4 k-d trees.

The app provides a visualization of the article’s position within our training data to illustrate the local publication landscape. We used SAUCIE [63], an autoencoder designed to cluster single-cell RNA-seq data, to build a two-dimensional embedding space that could be applied to newly generated preprints without retraining, a limitation of other approaches that we explored for visualizing entities expected to lie on a nonlinear manifold. We trained this model on document embeddings of PMC articles that did not have a matching preprint version. We used the following parameters to train the model: a hidden size of 2, a learning rate of 0.001, lambda_b of 0, lambda_c of 0.001, and lambda_d of 0.001, for 5,000 iterations. When a user requests a new document, we project that document onto the generated two-dimensional space, allowing the user to see where their preprint falls within the landscape. We illustrate our recommendations as a shortlist and provide access to our network visualization at our website (https://greenelab.github.io/preprint-similarity-search/).
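The sharded nearest-neighbor lookup can be sketched as below with scikit-learn's KDTree. The shard structure (pairs of a fitted tree and the article identifiers behind it) and the four-way split are stand-ins for the deployment details mentioned above.

```python
# Sketch of splitting document vectors across several k-d tree shards and
# merging query results, so that no single index holds every vector in memory.
import numpy as np
from sklearn.neighbors import KDTree

def build_shards(vectors, article_ids, n_shards=4):
    """Split the document vectors into shards and fit one KDTree per shard."""
    shards = []
    for vec_block, id_block in zip(np.array_split(np.asarray(vectors), n_shards),
                                   np.array_split(np.asarray(article_ids), n_shards)):
        shards.append((KDTree(vec_block), id_block))
    return shards

def nearest_articles(query_vector, shards, k=10):
    hits = []
    for tree, ids in shards:
        dist, idx = tree.query(np.asarray(query_vector).reshape(1, -1), k=k)
        hits.extend((d, ids[i]) for d, i in zip(dist[0], idx[0]))
    hits.sort(key=lambda pair: pair[0])  # merge shards, keep the overall k closest
    return hits[:k]
```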