(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright
------------
The characteristics of early-stage research into human genes are substantially different from subsequent research

Thomas Stoeger. Affiliations: Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois, United States of America; Northwestern Institute on Complex Systems (NICO); Center for Genetic Medicine.

Date: 2022-02

Throughout the last 2 decades, several scholars observed that present day research into human genes rarely turns toward genes that had not already been extensively investigated in the past. Guided by hypotheses derived from studies of science and innovation, we present here a literature-wide data-driven meta-analysis to identify the specific scientific and organizational contexts that coincided with early-stage research into human genes throughout the past half century. We demonstrate that early-stage research into human genes differs in team size, citation impact, funding mechanisms, and publication outlet, but that generalized insights derived from studies of science and innovation only partially apply to early-stage research into human genes. Further, we demonstrate that, presently, genome biology accounts for most of the initial early-stage research, while subsequent early-stage research can engage other life sciences fields. We therefore anticipate that the specificity of our findings will enable scientists and policymakers to better promote early-stage research into human genes and increase overall innovation within the life sciences.

Funding: TS was supported by a grant of the National Institute on Aging (K99AG068544).
LANA was supported by grants of the National Science Foundation (1956338), Air Force Office of Scientific Research (FA9550-19-1-0354), National Institute of Allergy and Infectious Diseases (U19AI135964) and Simons Foundation (DMS-1764421), and a gift by John and Leslie McQuown. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

A stream of research [1–12] has now established that research into human genes currently investigates largely those genes that were already well studied in the past. Curiously, this narrow focus can only partially be explained by the relevance of these genes toward human health or physiology [9,13,14]. Rather, research suggests that this narrow focus can be primarily explained by physical, chemical, and biological properties of gene products that facilitated pregenomic investigations [11]. This insight is neither unique to genes nor surprising as science is a difficult and highly competitive endeavor, and researchers can therefore find themselves picking “the lowest-hanging fruit” within their chosen scientific domain [15,16].

Research directed toward human health has historically received high levels of financial and political support [17,18]. Along these lines, and irrespective of research into genes, the former president of the European Research Council postulated that research directions will only continue to receive stable public support if they align with societal goals [19]. We therefore anticipate that research toward little characterized genes could present researchers and policymakers with novel opportunities for aligning their efforts with societal goals, particularly if they concern genes implicated in human disease.
For instance, we recently reported that over half of the human host genes relevant to Coronavirus Disease 2019 (COVID-19) lie outside of past patterns of scientific inquiry and therefore have remained uncharacterized in the months that followed the manifestation of COVID-19 as a global health burden [20]. Likewise, distinct research groups reported that genome-wide datasets identify genes likely to be important for cancer and neurological diseases, but that those genes have otherwise not been characterized [9,11,14]. The narrowness of the focus of life sciences research on human genes described above also raises practical concerns. For instance, current biological knowledge might be incomplete for many genes and processes [14], which may hinder the discovery of important associations. Indeed, gene ontology enrichment analyses could only reliably retrieve the well-known—since the 1960s [21]—association between cancer and the cell cycle from the late 2000s onward because only a small fraction of genes have been actively investigated. Further, studies of protein–protein interaction networks suggest that, physiologically, more essential genes also encode for proteins with more interaction partners. However, a recent study reported no evidence for such an association [22] after accounting for the narrow focus of the life sciences on a subset of genes through alternate experimentation or added normalization [23]. Partially echoing these societal and practical issues, several researchers and policymakers started efforts to promote research into little characterized genes. For instance, an international consortium, supported by the largest funding agencies of Europe and the United States, called for the establishment of a deep genome project to characterize all human genes and their orthologous murine genes [24]. 
Beyond such calls for actions, the largest funding source within the life sciences, the National Institutes of Health (NIH) of the US, already established a dedicated program to promote research into little characterized genes [10] and started to solicit funding opportunities that are restricted to genes that appear to be undercharacterized. The partial success of the underlying policies is already measurable as a recent bibliometric study found that during the most recent years, research into novel gene targets received a disproportionally high level of support from the NIH when compared to other sources of funding [25]. We believe that by understanding the factors that in the past have resulted in early-stage research on novel genes, we will be better placed to facilitate the design of initiatives and policies that support research into new or distinct sets of genes. While we can anticipate generalizing studies of science and innovation [26] to be helpful—particularly if those insights stem from studies based on life sciences research [27]—it remains unclear to which extent they offer appropriate guidance when studying genes and their encoded molecular products. Specifically, research into human protein-coding genes might be distinct from most other domains of scientific inquiry as scientists have been aware of the nearly complete [28] space of all possibly explorable entities after the advanced draft of the human genome sequence was published in 2001 [29–31]. Further, general insights into science and innovation may not suffice to identify the specific contexts that have enabled early-stage research into genes. This realization motivated us to revisit the past 50 years of research into human genes through discipline-wide datasets and delineate the contexts under which early-stage research into genes occurred. 
Recognizing the value and ideas of earlier studies into science and innovation, we will, throughout our manuscript, introduce such earlier studies while pursuing hypotheses postulated or derived from them. Complementing these hypothesis-driven inquiries, we will also direct statistical approaches toward the available literature on human protein-coding genes. Briefly, we will find through hypothesis-driven inquiries that early-stage research is produced by larger scientific teams, mentions more genes in titles or abstracts, and accrues more citations. However, its citation dynamics unexpectedly do not reveal a trade-off between high risks and high rewards. Nor is early-stage research separate from clinical trials. Further, we demonstrate that distinct phases of early-stage research affect citation patterns in the scientific literature differently. Through statistical analyses, we will then pinpoint the initial stage of early-stage research toward genome biology and reveal that subsequent early-stage research can be accounted for by a handful of medically important genes that have recently become investigated by scientific fields studying obesity and age-related neurological disease.

Characteristics of publication outlets for early-stage research

Our preceding bibliometric analyses demonstrate that early-stage research into human genes is produced under distinct conditions and that it is distinctively received by fellow scientists. However, the preceding analysis does not identify the specific contexts under which early-stage research into genes is produced or whether these contexts change over time. We begin our identification of contexts by inspecting scientific journals and focusing on the period 2010 to 2018. While editors working for scientific journals appear to personally value novelty [44], the peer-reviewing process within scientific journals had been postulated to limit novelty [45–47].
First, we focus on those journals that published research on novel gene targets (Fig 3A, S7 Fig). We find an overall trend (gray dots) suggesting that the number of research publications reporting on novel gene targets scales with the number of research publications studying old gene targets (Spearman: 0.49, S7 Fig). This means that, as a first approximation, early-stage research into novel gene targets tends to appear in journals that publish many papers on human genes.

Fig 3. Generalist journals and journals for genome research and age-related neuropathies associate with early-stage research in the period of 2010 to 2018. (A) Enrichment (log2 of ratio) for publications with at least 1 new gene target (never highlighted in a preceding year; marked as 0-year genes). Each circle represents an individual journal; red (blue) circles indicate journals that significantly enrich (deplete) for publications on novel gene targets at Bonferroni multiple testing corrected Fisher exact test with p < 0.01. (B) Enrichment for publications on recent gene targets (first highlighted 1 to 5 years before). (C) Comparison of fold enrichment for journals that are significantly enriched or depleted for new gene targets (panel A) and/or recent gene targets (panel B). Note that for any given journal, the enrichment may not be statistically significant for both axes. Notably, the light cyan quadrant shows journals that are depleted from early-stage research. Note also that circles for the journals Neurocase and Amyotroph Lateral Scler Frontotemporal Degener are plotted “outside of the range of the axis” because they do not have any publication on new gene targets. Journals above dashed line enrich for recent gene targets at least twice as much as they do for novel gene targets. Combines data from MEDLINE, NCBI gene and taxonomy information, gene2pubmed, and PubTator.
For data underlying the figure, see https://doi.org/10.21985/n2-b5bm-3b17. https://doi.org/10.1371/journal.pbio.3001520.g003

Nonetheless, we find that some journals enrich for research publications on novel gene targets to an extent that is higher than anticipated by chance (red dots) (Fig 3A, S7 Fig). Manually inspecting their identity from the 1990s onward (S7 Fig), we believe we can recognize 2 groups of journals. The first group consists of journals dedicated primarily to genome biology, whereas the second consists of journals that target an interdisciplinary audience and, in the cases of Nature, Science and PNAS, also publish research beyond the life sciences. We conclude that these 2 groups of journals are particularly effective at attracting innovative research into human genes.

Next, we focus on research publications on gene targets that were first highlighted in the 5 preceding years (Fig 3B, S7 Fig). We again find that some journals enrich for publications on early-stage research on gene targets to an extent that is higher than anticipated by chance. Again, many of those journals are directed toward genome biology, whereas a second group of journals has a greater disciplinary focus. Several of the journals in this second group relate to neurobiology and obesity (Fig 3B). In contrast to the analysis of the enrichment for novel gene targets, when considering early-stage research, we find that a few journals deplete for publications on early-stage research on gene targets to an extent that is higher than anticipated by chance (cyan dots) (Fig 3B, S7 Fig). Most seem to focus on cancer biology, but a noteworthy exception is the Genetics and Molecular Research journal (Genet Mol Res), which had been listed on Beall’s list of potential predatory journals [48].
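The journal-level enrichment described above (a log2 ratio combined with a Bonferroni-corrected Fisher exact test) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the journal names and counts are hypothetical, the function names `fisher_exact_two_sided` and `journal_enrichment` are invented here, and the ratio is assumed to compare a journal's share of publications with a new gene target against the share across all journals, which the text does not spell out.

```python
from math import comb, log2


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def journal_enrichment(counts, alpha=0.01):
    """counts maps journal -> (publications with a new gene target, publications without)."""
    all_new = sum(new for new, _ in counts.values())
    all_old = sum(old for _, old in counts.values())
    baseline = all_new / (all_new + all_old)  # assumed reference share
    results = {}
    for journal, (new, old) in counts.items():
        # Compare this journal against all remaining journals.
        p = fisher_exact_two_sided(new, old, all_new - new, all_old - old)
        p_corrected = min(1.0, p * len(counts))  # Bonferroni correction
        share = new / (new + old)
        log2_ratio = log2(share / baseline) if share > 0 else float("-inf")
        if p_corrected >= alpha:
            label = "not significant"
        else:
            label = "enriched" if log2_ratio > 0 else "depleted"
        results[journal] = (log2_ratio, p_corrected, label)
    return results


# Hypothetical counts, for illustration only.
toy_counts = {
    "Journal A": (30, 70),
    "Journal B": (2, 198),
    "Journal C": (10, 190),
}
for name, (ratio, p_adj, label) in journal_enrichment(toy_counts).items():
    print(f"{name}: log2 ratio = {ratio:.2f}, corrected p = {p_adj:.2g}, {label}")
```

In this sketch, a journal without any publication on new gene targets receives a log2 ratio of negative infinity, which corresponds to the circles that the caption of Fig 3 describes as plotted outside of the range of the axis.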
Comparing the extent to which journals enrich for newly highlighted versus recently highlighted genes, we find that journals dedicated to neurobiology and obesity are enriching for recently highlighted genes to a larger extent than their enrichment for newly highlighted genes (Fig 3C). We conclude that while there are strong similarities for the publishing contexts that surround newly and recently highlighted genes, some research fields might be more receptive toward publishing research publications on gene targets that had been identified recently. Expanding on the latter insight, we will return to research into genome biology, neurobiology, and obesity in later sections of this manuscript.

Characteristics of institutional support for early-stage research

We round off our identification of specific contexts that may support early-stage research by turning toward institutions. We consider funding agencies, distinct funding mechanisms, and NIH-supported research organizations (Fig 4, S8–S11 Figs).

Fig 4. Institutional support for early-stage research associates with genome biology in the period of 2010 to 2018. (A) Enrichment (log2 of ratio) for the support of specific funding agencies in publications highlighting at least 1 recent gene target (first highlighted 1 to 5 years before). Each circle represents an individual journal; red (blue) circles indicate journals that significantly enrich (deplete) for publications on novel gene targets at Bonferroni multiple testing corrected Fisher exact test with p < 0.01. (B) Enrichment for author affiliation to research organizations in publications highlighting at least 1 recent gene target. (C) Comparison of fold enrichment for funding agencies that are significantly enriched or depleted for new gene targets or recent gene targets.
(D) Comparison of fold enrichment for research organizations that are significantly enriched or depleted for new gene targets or recent gene targets. Note that for any given funding agency or research organization, the enrichment may not be statistically significant for both axes. Notably, the light cyan quadrant shows funding agencies and research organizations that are depleted from early-stage research. Combines data from MEDLINE, NCBI gene and taxonomy information, gene2pubmed, PubTator, and ExPORTER. For data underlying the figure, see https://doi.org/10.21985/n2-b5bm-3b17. https://doi.org/10.1371/journal.pbio.3001520.g004

The contexts implicated through funding agencies resemble those presented among journals, with the support for early-stage research between 2010 and 2018 preferentially stemming from genome biology (NIH’s National Human Genome Research Institute [NHGRI], one of several different agencies within the NIH), generalist funding agencies (NIH’s National Institute of General Medical Sciences [NIGMS], European Research Council, etc.), or funders of age-related neuropathies (Motor Neurone Disease Association and NIH’s National Institute on Aging [NIA]) (Fig 4A–4C, S8 Fig). In line with a prior study that reported that Howard Hughes Medical Institute Investigators may be more willing to change their research focus [27], we also find that publications supported by the Howard Hughes Medical Institute are enriched for early-stage research (Fig 4A–4C, S8 Fig). Finally, we also observe that publications authored by NIH intramural scientists are enriched for early-stage research.

We find that preferences of funding agencies toward early-stage research can be dynamic (S8 Fig).
While the NIH’s NHGRI and NIGMS have consistently supported researchers producing publications enriched for early-stage research, the NIH’s National Cancer Institute (NCI) supported researchers producing publications enriched for early-stage research in the 1980s and 1990s, but since 2010 has been supporting researchers producing publications depleted for early-stage research (S8 Fig). This change contrasts with the funding patterns of Cancer Research UK, which has consistently supported researchers producing publications enriched for early-stage research (Fig 4B–4D, S8 Fig). This shows that the focus on early-stage research among domain-specific funding agencies can differ across countries and can change over time as those agencies reassess and reprioritize the areas of research they will pursue.

Funding agencies can allocate their funds through a plethora of distinct mechanisms, which vary in the scope of eligible scientists and research organizations, the duration of the support, the amount of funding, and the objective of the supported research. Because of data availability, we will again focus on the NIH, which informs on its different funding mechanisms through “activity codes.” Overall, we find 112 different activity codes to have been used between 2010 and 2018, of which only 8 support publications that significantly enrich for early-stage research (S9 and S10A Figs). These activity codes also appear marginally more frequently deployed by those institutes of the NIH that disproportionally enrich for early-stage research (Mann–Whitney U: 0.05) (S10B Fig). Among these 8 activity codes, 3 relate to NIH’s intramural research program (N01, ZIA, ZIB), 2 relate to research activities funded through contracts rather than research grants (U01, Z01), another relates to funding directed toward the career development of PostDocs (F32), and another is directed toward supporting high impact interdisciplinary science (RC2).
The last activity code supporting publications enriching for early-stage research is the R01, the most common activity code (S9 Fig). While the enrichment is statistically significant for R01-type grants due to the very large number of observations, the magnitude of the enrichment is minuscule (6%). We thus conclude that only a minority of funding mechanisms preferentially supports early-stage research and that, essentially, only 2 of them correspond to funding that is allocated through extramural research grants.

Concluding our analysis of the research contexts under which early-stage research occurs, we investigate research organizations that are recipients of NIH grants (Fig 4B–4D, S11 Fig). The strongest enrichments for early-stage research are seen in a group of smaller research organizations, namely Geisinger Health Systems, the New England Deaconess Hospital, and the Icelandic Heart Association. Manually inspecting their early-stage publications from 2010 onward, we see that each publication describes either a genome-wide association study, its verification, or a meta-analysis of association studies. Among the larger research organizations, we see (filed under 2 different names) the University of Washington, which historically played a prominent role in the Human Genome Project [49], and the Broad Institute, which, according to its mission statement, “was founded in 2004 to fulfill the promise of genomic medicine” [50]. Complementing the above research organizations, we also see the intramural branches of the NIH’s NHGRI and NIA. Our unbiased analyses of publication outlets, funding agencies, and research organizations thus all point toward genome biology as disproportionally promoting early-stage research into human genes.
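The R01 observation, a statistically significant enrichment whose magnitude is only 6%, reflects how very large samples make small effects detectable. The sketch below illustrates the point with a two-proportion z-test, a normal approximation that is swapped in here for the Fisher exact test named in the figure captions. All counts are hypothetical and the function name `two_proportion_z_test` is invented for this illustration.

```python
from math import erfc, sqrt


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (fold enrichment, two-sided p-value) for hits_a/n_a versus hits_b/n_b."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    # Standard error of the difference under the null of equal proportions.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return p_a / p_b, p_value


# Hypothetical counts: the same 6% enrichment at two sample sizes.
fold_small, p_small = two_proportion_z_test(106, 1_000, 100, 1_000)
fold_large, p_large = two_proportion_z_test(10_600, 100_000, 10_000, 100_000)

print(f"small sample: fold = {fold_small:.2f}, p = {p_small:.3g}")
print(f"large sample: fold = {fold_large:.2f}, p = {p_large:.3g}")
```

Both calls use the same fold enrichment, so any difference in the p-value comes from the number of observations alone. This is why the text reports the magnitude of the enrichment alongside its significance.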
Domain-specific research into a handful of genes contributes to early-stage research

While approaches used within genome biology are well positioned to identify novel gene targets [29] (Fig 5A), we previously observed that early-stage research on recent gene targets differs from early-stage research on novel gene targets by further extending to journals dedicated to neurobiology and obesity (Fig 3). This prompts us to ask whether recent gene targets that are of interest to specific domains of biology become the focus of greater interest than other recent gene targets.

We again focus here on the period of 2010 to 2018 for timeliness, but similar analyses for other periods are included in S17 Fig. During this time period, we find that 3,943 newly highlighted genes could have been investigated as recent gene targets (Fig 1A). A total of 1,656 (42%) were never highlighted during the same time period, and 885 (22%) are highlighted only once more. By contrast, and supporting our hypothesis, we find that a small number of recent gene targets become the focus of dramatically greater interest. Specifically, the 40 top ranked genes represent a little over 1% of the newly highlighted genes, but 21% of all unique pairs of recent gene targets and publications (Fig 6A, S17 Fig).

Fig 6. Early-stage research focuses on a handful of genes and is driven by specific domains. (A) Cumulative distribution of pairs of highlighted genes and 2010 to 2018 publications highlighting recent gene targets (first highlighted 1 to 5 years before). The x-axis ranks genes by number of publications highlighting them, with rank 0 corresponding to the gene highlighted in the most publications. The y-axis tallies the percentage of all pairs of highlighted genes and 2010 to 2018 publications highlighting recent gene targets with a lower or equal rank.
We emphasize in orange the top 1% of genes (40 genes), which together account for 21% of all pairs. Note that for simplicity of data representation, we allow for overlapping and shifting time windows. For example, a gene that was first highlighted 5 years prior to the start of the indicated decade would only be represented by a single year in our analysis. (B) Letter plot of all highlighted genes, reporting for each gene the total number of publications until 2018. Shown are genes that were still a recent target during the 2010 to 2018 period but were not in the top 40 (“Not top 40,” blue), genes that were in the top 40 (“Top 40,” orange), and “other” genes (gray). Area of boxes indicates share of values, with heights of boxes following letter style dimensions (innermost boxes contain 25 to 75 percentiles of values and subsequent boxes 12.5 to 87.5 percentiles of values, etc.). Note that for 2010 to 2018, “other” is practically synonymous with genes first highlighted before 2005. Note further that some of the genes of the “Not top 40” category may still accrue more publications until 2018, the final year of our analysis, given the presence of shifting time windows. (C) Journals that enrich for top recent gene targets preferentially highlight top 40 genes. Percentage of pairs of publications and recent gene targets that fall onto the top 40 recently highlighted genes. Journals are considered to disproportionally enrich for recent gene targets if they enrich for recent gene targets at least twice as much as they enrich for novel gene targets (above dashed line in Fig 3C). (D) Different top 40 genes are highlighted in publications from journals dedicated to different fields. The heatmap plot shows the percentage of publications highlighting a given recently highlighted gene from among all publications in the given journal.
Shown are journals that highlight recent gene targets and enrich for recent gene targets at least twice as much as they enrich for novel gene targets (above dashed line in Fig 3C). Genes and journals are ordered and grouped using Ward clustering. Combines data from MEDLINE, NCBI gene and taxonomy information, gene2pubmed, and PubTator. For data underlying the figure, see https://doi.org/10.21985/n2-b5bm-3b17. https://doi.org/10.1371/journal.pbio.3001520.g006 This small subset of top ranked recently highlighted genes defies the general trend of biomedical research rarely turning to new gene targets (Fig 1B and 1C, S3 Fig) [14]. By contrast, this unusual group of 40 genes (0.2% of all 19,171 human protein-coding genes) have already been highlighted in more publications than most human genes (Fig 6B). The most frequently highlighted genes within this group are FTO, C9orf72, and PALB2. By 2018, they had been highlighted in the title and abstract of more publications than 98%, 97%, and 94%, respectively, of protein-coding genes. These 3 genes strongly associate with obesity [51], dementia [52,53], and cancer [54,55], respectively. Interestingly, while FTO and C9orf72 were initially discovered through genome-wide association studies (and thus through genome biology), PALB2 was initially discovered through its biochemical interaction with BRCA2, a prominent cancer gene [56]. These results highlight the potential biological importance of novel gene targets and the potential to become heavily investigated within a few years. Finally, we return to our observation of neurological and obesity-related journals enriching for early-stage gene targets to an extent that exceeds their enrichment on novel gene targets (10 journals above dashed line, Fig 3C). 
Plotting the share of their targets that fall onto the 40 top ranked genes, we find that it ranges between 32% and 100% and thus significantly exceeds the share observed among other journals that enrich for recent gene targets (Fig 6C). To identify whether these journals focus on similar gene targets, we next cluster these journals according to the subset (15) of the 40 top ranked genes that have been highlighted by them at least once (Fig 6D). We recognize 3 groups. Seven journals, all dedicated to neurology, primarily include C9orf72 as a recent gene target. Two journals about obesity primarily include FTO as a research target, and to a lesser extent, 5 other genes linked to obesity [57]. Finally, one cluster is formed by a single journal about vision, which primarily highlights ARMS2, a gene implicated in age-related maculopathy. If we remove the 40 top ranked genes from our analysis, only 2 of these journals (Brain and Obesity) remain enriched for recent gene targets. In summary, we conclude that domain-specific research can contribute to early-stage research, but primarily does so through a handful of genes.

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001520
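The concentration statistic reported for Fig 6A (the share of all unique gene and publication pairs that falls onto the top ranked genes) can be computed as sketched below. This is a minimal illustration rather than the authors' code: the gene labels and publication identifiers are hypothetical placeholders, and the function names `cumulative_pair_share` and `top_gene_share` are invented here.

```python
from collections import Counter


def cumulative_pair_share(pairs):
    """Cumulative share of unique (gene, publication) pairs, genes ranked by publication count.

    Index 0 of the returned list corresponds to rank 0, the most highlighted gene.
    """
    unique_pairs = set(pairs)  # count each gene-publication pair only once
    per_gene = Counter(gene for gene, _ in unique_pairs)
    total = len(unique_pairs)
    shares, running = [], 0
    for _, count in per_gene.most_common():
        running += count
        shares.append(running / total)
    return shares


def top_gene_share(pairs, top_n):
    """Share of unique pairs that fall onto the top_n most highlighted genes."""
    shares = cumulative_pair_share(pairs)
    return shares[min(top_n, len(shares)) - 1]


# Hypothetical pairs of (gene label, publication identifier), for illustration only.
toy_pairs = [
    ("GENE_A", 1), ("GENE_A", 2), ("GENE_A", 3),
    ("GENE_B", 4),
    ("GENE_C", 5),
    ("GENE_A", 1),  # duplicate pair, ignored
]
print(cumulative_pair_share(toy_pairs))
print(top_gene_share(toy_pairs, top_n=1))
```

Applied to the real data with `top_n=40`, a function of this kind would yield the kind of share that the text reports as 21%, assuming that pairs are deduplicated and genes are ranked as described in the caption of Fig 6A.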