(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright
------------

Public human microbiome data are dominated by highly developed countries

Richard J. Abdill (Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America), Elizabeth M. Adamowicz, Ran Blekhman

Date: 2022-02

Abstract

The importance of sampling from globally representative populations has been well established in human genomics. In human microbiome research, however, we lack a full understanding of the global distribution of sampling in research studies. This information is crucial to better understand global patterns of microbiome-associated diseases and to extend the health benefits of this research to all populations. Here, we analyze the country of origin of all 444,829 human microbiome samples that are available from the world’s 3 largest genomic data repositories, including the Sequence Read Archive (SRA). The samples are from 2,592 studies of 19 body sites, including 220,017 samples of the gut microbiome. We show that more than 71% of samples with a known origin come from Europe, the United States, and Canada, including 46.8% from the US alone, despite the country representing only 4.3% of the global population. We also find that central and southern Asia is the most underrepresented region: Countries such as India, Pakistan, and Bangladesh account for more than a quarter of the world population but make up only 1.8% of human microbiome samples. These results demonstrate a critical need to ensure more global representation of participants in microbiome studies.

Citation: Abdill RJ, Adamowicz EM, Blekhman R (2022) Public human microbiome data are dominated by highly developed countries. PLoS Biol 20(2): e3001536. https://doi.org/10.1371/journal.pbio.3001536

Academic Editor: Ken Cadwell, New York University School of Medicine, UNITED STATES

Received: September 17, 2021; Accepted: January 11, 2022; Published: February 15, 2022

Copyright: © 2022 Abdill et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data has been deposited at Zenodo.org and is available at https://doi.org/10.5281/zenodo.5351179. This repository also includes the code used for data collection, along with the code used to generate each plot.

Funding: This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (R35GM128716 to RB; https://www.nigms.nih.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: ACCME, African Collaborative Center for Microbiome and Genomics Research; GWAS, genome-wide association study; NCBI, National Center for Biotechnology Information; PRS, polygenic risk score; SRA, Sequence Read Archive

Background

A growing body of research shows that the human microbiome has broad relevance to human health and disease. However, identifying the specific connections between the microbiome and human health requires a broad survey of both human populations and their most common health conditions. Even among healthy individuals, human microbiome composition varies between populations in ways that are still being uncovered: Geography and geographic relocation have been found to influence microbiome composition [1–3], as have host genetic variation and ethnicity [4–6].
Diet [7], lifestyle [8], and patterns in antibiotic use [9] have all been linked to microbiome composition, with other studies considering the influence of locational factors such as pollution [10]. Even within countries, interacting factors such as income, race, and education have critical impacts on health outcomes that could be mediated by the human microbiome [11]. Some microbiome studies have specifically collected and compared data from global sites [12,13], but large gaps and disparities still exist in which microbiomes are being studied on a global scale. The human microbiome has been linked to a growing number of social, medical, and economic factors not directly related to host genetics, which reinforces the urgent need to evaluate the microbiomes of many populations [11,14]. Other genomics fields have developed similar gaps, in which disproportionate attention is paid to the majority populations of wealthy countries: Genome-wide association studies (GWASs), for example, have been primarily conducted in populations with European ancestry [15,16]. As a result, polygenic risk scores (PRSs) from these studies have poorer accuracy when applied to non-European groups, limiting the possible benefits of this research—including personalized medicine, early disease screening, and risk prediction—to European-descended populations [17–19]. There has been a concerted effort in genomics to include non-European individuals in GWAS studies, concurrent with calls to build research infrastructure and capacity globally [16]. It is likewise critical to identify underrepresented populations and locations in both genomics and microbiome research; otherwise, the benefits of host–microbiome research may only extend to a subset of the global population. 
To investigate the geographic distribution of microbiome studies, we used metadata on all human microbiome datasets in the BioSample database, which includes metadata describing samples in the Sequence Read Archive (SRA), DNA Data Bank of Japan, and European Nucleotide Archive [20]. Our data include the country of origin and time of release for more than 444,000 samples, including both 16S amplicon sequencing and shotgun metagenomic sequencing, released over the last 11 years. These samples from the 3 largest genomic databases represent a large majority of all human microbiome samples that have been published.

Discussion

Our results show that the global distribution of human microbiome sampling is heavily skewed toward North American and European populations, both in total samples (Fig 1A) and in samples adjusted for population (Fig 1B). The US is by far the greatest contributor to the database (Table 1), although this is slowly beginning to change as other countries’ contributions grow (Fig 1D). This neglect of most of the world’s population represents a disparity in microbiome research that could limit the health benefits of microbiome research to those countries and populations whose microbiomes have been extensively sampled and studied.

Since only a subset of the world’s populations are currently being studied, the associations between the microbiome and disease may not hold in undersampled populations [26,27]. For example, Gupta and colleagues identified several differences in the microbiome of healthy individuals from various geographic locations and lifestyles across the globe; without a consistent “healthy” microbiome across global populations, identifying microbiome–disease associations is nearly impossible [26]. He and colleagues also found that microbiome-based models for predicting metabolic disease failed when applied to populations outside of the geographical location in which they were developed [27].
Additionally, by only sampling a subset of the global population, the diseases studied in the context of the microbiome are limited to diseases which impact that subset. Helminth parasite infections, for example, are common in tropical and subtropical regions of the world, but rare in North American and European populations. Undersampling of the microbiota from populations where these infections are common has led to a lack of clear understanding of the role of the microbiome in helminth colonization and resistance [28]. To ensure greater global equity in the benefits of microbiome research, many stakeholders—funders, researchers, and journals, to name a few—should consider how to ethically prioritize and incentivize improved global representation of microbiome samples, as they have begun to do in genomics with efforts such as the H3Africa initiative [29]. Others have also highlighted opportunities for growth in the microbiome field, such as developing infrastructure and processes in low-resource settings [30,31], building more comprehensive microbial reference databases, and pursuing more flexible and affordable sequencing technologies [32]. Importantly, this approach should be grounded in benefitting the populations and communities sampled, rather than simply using these microbiomes as a tool to improve health in North American and European countries [33,34]. Ongoing discussion of “helicopter research” (e.g., [35]) sheds light on ethical objections to “solving” research disparities with what essentially becomes charity, rather than collaboration: Researchers from wealthy countries obtain funding to do research in developing countries, “helicopter in” to collect data, then leave to publish their papers [36]. The result is more data from that country, but as part of a project that may not address the problems and priorities of the country under study. 
Local researchers, if they are consulted at all, may be excluded from authorship on the papers that are then hidden behind paywalls, written in a language they may not speak, part of much broader issues in scientific communication [37,38]. Researchers from the so-called “Global North” (as we are) would benefit from deferring to experienced scientists in these countries to find out how to avoid common extractive tropes in imbalanced collaborations (e.g., [35,39]). Research and discussion in other fields may also help scientists trying to build more inclusive research projects: Although there are no easy answers, essays in applied ecology [40–42], ocean science [43], botany [44], geography [45,46], and conservation [47], among many others, deal with the hallmarks and dangers of colonial science [48] and how researchers can change their approach to knowledge production. The reasons for, and solutions to, global disparities in scientific research go far beyond the scope of this paper, and indeed of the microbiome field. There are broader issues of global representation in science that we and others have discussed, for example, in terms of authorship [49], language [37], and the makeup of editorial boards [40]. The complex history and current conditions driving these disparities require a comprehensive assessment of global sociopolitical factors that we, as biologists based in North America, are not able to fully address. However, the necessity of such an assessment as a way to solve these problems illustrates an important possible reason that these problems persist. Most microbiome researchers are not trained in social or political science and lack the appropriate tools to assess and address these problems. The more intentional inclusion of social scientists in microbiome projects may help address not only country-level imbalances, but also remediate harmful conventions used to deal with other issues like race [50].
Despite ongoing challenges, there have been several recent success stories of microbiome initiatives set in, driven by, and focused on countries and populations who have been historically left out of microbiome research. One such example is the recently convened Microbiome Task Force from the H3Africa Consortium; their goals are to harmonize and perform meta-analyses of microbiome data from H3Africa, build capacity and knowledge sharing among members, and provide data analysis support to researchers [51]. The Pan-African Bioinformatics Network (H3ABioNet), which has worked extensively in genomics research capacity building in Africa, also recently hosted a hackathon wherein they began work on a data portal for African microbiome samples [52]. In South America, the Brazilian Microbiome project and the recently proposed Ecuadorian Microbiome project both seek to advance microbiome research capacity in their respective countries and create local infrastructure to support these goals [53,54]. Initiatives such as H3Africa’s African Collaborative Center for Microbiome and Genomics Research (ACCME) [55] may be ideally positioned to make progress in these trends, although as research activity grows in these underrepresented countries, using public metadata may become a less viable measure of these disparities: ACCME’s 2 existing microbiome publications, for example, do not have information about data availability [56,57], and ongoing discussions about issues such as data sovereignty [58] raise important questions about whether making data publicly available is a just and sustainable approach to biomedical research in countries or populations with comparatively little power in the global research ecosystem [59–61]. There are several limitations to our study. 
Metadata quality is the primary hurdle in characterizing samples [62]: For example, our results suggest that data for some microbiome samples are misclassified as “Homo sapiens” data rather than “human metagenome” data, which makes them much more difficult to locate. As a result, some of the countries listed here with zero samples do have microbiome studies that were either submitted to databases that are challenging to access in bulk (e.g., Zenodo) or mislabeled in the SRA. However, the number of these misclassified samples is likely to be minor, and given the magnitude of differences observed in our study, this is unlikely to affect our main results (see Materials and methods). It is also possible that not all samples identified as human in this study are indeed from humans and could, for example, include studies using human gut microbiota transferred into mice. We also did not evaluate differences in host phenotypic information: Most samples are missing even basic information such as sex (77% missing) and age (79% missing), and the most prevalent tag indicating host health status, “host_disease,” is only available for 7.8% of samples (S1 Table). Consequently, we do not have sufficient information to draw conclusions about differences in geographic distribution between “healthy” and “disease” samples. Although disease-specific analysis is beyond the scope of our dataset, it would be interesting to investigate differences in the types of microbiome studies, and the questions they ask, on a global scale: If the human microbiome is generally understudied in a given country, it is likely that diseases prevalent in that country may also be lacking information about microbiome associations. 
We have also limited our database search to 3 databases (SRA, DNA Data Bank of Japan, and European Nucleotide Archive); it is possible that different patterns of global representation are present in other databases, such as MG-RAST [63] and gcMeta [64], although they are orders of magnitude smaller than the NCBI holdings. In addition, as it has been estimated that 20% of microbiome papers do not have publicly available data [65], our study only examines the subset of microbiome studies that also shared their data in the largest international repositories. Samples collected from the same host could occur in longitudinal studies or datasets in which biological replicates were submitted as separate BioSamples, a pattern that is difficult to evaluate across multiple studies that may identify subjects differently, if at all. If longitudinal studies happen more frequently in some regions than others, it is possible that the reported proportions of samples between countries could differ from the proportions of human subjects. However, given the differences in sample numbers between countries, this is unlikely to change the main results from our study. Moreover, since we are using sample collection as a proxy for investment in microbiome research in a given country, the identity of the subject may not be as relevant—indeed, it is likely more costly to perform a longitudinal study with subject follow-up than it is to recruit more subjects for a single sample each. Still, if longitudinal sampling is more common in studies in North America and Europe (which seems likely, given the extensive infrastructure and funding required for following patients long term), it is possible that the gap between the “Global North” and the rest of the world in terms of microbiome sampling is smaller than our results suggest, if we were to count subjects rather than samples. 
However, given the magnitude of the difference between countries in our study, we do not believe repeated sampling from the same individuals in the Global North alone can account for such drastic disparities in sample numbers.

To conclude, we analyzed the geographic origins of almost a half-million samples from the largest genomic repositories in the world. We find evidence that the human microbiome field may be encountering some of the same flaws that arose in human genomics [66,67], in which much of the world is excluded and progress is focused on the priorities of the wealthy. The field would benefit from a more global perspective on investigating the human microbiome’s relationship to health and disease.

Materials and methods

A list of samples was exported from the NCBI BioSample database (https://www.ncbi.nlm.nih.gov/biosample) using the search string “txid408170[Organism] AND biosample sra[filter] AND "public"[filter]”, which requests all samples classified under the “human gut metagenome” category in the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi). The resulting sample IDs and all associated tags were loaded into a PostgreSQL database. We repeated this for all categories described as human metagenomes (Table 2). We note that the term “human gut metagenome” does not describe the sequencing technique used to generate the microbiome data: samples in these categories include both shotgun metagenomic and amplicon sequencing. Specifically, 301,700 samples (72.0%) are associated with sequencing runs that list the library strategy as “AMPLICON.” We then looked in other NCBI categories nested beneath the “organismal metagenomes” category that were not explicitly labeled “human” but were likely to contain some human samples [68].
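As a rough illustration of the BioSample query described in this section, the search string can be parameterized by taxonomy ID and turned into a request URL for the NCBI E-utilities `esearch` endpoint. This is a sketch under stated assumptions, not the authors' collection code (which is deposited in the Zenodo repository): the text says only that the sample list was "exported" from BioSample, so the use of `esearch` for this step, the function names, and the paging parameters chosen here are all assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def biosample_search_term(taxid: int) -> str:
    """Build the BioSample search string quoted in the text for one taxonomy ID."""
    return f'txid{taxid}[Organism] AND biosample sra[filter] AND "public"[filter]'


def esearch_url(taxid: int, retstart: int = 0, retmax: int = 500) -> str:
    """Return an esearch URL for one page of matching BioSample IDs.

    The paging values are illustrative; no request is sent here.
    """
    params = {
        "db": "biosample",
        "term": biosample_search_term(taxid),
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # 408170 is the "human gut metagenome" taxonomy ID used in the text.
    print(biosample_search_term(408170))
    print(esearch_url(408170))
```

Repeating the call with the taxonomy IDs of the other human metagenome categories (Table 2) would mirror the per-category collection the text describes.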
We downloaded the metadata for samples classified under any NCBI category that was the “generic” version of a human one we had already collected; the “blood metagenome” category is the generic version of the “human blood metagenome” category, for example (S5 Table). We downloaded all sample data for any generic categories that had at least 1,000 samples, then evaluated the metadata to find which samples indicated they were taken from a human host. To do this, we used the value of the “host_taxid” field or, if that was blank, the value of “host,” to create a putative “host” value, and manually flagged any that explicitly indicated the sample was from a human: references to “human” or “Homo sapiens,” for example, or if the host included words such as “patient” or “crew member” and did not indicate another species. We evaluated 4,395 unique “host” values for 173,038 samples and found 501 values assigned to 29,934 samples (17.3%) that indicated the host was a human. These were also included in the analysis. The sample data were collected between April and June 2021; to minimize the effect of collecting some body sites after others, only samples dated prior to 2021 were included here.

We then used the NCBI eUtils API to find “runs” associated with each sample, so we could ensure all the BioSamples were associated with actual sequencing data. In the NCBI system, “runs” are the entities associated with sequencing data. We also used this API to obtain information on publication date, library strategy, and the dates on which samples became publicly available. This resulted in a collection of 444,829 samples across 19 body sites (Table 2) after removing several hundred samples that were missing dates or sequencing data.

Representation proportions

To determine which countries were over- or underrepresented relative to their populations, we obtained the 2020 population estimates for all countries as estimated by the United Nations [23].
We used this to calculate 2 percentages for each country, one for the country’s share of the global population and another for the country’s share of human microbiome samples. We then calculated a representation index: For countries with a higher sample percentage than population percentage, we divided the former by the latter to obtain a number indicating how many times more samples are present than expected. For countries with a lower sample percentage than population percentage, we took the negative reciprocal of this ratio, which indicates (as a negative number) the factor by which the sample count would need to be multiplied to be proportionally representative. This interim result leaves overrepresented countries with positive scores and underrepresented countries with negative scores. After removing the scores for countries with 50 or fewer samples, we scaled the positive scores to fall between 0 and 100 and separately scaled the negative scores to fall between 0 and −100. We then plotted these on the map using the “log 10” transformation to add more variation in the color coding for the countries with middling scores.

For the regional calculations (Fig 1C and 1D), we used top-level classifications from the same United Nations document. Antarctica is not included in a region, so those samples were added to the “Unknown” category for region-level calculations.

To better understand gaps in what data may be available outside of the large centralized repositories evaluated here, we selected several countries with zero attributed samples and did a literature search to determine whether human microbiome studies had been performed there and, if so, where the data are stored.
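The representation index described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code (which is deposited in the Zenodo repository): the function names, the treatment of exactly equal percentages as a score of 1, and the rejection of zero values are assumptions, and the final rescaling to the 0 to 100 and 0 to −100 ranges is left out because the text does not specify how it was done. The example inputs are the shares quoted in the abstract, with 25% standing in for "more than a quarter" of the world population.

```python
from typing import Optional

# Per the text, scores for countries with 50 or fewer samples are removed.
MIN_SAMPLES = 50


def representation_index(sample_pct: float, population_pct: float) -> float:
    """Signed representation ratio, before any rescaling.

    Positive: how many times more samples than the population share predicts.
    Negative: the negative reciprocal of that ratio, for underrepresentation.
    """
    if sample_pct <= 0 or population_pct <= 0:
        raise ValueError("both percentages must be positive")
    ratio = sample_pct / population_pct
    # Assumption: a ratio of exactly 1 is reported as +1.
    return ratio if ratio >= 1 else -1.0 / ratio


def country_index(sample_count: int, total_samples: int,
                  population: float, world_population: float) -> Optional[float]:
    """Index for one country, or None if it falls under the sample cutoff."""
    if sample_count <= MIN_SAMPLES:
        return None
    sample_pct = 100 * sample_count / total_samples
    population_pct = 100 * population / world_population
    return representation_index(sample_pct, population_pct)


if __name__ == "__main__":
    # US: 46.8% of samples with known origin, 4.3% of global population.
    print(round(representation_index(46.8, 4.3), 1))   # 10.9
    # Central and southern Asia: 1.8% of samples, roughly 25% of population.
    print(round(representation_index(1.8, 25.0), 1))   # -13.9
```

Keeping the sign separate from the magnitude is what allows the two directions to be rescaled independently afterward, as the text describes.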
For example, we could not confirm any samples available from Kazakhstan (population 18.7 million) in central Asia, but a human gut microbiome study from there was published in 2020 [69]; its raw sequencing data (but no phenotypic information) are available on Zenodo, a scientific data repository with many submissions but no way of searching for samples or projects. Another Kazakhstan microbiome study [70] is linked to publicly available sequencing data (BioProject PRJEB17632), but with incorrect metadata: Samples are classified as human sequencing data, rather than metagenomic, an issue addressed directly in the SRA submission instructions [71]. In addition, geolocation metadata were submitted but listed the country of origin as Germany, the location of the senior author (and presumably the sequencing center), rather than Kazakhstan, the geographical source of the samples, as requested by NCBI [22], although instructions can differ between repositories [62]. A study in Honduras (population 9.9 million) includes SRA data with accurate geolocation information (BioProject PRJEB31759), but the samples were again classified under “Homo sapiens” rather than “human metagenome” [72].

Visualization

All figures were made using R and the ggplot2 package [73]. Maps use the Equal Earth projection [74] and the rnaturalearth R package [75].

Acknowledgments

We thank the members of the Blekhman lab for helpful discussions and the Minnesota Supercomputing Institute for providing computational resources.

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001536