(C) PLOS One. This story was originally published by PLOS One and is unaltered. A data compendium associating the genomes of 12,289 Mycobacterium tuberculosis isolates with quantitative resistance phenotypes to 13 antibiotics. The CRyPTIC Consortium, University of Oxford, Oxford, United Kingdom. Date: 2022-08. Abstract The Comprehensive Resistance Prediction for Tuberculosis: an International Consortium (CRyPTIC) presents here a data compendium of 12,289 Mycobacterium tuberculosis global clinical isolates, all of which have undergone whole-genome sequencing and have had their minimum inhibitory concentrations to 13 antitubercular drugs measured in a single assay. It is the largest matched phenotypic and genotypic dataset for M. tuberculosis to date. Here, we provide a summary detailing the breadth of data collected, along with a description of how the isolates were selected, collected, and uniformly processed in CRyPTIC partner laboratories across 23 countries. The compendium contains 6,814 isolates resistant to at least 1 drug, including 2,129 samples that fully satisfy the clinical definitions of rifampicin resistant (RR), multidrug resistant (MDR), pre-extensively drug resistant (pre-XDR), or extensively drug resistant (XDR). The data are enriched for rare resistance-associated variants, and the current limits of genotypic prediction of resistance status (sensitive/resistant) are presented by using a genetic mutation catalogue, along with the presence of suspected resistance-conferring mutations for isolates resistant to the newly introduced drugs bedaquiline, clofazimine, delamanid, and linezolid. Finally, a case study of rifampicin monoresistance demonstrates how this compendium could be used to advance our genetic understanding of rare resistance phenotypes. The data compendium is fully open source and it is hoped that it will facilitate and inspire future research for years to come. 
Citation: The CRyPTIC Consortium (2022) A data compendium associating the genomes of 12,289 Mycobacterium tuberculosis isolates with quantitative resistance phenotypes to 13 antibiotics. PLoS Biol 20(8): e3001721. https://doi.org/10.1371/journal.pbio.3001721 Academic Editor: Jason Ladner, Northern Arizona University, UNITED STATES Received: April 6, 2022; Accepted: June 21, 2022; Published: August 9, 2022 Copyright: © 2022 The CRyPTIC Consortium. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Data are available from ftp.ebi.ac.uk/pub/databases/cryptic/release_june2022/ The FTP site contains two top level directories: “reuse” and “reproducibility”. All data for this study were analysed and visualised using either R or python3 libraries and packages. See github.com/kerrimalone/Brankin_Malone_2022 for codebase. Funding: This work was supported by Wellcome Trust/Newton Fund-MRC Collaborative Award (200205/Z/15/Z); and Bill & Melinda Gates Foundation Trust (OPP1133541). Oxford CRyPTIC consortium members are funded/supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC), the views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health, and the National Institute for Health Research (NIHR) Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, a partnership between Public Health England and the University of Oxford, the views expressed are those of the authors and not necessarily those of the NIHR, Public Health England or the Department of Health and Social Care. J.M. is supported by the Wellcome Trust (203919/Z/16/Z). Z.Y. is supported by the National Science and Technology Major Project, China Grant No. 2018ZX10103001. 
K.M.M. is supported by EMBL’s EIPOD3 programme funded by the European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska Curie Actions. T.C.R. is funded in part by funding from Unitaid Grant No. 2019-32-FIND MDR. R.S.O. is supported by FAPESP Grant No. 17/16082-7. L.F. received financial support from FAPESP Grant No. 2012/51756-5. B.Z. is supported by the National Natural Science Foundation of China (81991534) and the Beijing Municipal Science & Technology Commission (Z201100005520041). N.T.T.T. is supported by the Wellcome Trust International Intermediate Fellowship (206724/Z/17/Z). G.T. is funded by the Wellcome Trust. R.W. is supported by the South African Medical Research Council. J.C. is supported by the Rhodes Trust and Stanford Medical Scientist Training Program (T32 GM007365). A.L. is supported by the National Institute for Health Research (NIHR) Health Protection Research Unit in Respiratory Infections at Imperial College London. S.G.L. is supported by the Fonds de Recherche en Santé du Québec. C.N. is funded by Wellcome Trust Grant No. 203583/Z/16/Z. A.V.R. is supported by Research Foundation Flanders (FWO) under Grant No. G0F8316N (FWO Odysseus). G.M. was supported by the Wellcome Trust (098316, 214321/Z/18/Z, and 203135/Z/16/Z), and the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation (NRF) of South Africa (Grant No. 64787). The funders had no role in the study design, data collection, data analysis, data interpretation, or writing of this report. The opinions, findings and conclusions expressed in this manuscript reflect those of the authors alone. L.G. was supported by the Wellcome Trust (201470/Z/16/Z), the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number 1R01AI146338, the GOSH Charity (VC0921) and the GOSH/ICH Biomedical Research Centre (www.nihr.ac.uk). A.B. 
is funded by the NDM Prize Studentship from the Oxford Medical Research Council Doctoral Training Partnership and the Nuffield Department of Clinical Medicine. D.J.W. is supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (Grant No. 101237/Z/13/B) and by the Robertson Foundation. A.S.W. is an NIHR Senior Investigator. T.M.W. is a Wellcome Trust Clinical Career Development Fellow (214560/Z/18/Z). A.S.L. is supported by the Rhodes Trust. R.J.W. receives funding from the Francis Crick Institute which is supported by Wellcome Trust, (FC0010218), UKRI (FC0010218), and CRUK (FC0010218). T.C. has received grant funding and salary support from US NIH, CDC, USAID and Bill and Melinda Gates Foundation. The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC (BRC-1215-20008). Parts of the work were funded by the German Center of Infection Research (DZIF). The Scottish Mycobacteria Reference Laboratory is funded through National Services Scotland. The Wadsworth Center contributions were supported in part by Cooperative Agreement No. U60OE000103 funded by the Centers for Disease Control and Prevention through the Association of Public Health Laboratories and NIH/NIAID grant AI-117312. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: E.R. is employed by Public Health England and holds an honorary contract with Imperial College London. I.F.L. is Director of the Scottish Mycobacteria Reference Laboratory. S.N. receives funding from German Center for Infection Research, Excellenz Cluster Precision Medicine in Chronic Inflammation, Leibniz Science Campus Evolutionary Medicine of the LUNG (EvoLUNG)tion EXC 2167. P.S. is a consultant at Genoscreen. 
T.R. is funded by NIH and DoD and receives salary support from the non-profit organization FIND. T.R. is a co-founder, board member and shareholder of Verus Diagnostics Inc, a company that was founded with the intent of developing diagnostic assays. Verus Diagnostics was not involved in any way with data collection, analysis or publication of the results. T.R. has not received any financial support from Verus Diagnostics. UCSD Conflict of Interest office has reviewed and approved T.R.’s role in Verus Diagnostics Inc. T.R. is a co-inventor of a provisional patent for a TB diagnostic assay (provisional patent #: 63/048.989). T.R. is a co-inventor on a patent associated with the processing of TB sequencing data (European Patent Application No. 14840432.0 & USSN 14/912,918). T.R. has agreed to “donate all present and future interest in and rights to royalties from this patent” to UCSD to ensure that he does not receive any financial benefits from this patent. S.S. is working and holding ESOPs at HaystackAnalytics Pvt. Ltd. (Product: Using whole genome sequencing for drug susceptibility testing for Mycobacterium tuberculosis). G.F.G. is listed as an inventor on patent applications for RBD-dimer-based CoV vaccines. The patents for RBD-dimers as protein subunit vaccines for SARS-CoV-2 have been licensed to Anhui Zhifei Longcom Biopharmaceutical Co. Ltd, China. 
Abbreviations: AMI, amikacin; AMR, antimicrobial resistance; AMyGDA, Automated Mycobacterial Growth Detection Algorithm; BDQ, bedaquiline; CFZ, clofazimine; CRyPTIC, Comprehensive Resistance Prediction for Tuberculosis: an International Consortium; CTAB, cetyltrimethylammonium bromide; DLM, delamanid; DR-TB, drug-resistant TB; DST, drug susceptibility testing; ECOFF, epidemiological cutoff; EMB, ethambutol; ETH, ethionamide; FN, false negative; FP, false positive; GCP, genotype confidence percentile; Hr-TB, isoniazid resistant and rifampicin susceptible; INH, isoniazid; KAN, kanamycin; LEV, levofloxacin; LZD, linezolid; MDR, multidrug resistant; ME, major error rate; MIC, minimum inhibitory concentration; MXF, moxifloxacin; NPV, negative predictive value; NRD, new and repurposed drug; PPV, positive predictive value; pre-XDR, pre-extensively drug resistant; RFB, rifabutin; RIF, rifampicin; RRDR, rifampicin resistance–determining region; RMR, rifampicin monoresistant; RR, rifampicin resistant; SNP, single nucleotide polymorphism; TB, tuberculosis; TDR, totally drug resistant; TN, true negative; TP, true positive; VCF, variant call format; VME, very major error rate; WGS, whole-genome sequencing; WHO, World Health Organisation; XDR, extensively drug resistant Introduction Tuberculosis (TB) is a curable and preventable disease; 85% of those afflicted can be successfully treated with a 6-month regimen. Despite this, TB is the world’s top infectious disease killer (current SARS-CoV-2 pandemic excepted) with 10 million new cases and 1.2 million deaths estimated in 2019 alone [1]. Furthermore, drug-resistant TB (DR-TB; please see Table A in S1 File for a list of acronyms used throughout the manuscript) is a continual threat; almost half a million cases resistant to the first-line drug rifampicin (RR-TB) were estimated, with three-quarters of these estimated to be multidrug-resistant (MDR-TB, resistant to first-line drugs isoniazid and rifampicin) [1]. 
Worryingly, only 44% of DR-TB cases were officially notified and just over half of these cases were successfully treated (57%) [1]. To address these issues, the World Health Organisation (WHO) is encouraging the development of better, faster, and more targeted diagnostic and treatment strategies through its EndTB campaign [1,2]. Of particular interest is universal drug susceptibility testing (DST). Conventionally, DST relies on lengthy (4 weeks minimum) culture-based methods that require strict biosafety conditions for Mycobacterium tuberculosis. The development of rapid genetics-based assays has decreased diagnostic time to as little as 2 hours through the detection of specific resistance conferring mutations, e.g., the Cepheid Xpert MTB/RIF test [3,4]. However, assay bias towards specific genic regions can result in misdiagnosis of resistance, the prescription of ineffective treatment regimens, and subsequent spread of MDR disease, as seen during an MDR outbreak in Eswatini [5–7]. Furthermore, detection of rifampicin resistance is used to infer MDR-TB epidemiologically as rifampicin resistance tends to coincide with resistance to isoniazid [8]. While this modus operandi is successful at pragmatically identifying potential MDR cases quickly and effectively, it is not generally true that a single path exists for developing MDR or extensively drug resistant TB (XDR = MDR/RR + resistance to at least 1 fluoroquinolone and either bedaquiline or linezolid). Whole-genome sequencing (WGS) has the potential to reveal the entirety of the M. tuberculosis genetic resistance landscape for any number of drugs simultaneously while enabling a more rapid turnaround time and reduction in cost compared to culture-based DST methods [9]. However, the success of WGS as a diagnostic tool wholly depends on there being a comprehensive and accurate catalogue of resistance-conferring mutations for each drug. 
Recent advances have shown that genotypic predictions of resistance correlate well with DST measurements for first-line drugs [8]. However, the mechanisms of resistance to second-line drugs along with the new and repurposed drugs (NRDs) are less well understood despite their increased administration in clinics as MDR cases climb [1,10]. To address these shortcomings, the Comprehensive Resistance Prediction for Tuberculosis: an International Consortium (CRyPTIC) has collected M. tuberculosis clinical isolates worldwide to survey the genetic variation associated with resistance to 13 antitubercular drugs, specifically the first-line drugs rifampicin, isoniazid, and ethambutol; the second-line drugs amikacin, kanamycin, rifabutin, levofloxacin, moxifloxacin, and ethionamide; and the NRDs bedaquiline, clofazimine, delamanid, and linezolid. Here, we introduce and describe these data in the form of an open-access data compendium of 12,289 isolates, each of which has had its genomic sequence determined and DST profile measured [11]. This compendium is the largest drug screening effort to date for M. tuberculosis in a “one isolate–one microscale assay” format across defined compound concentration ranges. A sampling process was designed to enrich for resistant isolates to account for the variable prevalence of resistance found in different countries and the many rare resistance mutations for several drugs. As a result, the compendium is not suitable for measuring prevalence, or estimating “real-world” error rates of resistance prediction tools; rather, it serves as a resource to accelerate antimicrobial resistance (AMR) diagnostic development by enriching mutation catalogues for WGS resistance prediction, improving our understanding of the genetic mechanisms of resistance, and identifying important diagnostic gaps and drug resistance patterns. 
Indeed, the consortium has begun to address some of these important issues in recent publications using this very compendium [12–16]. Discussion This compendium of M. tuberculosis clinical isolates is the result of an extensive global effort by the CRyPTIC consortium to better map the genetic variation associated with drug resistance. Through its sheer size and by oversampling for resistance, the compendium gives an unparalleled view of resistance and resistance patterns among the panel of 13 antitubercular compounds studied. This study serves to summarise the data within the compendium and to highlight the existence of the open-access resource to the wider community to help better inform future treatment guidelines and steer the development of improved diagnostics. Starting with first-line drugs, molecular based diagnostic assays have vastly improved the detection of and the speed at which we find DR-TB cases, resulting in improved quality of care for patients. However, relying solely on these diagnostic methods has several drawbacks. Aside from the Xpert MTB/RIF assay potentially increasing false positive MDR diagnoses as discussed earlier in the RMR case study, the assay assumes isoniazid resistance upon detection of rifampicin resistance. Thus, less is known about the prevalence of mono-isoniazid resistance or “true” cases of MDR (confirmed rifampicin and isoniazid resistance) [1] and with large datasets such as this compendium, we can further investigate these important and rarer clinical phenotypes (like that of RMR in our case study). Another example of a rarer phenotype is that of isoniazid-resistant and rifampicin-susceptible (Hr-TB) isolates; a greater number of these were contributed by CRyPTIC countries than RMR isolates (n = 1,470 versus n = 302), a pattern also recently observed in a global prevalence study [38]. 
A modified 6-month treatment regimen is now recommended for Hr-TB (rifampicin, ethambutol, levofloxacin, and pyrazinamide), and as a result of inadequate diagnosis, many of the 1.4 million global Hr-TB estimated cases would have received inadequate and unnecessarily longer treatment regimens [1,39]. Encouragingly, CRyPTIC isolates with an Hr-TB background exhibited relatively low levels of resistance to other antitubercular drugs, including those in the augmented regimen (Fig 4C). However, without appropriate tools to assess and survey this, we will continue to misdiagnose and ineffectively treat these clinical cases. In 2018, CRyPTIC and the 100,000 Genomes project demonstrated that genotypic prediction from WGS correlates well with culture-based phenotype for first-line drugs, which is reflected in our summary of the genetic catalogue applied to this dataset (Table 3) [8]. While predictions can be made to a high level of sensitivity and specificity, there is still more to learn, as exemplified by the isolates in the compendium that despite being resistant to rifampicin and isoniazid could not be described genetically (Table 2). This shortfall, along with the limitations of molecular based diagnostic assays, highlights the need for continual genetic surveillance and shines a favourable light on a WGS-led approach. A strength of this compendium lies with the data collated for second-line drugs. A greater proportion of drug-resistant isolates had additional resistance to fluoroquinolones than second-line injectable drugs (Fig 4A). This could be due to more widespread use of fluoroquinolones as well as their ease of administration and hence them being recommended over injectables for longer MDR treatment regimens [1]. 
Concerningly, we found that resistance to levofloxacin and moxifloxacin, and to kanamycin and amikacin, was more common than resistance to the mycobacterial-specific drug ethambutol in an isoniazid- and rifampicin-susceptible background (Fig 4B), suggesting a level of preexisting resistance to second-line drugs. This concurs with a systematic review that found patients previously prescribed fluoroquinolones were 3 times more likely to have fluoroquinolone-resistant TB [40]. Careful stewardship of fluoroquinolones, both in TB and other infectious diseases, will be paramount for the success of treatment regimens. Despite variability in sample collection, we observed high proportions of fluoroquinolone-resistant MDR/RR isolates from some countries and therefore suggest that MDR treatment regimens could be improved by optimisation on a geographic basis. Further treatment improvement could also be made by the selection of appropriate drugs from each class. For example, the WHO recommends switching from kanamycin to amikacin when treating MDR TB patients [39], and the compendium supports this recommendation as we saw more resistance to kanamycin than amikacin in all phenotypic backgrounds. For fluoroquinolones, more isolates were resistant to levofloxacin than moxifloxacin in all phenotypic backgrounds, suggesting moxifloxacin may be the most appropriate fluoroquinolone to recommend, although we note this conclusion is critically dependent on the validity of the cutoff, here an ECOFF, used to infer resistance. However, the amenability of drugs to catalogue-based genetic diagnostics is also an important consideration, and our data suggest levofloxacin resistance could be predicted more reliably than moxifloxacin, with fewer false positives predicted (Table 2). Testing for fluoroquinolone resistance using molecular diagnostic tests remains limited. 
Global data from the past 15 years suggest that the proportion of MDR/RR TB cases resistant to fluoroquinolones sits at around 20%, with these cases primarily found in regions of high MDR-TB burden [1]. While recently approved tools, such as the Cepheid Xpert MTB/XDR cartridge, will permit both isoniazid and fluoroquinolone testing to be increased, the same pitfalls are to be encountered regarding targeted diagnostic assays [41]. In contrast, the genetic survey in this study demonstrates the potential of WGS for genetic prediction of resistance to second-line drugs, and studies within the consortium to investigate this are underway. The data compendium has facilitated the first global survey of resistance to NRDs. Reassuringly, prevalence of resistance to the NRDs was substantially lower than for first- and second-line agents in the dataset (Fig 3A), and resistance to the new drugs bedaquiline and delamanid was less common than the repurposed drugs clofazimine and linezolid in an MDR/RR background (Fig 4C). However, the presence of higher levels of delamanid and clofazimine resistance than ethambutol resistance in the isoniazid- and rifampicin-susceptible background does suggest some preexisting propensity towards NRD resistance (Fig 4B). Coresistance between NRDs was seen in isolates in the compendium, the most common being isolates resistant to both bedaquiline and clofazimine. This link is well documented and has been attributed to shared resistance mechanisms such as nonsynonymous mutations in rv0678, which were found in both clofazimine- and bedaquiline-resistant isolates in the compendium [31] (Fig 5B and 5C). Increased clofazimine use could further increase the prevalence of M. tuberculosis isolates with clofazimine and bedaquiline coresistance, limiting MDR treatment options including using bedaquiline as the backbone of a shorter MDR regimen [42]. Therefore, proposed usage of clofazimine for other infectious diseases should be carefully considered. 
WHO recommends against the use of bedaquiline and delamanid in combination to prevent the development of coresistance, which could occur relatively quickly [43]; the rate of spontaneous evolution of delamanid resistance in vitro has been shown to be comparable to that of isoniazid, and, likewise, bedaquiline resistance arises at a comparable rate to rifampicin resistance [44]. In this compendium, 12.9% of bedaquiline-resistant isolates were resistant to delamanid, and 7.1% of delamanid-resistant isolates were resistant to bedaquiline. Several scenarios could account for this, including the presence of shared resistance mechanisms. For example, as bedaquiline targets energy metabolism within the cell, changes to cope with energy/nutrient imbalances upon the acquisition of resistance-associated ATPase pump mutations may result in cross resistance to delamanid in a yet unknown or unexplored mechanism [13]. It is imperative that genetic determinants of resistance are fully explored for the NRDs, as these are our current treatments of last resort, with special attention given to those mechanisms that could be shared with other agents. In the meantime, careful stewardship and phenotypic and genotypic surveillance of the NRDs should be implemented, including linezolid and clofazimine, which are now group A and B drugs, respectively, for MDR treatment [1]. Several research avenues are being actively explored by the CRyPTIC consortium that make further use of this compendium, including the following: (i) relating genetic mutations to quantitative changes in the MICs of different drugs [13]; (ii) genome-wide association studies [15]; (iii) training machine learning models that can predict resistance [14]; and (iv) exploration of the genetic determinants of resistance to second line and NRDs [16]. Collectively, these studies share the same aim of facilitating the implementation of WGS-directed resistance prediction in the clinic. 
Finally, we urge other researchers to explore and analyse this large dataset of M. tuberculosis clinical isolates and hope it will lead to a wave of new and insightful studies that will positively serve the TB community for years to come. Methods Ethics Approval for the CRyPTIC study was obtained from Taiwan Centers for Disease Control IRB No. 106209, University of KwaZulu Natal Biomedical Research Ethics Committee (UKZN BREC) (reference BE022/13) and University of Liverpool Central University Research Ethics Committees (reference 2286), Institutional Research Ethics Committee (IREC) of The Foundation for Medical Research, Mumbai (Ref nos. FMR/IEC/TB/01a/2015 and FMR/IEC/TB/01b/2015), Institutional Review Board of P.D. Hinduja Hospital and Medical Research Centre, Mumbai (Ref no. 915-15-CR [MRC]), scientific committee of the Adolfo Lutz Institute (CTC-IAL 47-J / 2017) and in the Ethics Committee (CAAE: 81452517.1.0000.0059) and Ethics Committee review by Universidad Peruana Cayetano Heredia (Lima, Peru) and LSHTM (London, UK), Institutional Review Board at Pham Ngoc Thach Hospital, HCMC, Vietnam and Oxford Tropical Research Ethics Committee, UK, University of the Witwatersrand, Johannesburg Human Research Ethics Committee (Medical) (M160667), Technical Scientific Council (CTC-IAL no. 47-J / 2017) and Research Ethics Committee (CAAE 81452517.1.0000.0059) of Adolfo Lutz Institute, and University of Cape Town Faculty of Health Sciences Research Ethical Committee approvals (HREC 012/2007, 057/2013). The CRyPTIC study involves analysis of microbiological isolates only; there is no associated data on patients. The study aggregates isolates from previous studies (which had previously obtained IRB approval) and also collected its own samples. Each IRB listed above assessed the protocol and saw no need for individual consent as only microbiological isolates were being analysed, and no personally identifiable information or host genetic data was used. 
In the remaining jurisdictions (IML Gauting (Germany), Public Health Scotland, Public Health Sweden, San Raffaele Scientific Institute, Italy), no IRB approval (and no individual patient consent) was required for studies analysing routinely collected microbiological isolates only. Sample collection This study was designed to identify as many drug resistance mechanisms and mutations as possible. Given that for many drugs there is a long tail of rare mutations present at different frequencies in different countries, sample collection was biased towards collecting resistant isolates, with (wherever possible) temporally and geographically matched susceptibles. Therefore, with 4 exceptions, all collecting sites either sequenced all culturable isolates, a random subsample of all culturable isolates (tailored to budget) or sequenced a subsample of all resistant with matched susceptible samples. The exceptions were as follows: 102 samples from South Africa, which were a clinical cohort recruited on the basis of the health service classifying them as RR; the first 1,000 out of 2,944 samples from Peru were a historical freezer collection with heterogeneous sampling process (and the remainder were sampled prospectively, at random); Brazilian samples combined all stored resistant samples, with prospectively sampled pan-susceptibles; in addition to sampling prospectively and retrospectively (from freezers) enriching for resistance and matched susceptibles, at IML Gauting (Germany) and National Institute for Communicable Diseases (Johannesburg), all isolates resistant to a NRD were included. Chinese isolates were collected according to the following strategy: collecting subsites were randomly selected from the 72 counties, and then all culture-positive isolates were included. A broad breakdown of sampling approaches is included in Table B in S1 File. 
Plate assay The CRyPTIC consortium designed 2 versions of the Sensititre MYCOTB plate (Thermo Fisher Scientific, USA) named the “UKMYC5” and “UKMYC6” microtitre plates [11,12]. These plates contain 5 to 10 doubling dilutions of 13 antibiotics (rifampicin (RIF), rifabutin (RFB), isoniazid (INH), ethambutol (EMB), levofloxacin (LEV), moxifloxacin (MXF), amikacin (AMI), kanamycin (KAN), ethionamide (ETH), clofazimine (CFZ), linezolid (LZD), delamanid (DLM), and bedaquiline (BDQ)). DLM and BDQ were provided by Otsuka Pharmaceutical and Janssen Pharmaceuticals, respectively. The UKMYC5 plate also contained para-aminosalicylic acid (PAS), but the MICs were not reproducible, and, hence, it was excluded from the UKMYC6 plate design and is not included in any subsequent analysis [11]. A standard operating protocol for sample processing was defined by CRyPTIC as previously described [11,12]. Clinical samples were subcultured using 7H10 agar plates, Lowenstein–Jensen tubes, or MGIT tubes. Bacterial cell suspensions (0.5 McFarland standard, saline Tween) prepared from (no later than) 14-day-old colonies were diluted 100X in 10 ml enriched 7H9 broth prior to plate inoculation. A semiautomated Sensititre Autoinoculator (Thermo Fisher Scientific, USA) was used to inoculate 100 μl prepared cell suspensions (1.5 × 10^5 CFU/ml [5 × 10^4 CFU/ml to 5 × 10^5 CFU/ml]) into each well of a UKMYC5/6 microdilution plate. The plate was sealed and incubated for 14 days at 37°C. Quality control runs were performed periodically using M. tuberculosis H37Rv ATCC 27294, which is sensitive to all drugs on the plates. Minimum inhibitory concentration (MIC) measurements MICs for each drug were read after incubation for 14 days by a laboratory scientist using a Thermo Fisher Sensititre Vizion digital MIC viewing system [11]. 
The Vizion apparatus was also used to take a high contrast photograph of the plate with a white background, from which the MIC was measured again using the Automated Mycobacterial Growth Detection Algorithm (AMyGDA) software [45]. The AMyGDA algorithm was specifically developed to automate and perform quality control of MIC measurements and to facilitate machine learning studies within the consortium. AMyGDA detects the boundaries of each well using a Hough transform for circles and measures growth as the number of dark pixels within the area contained by this boundary. All images where the MICs measured by Vizion and AMyGDA were different were uploaded to a citizen science project, BashTheBug, on the Zooniverse platform [46]. Each image was then classified by ≥11 volunteers and the median classification taken. MICs were then classified as high (at least 2 methods concur on the MIC), medium (either a scientist recorded a MIC measurement using Vizion but did not store the plate picture, or Vizion and AMyGDA disagree and there is no BashTheBug measurement), or low (all 3 methods disagree) quality. To ensure adequate data coverage for this study, we took the MIC from the Vizion reading provided by the trained laboratory scientist if it was annotated as having medium or low quality. Binary phenotype classification Binary phenotypes (resistant/susceptible) were assigned from the MICs by applying epidemiological cutoff (ECOFF) values [12]; samples with MICs at or below the ECOFF are, by definition, wild-type and hence assigned to be susceptible to the drug in question [12]. Samples with MICs above the ECOFF are therefore classified as resistant (Fig A and Table C in S1 File). Please see [12] for the body of work supporting the use of the ECOFF relative to the compendium isolates and Table C in S1 File for the ECOFFs for each drug tested. 
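The quality tiers and the ECOFF rule described above can be illustrated with a minimal Python sketch. It is not the consortium's code: the function names, the representation of MICs as plain numbers (real plate readings can be censored at the ends of the dilution range), and the choice to return the agreed value for high-quality readings are all assumptions made for illustration. The ECOFF passed in the usage note is a placeholder, not a value from Table C in S1 File.

```python
from typing import Optional, Tuple


def mic_quality(vizion: float,
                amygda: Optional[float] = None,
                bashthebug: Optional[float] = None) -> Tuple[str, float]:
    """Return (quality tier, MIC to use) for one drug on one plate.

    vizion     : MIC read by the laboratory scientist (always present).
    amygda     : MIC from the plate photograph, or None if no image was stored.
    bashthebug : median volunteer MIC, or None if the image was not classified.
    """
    if amygda is None:
        # Scientist reading only, no stored photograph: medium quality.
        return "medium", vizion
    if vizion == amygda:
        # Two methods concur: high quality.
        return "high", vizion
    if bashthebug is None:
        # Vizion and AMyGDA disagree and there is no third reading: medium.
        return "medium", vizion
    if bashthebug in (vizion, amygda):
        # The third reading sides with one of the other two: high quality.
        # Returning the agreed value is an assumption of this sketch.
        return "high", bashthebug
    # All three methods disagree: low quality, fall back to the Vizion reading.
    return "low", vizion


def binary_phenotype(mic: float, ecoff: float) -> str:
    """MIC at or below the ECOFF is susceptible ('S'); above it is resistant ('R')."""
    return "S" if mic <= ecoff else "R"
```

For example, `mic_quality(0.5, 1.0, 2.0)` returns `("low", 0.5)`, and with a placeholder ECOFF of 0.5 the call `binary_phenotype(1.0, 0.5)` returns `"R"`.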
Genomic data processing and variant calling

Clinical samples were subcultured using Lowenstein–Jensen tubes, 7H10 agar plates, or MGIT tubes for no more than 14 days prior to DNA extraction using either the FastPrep-24 instrument (MP Biomedicals) for cell lysis and ethanol precipitation or the cetyltrimethylammonium bromide (CTAB) method. Paired-end libraries were prepared using a Nextera XT DNA sample preparation kit (Illumina, San Diego, CA, USA) and were sequenced on Illumina instruments.

The resulting FASTQ files were processed using the bespoke pipeline Clockwork (v0.8.3, github.com/iqbal-lab-org/clockwork; [47]). Briefly, all raw sequencing files were indexed into a relational database with which Clockwork proceeds. Human, nasopharyngeal flora, and human immunodeficiency virus–related reads were removed, and remaining reads were trimmed (adapters and low-quality ends) using Trimmomatic and mapped with BWA-MEM to the M. tuberculosis H37Rv reference genome (NC_000962.3) [48,49]. Read duplicates were removed. Genetic variants were called independently using Cortex and SAMtools, 2 variant callers with orthogonal strengths (SAMtools, a high-sensitivity SNP caller, and Cortex, a high-specificity SNP and indel caller) [50,51]. The 2 call sets were merged to produce a final call set, using the Minos adjudication tool (v0.11.0) to resolve loci where the 2 callers disagreed, by remapping reads to an augmented genome containing each alternative allele [24]. Default filters of a minimum depth of 5×, a fraction of supporting reads of 0.9 (Minos) and a genotype confidence percentile (GCP) filter of 0.5 were applied. The GCP filter is a normalised likelihood ratio test, giving a measure of confidence in the called allele compared with the other alternatives, and is described in [24]. This produced one variant call format (VCF) file per sample, each only describing positions where that sample differed from the reference.
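As a rough illustration of the three default filters quoted above (minimum depth 5×, fraction of supporting reads 0.9, GCP 0.5), the following Python sketch applies them to simplified call records. It is not Clockwork or Minos code: the `Call` class, its field names, and the use of inclusive thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; treating them as inclusive is an assumption.
MIN_DEPTH = 5   # minimum read depth at the site
MIN_FRS = 0.9   # minimum fraction of reads supporting the called allele
MIN_GCP = 0.5   # minimum genotype confidence percentile


@dataclass
class Call:
    """Hypothetical, simplified stand-in for one genotyped VCF record."""
    pos: int
    depth: int
    frs: float
    gcp: float


def passes_filters(call: Call) -> bool:
    """True only if the call clears all three filters."""
    return (
        call.depth >= MIN_DEPTH
        and call.frs >= MIN_FRS
        and call.gcp >= MIN_GCP
    )


# Usage with made-up calls: only the first survives.
calls = [
    Call(pos=1000, depth=40, frs=0.98, gcp=75.0),
    Call(pos=2000, depth=4, frs=1.00, gcp=90.0),   # too shallow
    Call(pos=3000, depth=60, frs=0.70, gcp=80.0),  # mixed support
    Call(pos=4000, depth=60, frs=0.95, gcp=0.1),   # low confidence
]
kept = [c.pos for c in calls if passes_filters(c)]
print(kept)
```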
These filtered VCFs were then combined to produce a single nonredundant list of all variants seen in the cohort. All samples were then processed a second time with Minos, remapping reads to a graphical representation of all the segregating variation within the cohort, generating VCF files that had an entry at all variable positions (thus for all samples, most positions would be genotyped as having the reference allele). These “regenotyped” VCFs were later used to calculate pairwise distances (see below). Please refer to Supplemental Method A in S1 File for commands used to generate the per-sample and regenotyped VCF files.

To remove untrustworthy loci, a genome mask was applied to the resulting VCF files (regions identified with self-blast matches in [52], comprising 324,971 bp of the reference genome). Furthermore, positions at which fewer than 90% of total samples passed the default Clockwork/Minos variant call filters (described above) were filtered out, comprising 95,703 bp of the genome, of which 55,980 bp intersect with the genome mask.

Resistance prediction using a genetic catalogue

A hybrid catalogue of genetic variants associated with resistance to first- and second-line drugs based on existing catalogues was created and can be found at github.com/oxfordmmm/tuberculosis_amr_catalogues/blob/public/catalogues/NC_000962.3/NC_000962.3_CRyPTIC_v1.311_GARC1_RUS.csv [8,26]. We specifically did not use the recent WHO catalogue to avoid circularity and overtraining, as that catalogue was developed (via prior literature, expert rules, and a heuristic algorithm) based partially on these isolates [25]. The resulting VCF file for each isolate (see “Genomic data processing and variant calling” section above) was compared to the genetic catalogue to determine the presence or absence of resistance-associated mutations for 8 drugs: RIF, INH, EMB, LEV, MXF, AMI, KAN, and ETH.
We did not apply the approach used in [8] to make a prediction if a novel mutation was detected in a known resistance gene, as we simply wanted to measure how well a pre-CRyPTIC catalogue could predict resistance in the compendium. These results (found in PREDICTIONS.csv; see “Data availability” section for access) were then compared to the binary phenotypes (see “Binary phenotype classification” section for how these were defined) with the following metrics calculated: TP, the number of phenotypically resistant samples that are correctly identified as resistant (“true positives”); FP, the number of phenotypically susceptible samples that are falsely identified as resistant (“false positives”); TN, the number of phenotypically susceptible samples that are correctly identified as susceptible (“true negatives”); FN, the number of phenotypically resistant samples that are incorrectly identified as susceptible (“false negatives”); VME, very major error rate (false-negative rate), 0 to 1; ME, major error rate (false-positive rate), 0 to 1; PPV, positive predictive value, 0 to 1; NPV, negative predictive value, 0 to 1.

Phylogenetic tree construction

A pairwise genetic distance matrix was constructed for 15,211 isolates by comparing pairs of regenotyped VCF files (see “Genomic data processing and variant calling” section above for more details). A neighbour-joining tree was constructed from the distance matrix using quicktree [53]. Tree visualisation and annotation were performed using the R library ggtree [54]. M. tuberculosis lineages were assigned using Mykrobe and are represented by the coloured dots at the branch termini of the tree [24]. For isolates that had “mixed” lineage classification (i.e., 2 lineages were found present in the sample by Mykrobe, n = 225, 1.5%), the first of the 2 lineages was assigned to the isolate. ggtree was also used to construct the trees depicting BDQ-, CFZ-, and DLM-resistant isolates.
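The catalogue-based prediction and the error metrics defined in the resistance prediction section above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code behind PREDICTIONS.csv: the toy catalogue, the mutation strings, and all function names are illustrative, and treating an isolate with no catalogued mutation as predicted susceptible is an assumption consistent with no prediction being made from novel mutations.

```python
# Toy catalogue (drug -> resistance-associated mutations). Illustrative only;
# it is not the content of the CRyPTIC catalogue file.
TOY_CATALOGUE = {"RIF": {"rpoB@S450L"}, "INH": {"katG@S315T"}}


def predict(mutations, drug, catalogue=TOY_CATALOGUE):
    """Predict "R" if any catalogued mutation for the drug is present, else "S".

    Uncatalogued (novel) mutations never trigger a resistant call here.
    """
    return "R" if set(mutations) & catalogue.get(drug, set()) else "S"


def confusion(pairs):
    """Count TP, FP, TN, FN from (predicted, observed) pairs of "R"/"S"."""
    tp = fp = tn = fn = 0
    for predicted, observed in pairs:
        if observed == "R":
            if predicted == "R":
                tp += 1
            else:
                fn += 1
        elif predicted == "R":
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def metrics(tp, fp, tn, fn):
    """VME = false-negative rate, ME = false-positive rate, plus PPV and NPV."""
    return {
        "VME": ratio(fn, tp + fn),
        "ME": ratio(fp, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }


# Usage with made-up isolates: (mutations found, observed RIF phenotype).
isolates = [
    (["rpoB@S450L"], "R"),
    (["toy_novel_mutation"], "R"),  # resistant but not catalogued: a false negative
    ([], "S"),
    ([], "S"),
]
pairs = [(predict(muts, "RIF"), observed) for muts, observed in isolates]
print(metrics(*confusion(pairs)))
```

Returning `None` for an empty denominator keeps a drug with no resistant (or no susceptible) isolates from raising a division error; how such cases were handled in the study is not stated in the text.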
The data

Data are available from ftp.ebi.ac.uk/pub/databases/cryptic/release_june2022/. The FTP site contains 2 top level directories: “reuse” and “reproducibility”. All data for this study were analysed and visualised using either R or python3 libraries and packages. See github.com/kerrimalone/Brankin_Malone_2022 for codebase.

“reuse” directory

We point the reader to this directory to gain access to CRyPTIC project data. “CRyPTIC_reuse_table_20221019.csv” contains genotypic and phenotypic data relating to the figures and summaries listed in this manuscript and is what we present as a general use reference table for most future projects. It includes binary phenotypes (R/S), MICs, phenotype quality metrics, and ENA sample IDs for 12,289 compendium isolates (see “Quality assurance of the minimum inhibitory concentrations for 13 drugs” section below in Results for filters applied to obtain this final set of isolates). It also includes file paths to each isolate’s VCF file and “regenotyped” VCF file (VCF files that have an entry at all variable positions; see “Genomic data processing and variant calling” section above for more). “CRyPTIC_excluded_samples_20220607.tsv” contains the ENA accession numbers and file paths to each isolate’s VCF file and “regenotyped” VCF file for the 2,922 samples that were sequenced but have no associated MIC data.

“reproducibility” directory

This directory contains the data used for multiple CRyPTIC project publications referenced throughout this manuscript. As stated above, each project has taken slightly different subsets of these data as documented in those papers. For example, see in Fig 1 how tables such as “MUTATIONS.csv” and “GENOTYPES.csv” (along with others) were used and filtered in this study to obtain the reuse file “CRyPTIC_reuse_table_20221019.csv”. Again, for optimal use of CRyPTIC data in your own project, please refer to “CRyPTIC_reuse_table_20221019.csv” in the “reuse” directory.
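For readers who want to start from the reuse table, the sketch below shows one way to tally binary phenotypes for a drug while restricting to chosen phenotype quality tiers. It runs on a small in-memory stand-in rather than the real file, and the column names (`UNIQUEID`, `RIF_BINARY_PHENOTYPE`, `RIF_MIC`, `RIF_PHENOTYPE_QUALITY`) and quality labels are assumptions; check the header of the downloaded CSV before adapting it.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the reuse table. Column names and values are assumed
# for illustration and may not match the released file.
TOY_TABLE = """UNIQUEID,RIF_BINARY_PHENOTYPE,RIF_MIC,RIF_PHENOTYPE_QUALITY
toy.1,R,>8,HIGH
toy.2,S,0.12,HIGH
toy.3,S,0.25,MEDIUM
toy.4,R,4,LOW
"""


def phenotype_counts(handle, drug, qualities=("HIGH",)):
    """Count R/S calls for one drug, keeping only rows in the given quality tiers."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if row[f"{drug}_PHENOTYPE_QUALITY"] in qualities:
            counts[row[f"{drug}_BINARY_PHENOTYPE"]] += 1
    return counts


# Usage: high-quality phenotypes only, then every tier.
high_only = phenotype_counts(io.StringIO(TOY_TABLE), "RIF")
all_tiers = phenotype_counts(
    io.StringIO(TOY_TABLE), "RIF", qualities=("HIGH", "MEDIUM", "LOW")
)
print(dict(high_only), dict(all_tiers))
```

To run the same function on the real table, replace the `io.StringIO` object with an opened file handle, for example `open(path, newline="")`.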
Author Contributions

Conceptualisation: Daniela M Cirillo, Derrick W. Crook, Philip W Fowler, Sarah Hoosdally, Ana Luíza Gibertoni Cruz, Nazir A. Ismail, Stefan Niemann, Zamin Iqbal, Tim E.A. Peto, A Sarah Walker, Timothy M Walker
Data Curation: Philip W Fowler, Sarah J Hoosdally, Ana Luíza Gibertoni Cruz, Alice Brankin, Kerri M. Malone, Zamin Iqbal, Martin Hunt and Jeff Knaggs
Formal Analysis: Alice Brankin, Kerri M. Malone
Funding acquisition: Camilla Rodrigues, David Moore, Derrick W. Crook, Daniela M. Cirillo, Zamin Iqbal, Nazir A. Ismail, Nerges Mistry, Stefan Niemann, Tim E.A. Peto, Guy Thwaites, A. Sarah Walker, Timothy M Walker, Daniel J. Wilson
Investigation: Alice Brankin, Kerri M. Malone
Project Administration: Daniela M. Cirillo, Derrick W Crook, Philip W Fowler, Sarah Hoosdally, Zamin Iqbal, Tim E.A. Peto, Aysha Roohi
Resources: The CRyPTIC Consortium
Software: Philip W Fowler, Martin Hunt, Jeff Knaggs, Brice Letcher
Supervision: Daniela M. Cirillo, Derrick W Crook, Philip W Fowler, Zamin Iqbal, Tim E.A. Peto, Daniel J. Wilson
Validation: Emanuele Borroni, Daniela M Cirillo, Philip W. Fowler, Clara Grazian, Sarah J. Hoosdally, Martin Hunt, Timothy E. A. Peto, Paola M. V. Rancoita
Visualisation: Alice Brankin, Kerri M. Malone
Writing - original draft preparation: Alice Brankin, Kerri M. Malone
Writing - review and editing: Alice Brankin, Kerri M. Malone, Philip W. Fowler, Zamin Iqbal

Supporting information

S1 File. Supporting information.
Text A. Acknowledgements.
Text B. Lineages of the M. tuberculosis isolates of the compendium.
Table A. Acronyms used in this manuscript.
Table B. Sampling strategies at different collection sites.
Table C. Epidemiological cutoff values (ECOFFs) used to binarize MIC measurements into resistant and susceptible.
Table D. Lineages versus geographical location of origin/contribution for CRyPTIC isolates.
Table E. Sublineages versus geographical location of origin/contribution for CRyPTIC isolates.
Table F. Sample information for isolates classified as resistant to all 13 drugs tested.
Table G. Co-occurrence of antibiotic resistance in CRyPTIC M. tuberculosis isolates.
Supplemental Method A. Generating per-sample and regenotyped VCF files.
Fig A. Per-drug MIC distributions of isolates plated on CRyPTIC-designed variations of the Thermo Fisher Sensititre MYCOTB MIC plate; UKMYC5 (A) and UKMYC6 (B).
Fig B. Geographical distribution of 15,211 CRyPTIC M. tuberculosis clinical isolates.
Fig C. A significant association between country and lineage can be seen in the CRyPTIC data.
Fig D. Phylogenetic tree of CRyPTIC M. tuberculosis clinical isolates.
Fig E. Nonsynonymous mutations found outside the RRDR of rpoB in RMR isolates and MDR isolates.
https://doi.org/10.1371/journal.pbio.3001721.s001 (DOCX)

Acknowledgments

We thank Faisal Masood Khanzada and Alamdar Hussain Rizvi (NTRL, Islamabad, Pakistan), Angela Starks and James Posey (Centers for Disease Control and Prevention, Atlanta, USA), Juan Carlos Toro and Solomon Ghebremichael (Public Health Agency of Sweden, Solna, Sweden), and Iñaki Comas and Álvaro Chiner-Oms (Instituto de Biología Integrativa de Sistemas, Valencia, Spain; CIBER en Epidemiología y Salud Pública, Valencia, Spain; Instituto de Biomedicina de Valencia, IBV-CSIC, Valencia, Spain). We thank the Wadsworth Center Applied Genomic Technologies Core Facility and the Wadsworth Center Bioinformatics Core for additional support for sequencing and analysis and SYNLAB Holding Germany GmbH for its direct and indirect support of research activities in the Institute of Microbiology and Laboratory Medicine Gauting. N.R. thanks the Programme National de Lutte contre la Tuberculose de Madagascar.
[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001721 Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.