(C) PLOS One This story was originally published by PLOS One and is unaltered. . . . . . . . . . . The pan-genome of Aspergillus fumigatus provides a high-resolution view of its population structure revealing high levels of lineage-specific diversity driven by recombination [1] ['Lotus A. Lofgren', 'Department Of Microbiology', 'Plant Pathology', 'University Of California Riverside', 'Riverside', 'California', 'United States Of America', 'Department Of Biology', 'Duke University', 'Durham'] Date: 2022-11 Aspergillus fumigatus is a deadly agent of human fungal disease where virulence heterogeneity is thought to be at least partially structured by genetic variation between strains. While population genomic analyses based on reference genome alignments offer valuable insights into how gene variants are distributed across populations, these approaches fail to capture intraspecific variation in genes absent from the reference genome. Pan-genomic analyses based on de novo assemblies offer a promising alternative to reference-based genomics with the potential to address the full genetic repertoire of a species. Here, we evaluate 260 genome sequences of A. fumigatus including 62 newly sequenced strains, using a combination of population genomics, phylogenomics, and pan-genomics. Our results offer a high-resolution assessment of population structure and recombination frequency, phylogenetically structured gene presence–absence variation, evidence for metabolic specificity, and the distribution of putative antifungal resistance genes. Although A. fumigatus disperses primarily via asexual conidia, we identified extraordinarily high levels of recombination with the lowest linkage disequilibrium decay value reported for any fungal species to date. We provide evidence for 3 primary populations of A. fumigatus, with recombination occurring only rarely between populations and often within them. These 3 populations are structured by both gene variation and distinct patterns of gene presence–absence with unique suites of accessory genes present exclusively in each clade. Accessory genes displayed functional enrichment for nitrogen and carbohydrate metabolism suggesting that populations may be stratified by environmental niche specialization. Similarly, the distribution of antifungal resistance genes and resistance alleles were often structured by phylogeny. Altogether, the pan-genome of A. fumigatus represents one of the largest fungal pan-genomes reported to date including many genes unrepresented in the Af293 reference genome. These results highlight the inadequacy of relying on a single-reference genome-based approach for evaluating intraspecific variation and the power of combined genomic approaches to elucidate population structure, genetic diversity, and putative ecological drivers of clinically relevant fungi. Funding: LAL, JES, and RAC are supported by funding from the National Institutes of Health (grant no. R01AI130128), to RAC, with additional support to LAL and JES from a UC Riverside-City of Hope seed grant to JES. JES is a CIFAR Fellow in the program Fungal Kingdom: Threats and Opportunities and was supported by funding from National Science Foundation (grants no. DEB-1441715 and DEB-1557110). All analyses and data management was performed on the High-Performance Computing Center at the University of California, Riverside, in the Institute of Integrative Genome Biology, supported by grants from NSF (DBI-1429826) and NIH (S10-OD016290). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Data Availability: All newly sequenced genomes have been deposited into GenBank: accession numbers for each sequencing project can be found in S1 Table . All programing scripts associated with this project are available at: https://github.com/MycoPunk/Afum_PopPan . The definitive version of the pipeline used for VCF generation can be accessed at https://github.com/stajichlab/PopGenomics_Afumigatus_Global . All data associated with this project are available from DOI: 10.5281/zenodo.5775265 . A sexual cycle has been documented in A. fumigatus [ 36 ]. However, the frequency of recombination events and their capacity to shape A. fumigatus populations is unknown. Whereas widely distributed generalist species with large effective population sizes are assumed to have larger pan-genomes, frequent clonal reproduction is thought to limit pan-genome size [ 21 , 43 ]. Generally, the evolution of large pan-genomes implies frequent gene flow, which may take the form of sexual or pseudo-sexual exchange, or significant horizontal gene transfer [ 44 ]. In open pan-genomes, the ratio of core/accessory genes is expected to decrease with an increasing number of genomes analyzed. Conversely, closed pan-genomes will quickly reach a saturation point at which adding new genomes to the analysis adds few new genes to the total pool. It is unclear how a species like A. fumigatus, which demonstrates ubiquitous geographic distribution, assumed substrate generalism, primary clonal reproduction, but potentially high recombination rates [ 45 ], fits into these expectations. Because A. fumigatus is thought to reproduce primarily clonally (and rarely sexually) [ 36 ], population structure is likely to be complicated by local clonal population bursts followed by dispersal. One well-studied example of this is the movement of cyp51A resistance alleles across the world. The cyp51A gene encodes 14-alpha sterol demethylase and is the primary drug target for azole antifungals [ 37 ]. Environmental pressures for allelic variation at this critical azole target are hypothesized to come primarily from the widespread use of agricultural antifungals [ 38 ]. Human-to-human transmission was originally thought to represent a dead-end for the species; however, recent evidence support occasional hospital-related transmission [ 39 , 40 ] and the possibility for clinical antifungal treatment to represent a secondary source of antifungal pressure. Although the cyp51A mutation has been found in multiple genetic backgrounds [ 30 ], the distribution of drug-resistant isolates across the phylogeny is nonrandom, with resistant strains showing low levels of genetic diversity and close genetic relatedness, likely representing high levels of clonal reproduction and a selective sweep for resistant phenotypes [ 32 , 34 ]. Although cyp51A mutations are the best studied drug resistance mechanism in A. fumigatus, mutations in cyp51A only account for an estimated 43% of resistant isolates [ 41 ]. While many other genes have been implicated in azole resistance [ 42 ], it is unknown whether mutations in these genes are similarly structured by phylogeny. In addition to its importance as a pathogen of humans and other animals [ 22 – 24 ], A. fumigatus is an ubiquitous plant litter saprophyte and plays a substantial ecological role in carbon and nutrient cycling, enabled by the wide range of carbohydrate-active enzymes encoded in the genome and involved in the decay of organic matter [ 2 , 25 ]. The fungus primarily reproduces asexually, via the production of prolific, stress-resistant, hydrophobic conidia [ 26 ]. This combination of traits contribute to the high dispersibility of the species, where conidia are rendered airborne by the slightest wind currents or easily carried to new locations by water, swarming soil bacteria, and soil invertebrates [ 27 ]. Due to this exceptional dispersibility and the assumption of extreme substrate generalism, A. fumigatus was originally thought to represent a single homogenous population [ 28 , 29 ]. However, recent investigations into A. fumigatus population structure have found mixed evidence for population stratification with results heavily dependent on the methods used for analysis and the number of isolates under consideration [ 30 – 33 ]. Interestingly, previous studies have shown little to no correlation of population structure with geography [ 32 , 34 ], as genetically identical (clonal) isolates collected from disparate locations across the globe [ 32 ]. This lack of geographic structure would seem to support a model of panmixia, but geography is not the only potential force structuring fungal populations, which must adapt to a plethora of environmental stressors and niche opportunities. However, niche specificity and the potential for environmental drivers to structure population stratification in A. fumigatus have yet to be investigated. Given its primary ecological strategy as a plant litter saprophyte, we hypothesized that population structure in A. fumigatus would be underpinned by metabolic specificity, with population-specific genomic variation concentrated in genes relevant to nutrient mining. Such metabolic variation may have evolved to access different plant substrates, but have important implications for human and animal disease [ 35 ]. The concept of a pan-genome, here defined as all genetic elements present across a species [ 14 ], recognizes that while many genes are fixed within a population and present in all individuals (core genes), others display substantial presence–absence variation (dispensable or accessory genes). Core genes are expected to be enriched in housekeeping functions and clustered in protected areas of chromosomes, where they are conserved by stronger efficiency of purifying selection [ 15 ]. Conversely, accessory genes are more likely to encode lineage-specific proteins that facilitate environmental adaptation [ 16 ], such as genes involved in secondary metabolism and niche specificity and tend to be concentrated in highly variable, rapidly evolving subtelomeric regions and adjacent to transposable elements (TEs) [ 17 – 20 ]. Variation in the accessory genome is essential to facilitating diversifying selection and adaptation to population-specific environmental pressures and may be particularly important for fungal pathogen adaptation [ 15 , 21 ]. Aspergillus fumigatus is one of the most common etiological agents of human fungal disease [ 1 , 2 ]. The spectrum of diseases attributed to A. fumigatus is remarkable. In immunocompromised patients, invasive A. fumigatus infection causes up to 90% mortality, even with aggressive treatment [ 3 ], an outcome that is further complicated by the increasing presence of triazole-resistant strains [ 4 ]. Phenotypic heterogeneity in growth and virulence is well documented in A. fumigatus [ 5 – 9 ] and thought to be partially structured by genetic variation between strains [ 10 ]. These intraspecific genetic differences likely represent both gene variants (insertions/deletions and single nucleotide polymorphisms) [ 11 ] and differences in gene presence–absence, copy number, and structural arrangements [ 12 , 13 ]. While population-genomic analyses based on reference genome alignment have provided valuable insights into how gene variants are distributed across populations, these approaches fail to capture intraspecific variation in genomic regions absent from the reference genome. In contrast to reference-based genomics, pan-genomic analyses based on de novo assemblies offer the potential to address the full genetic repertoire of a species. For A. fumigatus pathogenesis and virulence, identification of strain-specific genes and alleles is expected to further clarify the role of fungal genetic variation in disease outcomes. Methods DNA preparation All strains sequenced in this project were isolated on Aspergillus minimal medium [46] by picking single germinated spores after 16 h of growth at 30°C. Cultures were grown for DNA extraction in liquid minimal media with 1% (w/v) D-Glucose, 0.5% yeast extract (w/v, Beckton-Dickinson), 20 ml 50× salt solution, 1 ml trace elements solution, 20 mM NaNO 3 , pH adjusted to 6.5 using NaOH, and autoclaved for 20 min at 121°C. Genome sequencing After 24 h of growth, cultures were lyophilized for approximately 10 h and homogenized using a bead beater (1 min with 2.3-mm beads). DNA was extracted using a LETS buffer protocol [47] modified with the addition of a 1-h RNAse treatment prior to phenol–chloroform extraction. DNA concentration was quantified using a Qubit 2.0 Fluorometer (Invitrogen) with the Broad Range protocol. Genomic sequencing was carried out on either Illumina NovaSeq 6000 or NextSeq 500 machines. DNA sequencing libraries were prepared using either the NEBNext Ultra II DNA Library Prep Kit (for NovaSeq 6000 sequenced genomes) or the SeqOnce DNA library kit utilizing Covaris mechanized shearing (for the NextSeq 500 sequenced isolates), both following manufacturer recommendations with paired end library construction and barcoding for multiplexing. All genome sequencing data generated for this project was deposited into the NCBI Sequence Read Archive under BioProject no. PRJNA666940. Strain selection In addition to the 62 strains newly sequenced for this study, a large library of strains (approximately 220) previously published by our lab and others were downloaded from NCBI’s SRA. These strains were initially assessed for genome completeness after de novo assembly (see De-Novo Assembly and Annotation methods below), using BUSCO v4.0.5 [48] and coverage (read depth) using BBTools (https://jgi.doe.gov/data-and-tools/bbtools/) as well as phylogenetic diversity (visual inspection of redundant strains likely to be clonal—see Phylogenomics below for tree building methods). We then excluded strains with an average read depth <10×, or BUSCO complete scores <96, or with negligible branch lengths. Out of the approximately 220 strains initially downloaded, 197 strains were retained for analysis (S1 Table). These were combined with the 62 newly sequenced strains and the re-sequenced reference strain AF293, which was included as a control for variant analysis conducted relative to the Af293 v.55 curated reference downloaded from FungiDB. This resulted in a total of 260 quality-filtered strains. De-novo assembly and annotation To minimize methodological errors introduced by analyzing genomes assembled and annotated using different methods [49], we ensured uniformity by assembling and annotating all 260 strains using a custom pipeline that starts with raw reads, regardless of whether assembly and annotation data were available for strains with previously released genomes. Genomes were assembled de novo with the Automatic Assembly For The Fungi (AAFTF) v. 0.2.3 (https://github.com/stajichlab/AAFTF, DOI: 10.5281/zenodo.1620526). The AAFTF filter step trim and quality filter reads with BBmap (https://sourceforge.net/projects/bbmap) and the AAFTF assemble step was used to assemble the reads with SPAdes [50]. Resulting contigs were screened for bacterial contamination using the AAFTF sourpurge step that relies on sourmash [51] searching a database of Genbank microbial sequence sketches (v.lca-mark2; https://osf.io/vk4fa/). Duplicates were removed with AAFTF rmdup by aligning contigs to themselves with minimap2 v. 2.17 [52], and contigs were further polished for accuracy using the AAFTF polish step that relies on Pilon v. 1.22 [53] and BWA v. 0.7.17 [54] to align raw reads to the contigs and polish a consensus sequence. Scaffolds from contigs were inferred by aligning to the reference Af293 genome using ragtag v. 1.0.0 [55]. Scaffolding to Af293 (currently the most complete reference genome for A. fumigatus) was carried out to improve scaffold length, but contigs failing to map to the reference were retained. Repeat regions were masked by funannotate mask (https://github.com/nextgenusfs/funannotate, DOI: 10.5281/zenodo.1134477) using RepeatMasker v. 4-1-1 [56] with repeats built using RepeatModeler [57] compiled into a custom library (available in this the project’s GitHub) plus those fungal repeat families curated in RepBase v. 20170127 [58]. The train command in funannotate was used to run Trinity [59] for transcript assembly and alignments to generate highly polished gene models based on splice-site aware alignment of the assembled RNA-Seq of A. fumigatus growth on sugarcane bagasse (PRJNA376829) [60]. Models were further refined by PASA v. 2.3.3, and the best set with full open reading frames was chosen for input in training gene predictors [61]. Gene prediction was carried out using funannotate predict running Augustus v. 3.3.3 [62] and SNAP v. 2013-11-29 [63] by first training on the evidenced-based training models from the train step. Gene prediction was then run with these trained parameters using exon evidence based on RNA-seq and protein alignments generated by DIAMOND v. 2.0.2 [64] and polished with Exonerate v. 2.4.0 [65] on the RNA-seq and Swissprot proteins [66]. Additional ab initio models were predicted using GeneMark v. 4.59 [67], GlimmerHMM v. 3.0.4 [68], and CodingQuarry v. 2.0 [69], the latter also used the raw RNA-seq evidence for exon prediction directly. A set of consensus gene models were produced from these predictions with EVidenceModeler v. 1.1.1 [70]. The tool funannotate annotate was run to annotate protein domains and make functional predictions using sequence similarity with the databases InterProScan v. 5.45–80.0 [71], eggNOG v. 1.0.3 [72], dbCAN2 v. 9.0 [73], UniprotDB v. 2020_04, antiSMASH v. 5.1.2 [74], and MEROPS v. 12.0 [75]. Pan-genomics We used 2 methods to determine pan-genome gene family clusters: OrthoFinder v. 2.5.2 [76] and Pangenome Iterative Refinement and Threshold Evaluation (PIRATE) v. 1.0.4 [77]. The decision to assess the A. fumigatus pan-genome twice using 2 different methods was made to confirm the reproducibility of gene family counts after initial assessment yielded a surprisingly high number of orthogroups. OrthoFinder was run on protein fastA files with -op option to run similarity searches as parallel jobs on the HPC and the -S diamond_ultra_sens option for sequence search using DIAMOND. PIRATE was run on nucleotide fastA files, with parameters: -s “85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100” -k “—cd-low 100 -e 1E-9—hsp-prop 0.5” -a -r. Pan-genome clustering results were analyzed in the R programing environment using custom scripts (all scripts are available at the DOI listed in the data availability section). For our analysis, “core” genes were defined as present in >95% of the strains (n > 247), “accessory genes” were defined as present in more than 1, and less than 95% of the strains (n > 1 and < = 247) and singletons were defined as present in only a single isolate. Singletons and accessory gene families, collectively referred to here as “dispensable genes,” were analyzed separately to account for the possibility that singletons were more error prone or less likely to contain functional information. Significant differences in the abundance of accessory and singleton gene families per clade were assessed using a permutation test implemented in the R package perm [78] over 9,999 permutations to account for violation of variance assumptions. Gene family accumulation curves were calculated using the specaccum() function over 100 iterations in R package vegan v. 2.5.7 [79]. Analysis of clade-specific gene family absence was defined as a gene family absent in all isolates of that clade, but present in both of the other clades and in >90% of the isolates from at least one of those clades. Additional screening for clade-defining gene family absences was also conducted, defined as absent in all isolates of that clade, but present in >95% of all isolates from both of the other clades. Secreted proteins were predicted using the programs Signal P5 [80] and Phobius [81]. Phobius predictions were further refined by both the presence of a signal peptide and the absence of transmembrane domains. Significant differences in proportions of secreted gene families relative to all gene families were assessed using pairwise proportion tests with Bonferroni adjustment for multiple comparisons using function pairwise.prop.test() in the R stats package at p < 0.05. Functional analysis Gene Ontology (GO) enrichment analysis was performed by assigning Interpro and GO annotations to the longest representative of each gene family clustered in OrthoFinder. GO enrichment analysis was then performed for each clade, on all gene families unique to that clade using the R package topGO v. 2.42.0 with the functions new() and runTest() with parameters weight01 and nodeSize = 6 (DOI: 10.18129/B9.bioc.topGO). Significant enrichment was assessed for each gene count category (core, accessory, singleton) and for clade-specific gene families relative to the set of all InterPro annotations with associated GO terms in the A. fumigatus pan-genome (n = 32,857), using a Fishers Exact test at p < 0.05. CAZyme (Carbohydrate-Active enZyme) annotations were assigned in funannotate that relies on hmmsearch to search the dbCAN database for HMM profiles [73]. Homogeneity of variance assumptions were checked using the leveneTest() function from the car package v. 3.0.11 and normalcy checked using the shapiro.test() function from the stats package v. 4.0.1 in R. Because variance and normalcy assumptions could not be met, nonparametric Kruskal–Wallis tests were employed using the kruskal.test() function in the R stats package v. 4.0.1 with Bonferroni adjustment for multiple comparisons using the p.adjust() function and considered for further evaluation at p < 0.001. Post-hoc comparisons using Dunn’s test were implemented using the dunnTest() function with Bonferroni adjustment for multiple comparisons in the R package FSA v. 0.9.0. CAZymes significantly different between the 3 clades were normalized per CAZyme family on a 0 to 1 scale for visualization and mapped onto the phylogeny using the gheatmap() function in the R package ggtree v.3.1.2. To validate our OrthoFinder clustered gene families, we represented each gene family with the longest sequence and used BLASTP v. 2.12.0 (e-val < 1e-15) to match gene families to their respective genes annotated in the Af293 reference genome. To further investigate the presence–absence distribution of accessory genes with a role in nitrogen, carbohydrate, and phosphorus metabolism, we extracted all Af293 genes from FungiDB with GO annotations that matched organonitrogen compound metabolic process (GO: 1901564, n = 1,775 genes), carbohydrate metabolic process (GO: 0005975, n = 467 genes), or organophosphate biosynthetic process (GO: 0090407, n = 215 genes). We then examined gene families with orthologues in Af293 to test for presence–absence variation in these genes between the 3 clades. Significant differences in gene abundance between the 3 clades were evaluated using the same strategy outlined above to evaluate CAZyme abundance. Spatial enrichment To identify the location of core accessory and singleton gene families relative to telomeres, we calculated the enrichment of each gene family abundance category over 50 kb windows First,.gff files were converted to bed format and filtered to consider only contigs >50 kb in length and containing telomeric repeats (TAAC or the reverse complement) as developed by [82]. We then calculated the number of gene families contained either within or outside 50 kb of telomeric repeats using PyBedTools [83]. To test for enrichment of gene family abundance within 50 kb of telomeres, we checked normalcy assumptions using the shapiro.test() function in the stats package v. 4.0.1 in R, and as normalcy assumptions could not be met, used a 2-sided Wilcoxon rank sum test implemented in the stats package to test for significance. Population genomics The Af293 reference genome was downloaded from FungiDB (v.46) [84]. Sequence reads for each strain were aligned to Af293 using BWA v. 0.7.17 and the alignment file processed with samtools v. 1.10 [85], applying the fixmate and sort commands to convert files to the BAM format. Duplicate reads were removed, and reads were flagged using MarkDuplicates and indexed with Build BamIndex in picard tools v.2.18.3 (http://broadinstitute.github.io/picard). Variants (SNPs) were called relative to Af293 using HaplotypeCaller in GATK v.4.0 [86] with filtering accomplished using GATK’s VariantFiltration, with parameters -window-size = 10, -QualByDept <2.0, -MapQual <40.0, -Qscore <100, -MapQualityRankSum <-12.5, -StrandOddsRatio > 3.0, -FisherStrandBias >60.0, -ReadPosRankSum <-8.0. Only variants passing these filers were retained using the SelectVariants tool in GATK. Variants overlapping TEs were further excluded from the variant pool by testing for overlap with the TE locations identified in the FungiDB v. 46 release of the Af293 reference genome, using bedtools subtract [87]. Variants were annotated with snpEff [88] based on the GFF annotation for Af293 v. 46 from FungiDB. Population structure Prior to analysis, the presence of clones in the dataset was assessed using the R package poppr v.2.9.0 [89] with the functions mlg() and clonecorrect(). No clones were detected. Initially, we assessed broad-scale population structure in A. fumigatus using the Bayesian clustering approach STRUCTURE implemented in fastStructure v. 1.0 [90] across 361,717 polymorphic sites (single nucleotide polymorphisms). VCF files were first converted into the plink format using PLINK v. 1.90b3.38 [91] before running fastSTRUCTURE using the simple prior with K values ranging from 1 to 15 over 30 independent iterations with specified seed values. For each independent iteration, marginal likelihood values of K were obtained by employing the chooseK.py function in fastSTRUCTURE and assessed for significant increases in marginal likelihood using ANOVA and the R package multcomp [92] with post hoc testing using Tukey tests and Bonferroni correction for multiple comparisons. To further assess population structure, we used discriminate analysis of principle components (DAPC) [93] and subsequent clade mapping onto the phylogeny. DAPC was implemented in the R package adegnet v. 2.1.3 [94] on a random subset of 100 K polymorphic SNP sites. The optimal number of groups (K) was identified based on Bayesian information criterion (BIC) score using the find.clusters() function in adegnet, evaluating a possible range of 1 to 15 clusters to identify the elbow of the BIC curve following [93]. To avoid overfitting, the optimal number of PCs retained in the DAPC was chosen using the optim.a.score() function and determined to be PC = 3 out of a possible PC range of 1 to 200. Evaluation of population substructure was carried out by iteratively running DAPC and fastSTRUCTURE on the 3 primary clades identified above. To do this, clade-specific VCF files were subset using VCFtools—keep [95], and invariant sites were removed using the bcftools -view command [96], before running the pipeline as above. To further investigate the history of gene flow in A. fumigatus, we ran TreeMix [97]. TreeMix was run on a VCF file containing all strains plus A. fischeri as the outgroup on a VCF file prepared as above except with the addition of A. fischeri. Initially, the assembly for A. fischeri strain NRRL 181 genome was downloaded from FungiDB v. 55. The assembly was then used to simulate Illumina paired-end reads with the wgsim tool in the samtools package with an error rate of 0 and to a read depth approximately equal to the average for the A. fumigatus strains, and run as above. To run TreeMix, we removed invariant sites from the VCF as above, as well as sites with high LD using Plink v.2 with the parameter—indep-pairwise 50 10 0.2. To determine the optimal number of migration edges (m), we used the program OptM [98]. First, we prevented bootstrapped TreeMix runs from having an SD = 0 by down sampling the LD-corrected VCF file to 80% of total SNP sites, randomly drawn over 10 replicates using the bcftools -view command. Subset VCF files were converted to TreeMix format and for each down sampled replicate TreeMix was run for all values of m between 1 and 10, and used as input for OptM in R. We then re-ran TreeMix at the optimal value of m = 1 using the full LD-corrected VCF file as input, with all trees rooted at A. fisheri, using A. fisheri as the outgroup and the A. fumigatus subpopulations defined by iterative DAPC analysis as described above. The fixation index (F ST ) was calculated in the R package hierfstat v.0.5.9 [99] using the dosage format to account for unequal population sizes [100] and computed with the fst.dosage() function. Pairwise F ST (the between population fixation index) was calculated in hierfstat using the fs.dosage() $Fst2x2 function. As a second measure of population differentiation, we ran analysis of molecular variance (AMOVA) using the R packages poppr and ade4 [101] with the function poppr.amova() with method = “ade4.” Significant differentiation between populations was evaluated using a Monte Carlo test with the function randtest() in poppr, over 999 iterations. To distinguish introgression from incomplete lineage sorting, we calculated Patterson’s D (ABBA-BABA) in the program Dsuite using the Dtrios tool [102]. Dtrios was run on all 260 strains, set to 3 populations, with population membership encodings determined in the previously reported DAPC analysis, plus A. fischeri as the outgroup. Significance of the deviation of the D-statistic from zero was determined using a 20 block-jackknife approach with a p-value <0.05 indicating introgression. Linkage disequilibrium To estimate linkage disequilibrium (LD) decay, the VCF SNP files (run without A. fischeri) generated above were used in conjunction with the clade assignments generated in DAPCA for K = 3 to assess LD decay using PLINK [91]. LD decay was assessed both for all isolates together and for all isolates separated by clade. Additionally, we assessed LD decay on a sample size rarified set of isolates at n = 12 averaged over 20 independent runs. To do this, the samples were rarefied using a custom BASH script, where for each clade, random draws of 12 isolates per clade were taken over 20 iterations. Sampling did not allow for the same strain to appear more than once in a single iteration, but did allow sampling with replacement between iterations. To specifically evaluate the impact of sample size on LD estimates, we also ran the analysis for n = 5, 10, 20, 50, or 100 isolates (averaged over 20 independent runs) randomly drawn without regard to population structure. For all analyses, strains were subset (when appropriate) from the VCF file as noted above, using vcftools—keep and bcftools—view before converting VCF files to PLINK format using PLINK v. 1.90b3.38. LD decay was estimated in PLINK using the squared genotypic correlation coefficient (r2) over all SNPs present in each group. We considered up to 99,999 variants per window and all r2 values to ensure fine scale resolution. Pairwise distance comparisons were limited to 500 kb (with parameters:—r2 –allow-extra-chr–ld-window-r2 0 –ld-window 99999 –ld-window-kb 500). Distance matrices were constructed in BASH and mean r2 for each distance was assessed across each iteration, and then across each group to generate LD decay curves for each group using ggplot2 [103]. To estimate half-decay values (LD50) in base pairs (BP) for each dataset, we averaged the r2 for each position across all replicates in each group and calculated the r2 mid-point as (minimum r2 + (maximum r2 –minimum r2) / 2). To obtain LD50 in BP, we then calculated the x intercept of the r2 mid-point for each clade using the approx() function in R. Phylogenomics The identified alleles for SNP positions from all isolates were used to construct a phylogeny employing the Maximum Likelihood algorithm IQ-TREE using the +ASC parameter to account for SNP-based ascertainment bias [104]. The best fit model according to BIC score was assessed using the ModelFinder function in IQ-TREE and was determined to be GTR+F+ASC that was run over 1,000 rapid bootstrap iterations. Individual branch support values were assessed using a Shimodaira–Hasegawa approximate likelihood ratio test over 1,000 iterations. Tree rooting was determined by evaluation of the SNP tree run as above but including the outgroup A. fischeri. Based on outgroup position, the root was determined to be at the basal node separating Clades 2 and 3. Tree visualization and trait mapping was carried out using the R package ggtree [105]. Distribution of mating type idiomorphs To identify MAT type for each strain, we downloaded reference CDS DNA sequences for MAT1-1 (GenBank number AY898661.1 from strain AF250) and MAT1-2 (Afu3g06170, NCBI number NC_007196.1 from strain Af293) from NCBI. We constructed databases composed of all scaffolds for each genome, and ran BLASTN [106] with the parameters -evalue. 0001 and -word_size 10, against both MAT idiomorphs. BLAST hits were formatted as fastA files using a custom BASH script and aligned using MAFFT v. 7.471 [107]. To ensure that strains with significant alignments to both MAT1-1 and MAT1-2 were not artifacts in the de novo assembled scaffolds, these strains (n = 11) were further evaluated by aligning the raw reads onto the MAT reference sequences. To accomplish this, reference sequences were indexed using Bowtie2 v. 2.3.4.1 [108] with the bowtie2-build function, and SAM files created from alignments generated with the -very-sensitive-local option. SAM files were converted to BAM using the samtools view function and sorted using the sort function. Depth profiles of the alignments were created using the samtools depth function and graphed in R using the package ggplot2. To create fastA files of the alignments, we used the bcftools mpileup function to create VCF files, indexed the VCF files using bcftools tabix, and used samtools faidx to index the reference sequences. Then, using picard v. 2.18.3, we used the CreateSequenceDictionary function to insert variant information for each strain into the reference MAT sequences and the bcftools consensus function to insert “N”s for all positions where no alignments could be made. To assess the ploidy of these 9 strains, we used whole-genome K-mer analysis on forward reads implemented in the program Jellyfish [109], and visualized using GenomeScope [110], at K = 21. To further assess the ploidy of these 9 strains, we conducted allele frequency analysis on all heterozygous SNP sites aided by the R package vcfR [111]. Identification and mapping of the MAT1-2-4 gene (Afu3g06160) was conducted using BLASTP matches to Af293 gene assignments to the orthogroup clusters and mapped onto the phylogeny as above. [END] --- [1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001890 Published and (C) by PLOS One Content appears here under this condition or license: Creative Commons - Attribution BY 4.0. via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/