(C) PLOS One This story was originally published by PLOS One and is unaltered. . . . . . . . . . . Genetic characterization of outbred Sprague Dawley rats and utility for genome-wide association studies [1] ['Alexander F. Gileta', 'Department Of Human Genetics', 'University Of Chicago', 'Chicago', 'Illinois', 'United States Of America', 'Department Of Psychiatry', 'University Of California', 'San Diego', 'California'] Date: 2022-07 Sprague Dawley ( SD ) rats are among the most widely used outbred laboratory rat populations. Despite this, the genetic characteristics of SD rats have not been clearly described, and SD rats are rarely used for experiments aimed at exploring genotype-phenotype relationships. In order to use SD rats to perform a genome-wide association study ( GWAS ), we collected behavioral data from 4,625 SD rats that were predominantly obtained from two commercial vendors, Charles River Laboratories and Harlan Sprague Dawley Inc. Using double-digest genotyping-by-sequencing ( ddGBS ), we obtained dense, high-quality genotypes at 291,438 SNPs across 4,061 rats. This genetic data allowed us to characterize the variation present in Charles River vs. Harlan SD rats. We found that the two populations are highly diverged (F ST > 0.4). Furthermore, even for rats obtained from the same vendor, there was strong population structure across breeding facilities and even between rooms at the same facility. We performed multiple separate GWAS by fitting a linear mixed model that accounted for population structure and using meta-analysis to jointly analyze all cohorts. Our study examined Pavlovian conditioned approach ( PavCA ) behavior, which assesses the propensity for rats to attribute incentive salience to reward-associated cues. We identified 46 significant associations for the various metrics used to define PavCA. The surprising degree of population structure among SD rats from different sources has important implications for their use in both genetic and non-genetic studies. Outbred Sprague Dawley rats are among the most commonly used rats for neuroscience, physiology and pharmacological research; in the year 2020, 4,188 publications contained the keyword “Sprague Dawley”. Rats identified as “Sprague Dawley” are sold by several commercial vendors, including Charles River Laboratories and Harlan Sprague Dawley Inc. (now Envigo). Despite their widespread use, little is known about the genetic diversity of SD. We genotyped more than 4,000 SD rats, which we used for a genome-wide association study ( GWAS ) and to characterize genetic differences between SD rats from Charles River Laboratories and Harlan. Our analysis revealed extensive population structure both between and within vendors. The GWAS for Pavlovian conditioned approach ( PavCA ) identified a number of genome-wide significant loci for that complex behavioral trait. Our results demonstrate that, despite sharing an identical name, SD rats that are obtained from different vendors are very different. Future studies should carefully define the exact source of SD rats being used and may exploit their genetic diversity for genetic studies of complex traits. Funding: This work was funded by the National Institute of Drug Addiction (NIDA; https://www.drugabuse.gov/ ) through R21 DA036672 (AAP; TER; SBF), P50 DA037844 (AAP; TER; SBF), and F31 DA039638-02 (AFG), as well as training grant T32 GM007197 (AFG) from the National Institute of General Medical Sciences (NIGMS; https://www.nigms.nih.gov/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Data Availability: Data files too large to be included as supplemental content were uploaded to a central online repository through the UCSD Library System: https://doi.org/10.6075/J0XS5V8K . This DOI can be followed to access (1) a VCF file containing all raw SNP calls with dosage data in the original set of 4,625 Sprague Dawley rats, (2) filtered variants in BIM/BED/FAM format for each of the seven SD subgroups, and (3) summary files for the GWAS results for all 55 metric x day combinations for all analyses, individual subgroups and meta-analyses. In addition, FASTQ files from the original ddGBS and light WGS sequencing runs were submitted to the NCBI Sequence Read Archive (SRA) under BioProject identifier PRJNA779552 ( https://www.ncbi.nlm.nih.gov/sra/PRJNA779552 ). Instructions on how to download the data can be found here: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/#download-metadata-associated-wit . Pavlovian conditioned approach (PavCA) is a rodent testing paradigm for assessing the degree of attribution of incentive salience to a reward-associated cues [ 26 ]. Similar to in humans, there are significant individual differences in the tendency of rats to attribute incentive value to reward-predictive stimuli in PavCA and multiple lines of evidence showing that the observed variation is also heritable [ 6 , 26 , 36 ]. All rats in this study were screened for performance in PavCA. Although the genetic analyses reported here were not part of the original design, our study took advantage of the opportunities afforded by that large, behavioral study. We extracted genomic DNA from available tissue samples and then applied double digest genotyping-by-sequencing (ddGBS) to obtain dense genotypes for 4,625 SD rats. We used these genotypes to genetically characterize different populations of SD. Next, we used those genotype data and behavioral phenotypic data to perform what we believe is the largest rodent GWAS ever undertaken. Our results provide insights about the population structure and suitability of SD for GWAS, as well as the genetic underpinnings of the attribution of incentive salience, an endophenotype for the fundamentally human disorder of addiction. We used SD rats from multiple vendors, breeding locations, and barrier facilities to elucidate the genetic background of SD and to perform a GWAS of a complex behavior associated with drug addiction. DNA samples and phenotypic information were obtained from rats used in multiple studies as part of an unrelated Program Project grant (P01DA031656) concerned with individual variation in the propensity to attribute incentive value to food and drug cues [ 26 , 27 ]. The ‘Incentive-Sensitization Theory’ of drug addiction posits that repeated drug use causes persistent neuroadaptations in the brain’s reward circuitry, rendering the neural system hypersensitive to the attribution of motivational value to a reward-predictive cue through Pavlovian learning [ 28 ]. The property acquired by the cue, which allows it to attract attention and reinforce appetitive behaviors, is known as “incentive salience” [ 26 ]. There is significant inter-individual variability in the degree to which salient stimuli command attentional resources in humans [ 29 – 33 ]. It is also known that certain genetic factors can increase an individual’s risk for developing drug abuse [ 34 , 35 ]. It is hypothesized that individuals who are genetically predisposed to attribute greater incentive salience to drug-associated cues are experiencing amplified craving and cue-approach, increasing their risk for developing addictive behaviors. SD rats originated in 1925 at the Sprague-Dawley Animal Company (Madison, WI), where they were created from a cross between a hooded male hybrid of unknown origin and an albino Wistar female [ 19 ]. In 1950, Charles River Inc. began to distribute SD rats commercially. In 1980, Harlan Inc. (now Envigo, Inc.) also began to distribute SD rats after their acquisition of Sprague-Dawley, Inc. [ 20 ]. In 1992, Charles River reestablished a foundation colony of SD rats, using 100 breeder pairs from various existing colonies [ 21 ]. The resulting litters were used to populate SD colonies globally and have since been bred using a mating system that minimizes inbreeding. Every 3 years, Charles River replaces 25% of their male breeders in each production colony with rats from a single foundation colony. Charles River also replaces 5% of their foundation colony breeding pairs with rats from the production colonies on a yearly basis. These practices are intended to reduce genetic drift between the production colonies [ 22 ]. Harlan also reported using a rotational breeding system to limit inbreeding [ 23 ]; however, more detailed information was not publicly available. Since Harlan’s acquisition by Envigo, the process has become more transparent [ 24 ]. Envigo follows a Poiley rotational breeding scheme [ 25 ], whereby animals are cycled through different sections of the colony with each generation, reducing genetic drift and inbreeding. Rats are among the most commonly used organisms for experimental psychology and biomedical research. Whereas research using mice makes extensive use of inbred strains, research conducted in rats typically uses commercially available outbred rats. Sprague Dawley (SD) rats are very commonly used; we identified 4,188 publications in the year 2020 that contained the key word “Sprague Dawley”. Rats identified as “Sprague Dawley” are distributed by several vendors. Each vendor has multiple breeding locations, and each breeding location has one or more rooms or “barriers facilities” which may represent population isolates. Prior studies have identified numerous physiological differences between SD rats obtained from different vendors [ 1 – 5 ]. Despite these observations, many researchers may assume that SD rats obtained from different vendors, breeding facilities or barrier facilities are largely interchangeable. There has been little research into the genetic diversity and population structure that exists among commercially available outbred rats [ 6 ]. Prior rat genetic studies have used F 2 and more complex, multi-parental crosses of inbred strains for quantitative trait loci (QTL) mapping [ 7 – 9 ] and GWAS [ 10 – 13 ]; however, we are unaware of any such studies that have employed commercially available outbred rats. We and others have demonstrated the potential benefits and challenges associated with the use of other species of commercially available outbred rodents for GWAS [ 14 – 18 ], suggesting that similar studies in one of the most frequently used outbred rat strains may also be of value. In addition to Efna5, another late-stage PavCA training day association was identified on chromosome 1 and was linked to the average latency to a lever contact ( S4 File ). The associated SNP falls within a 2.2Mb peak containing 4 genes (Park2, Parcg, Cahm, & Qk) and is located in an intron of Park2, a large 1.4Mb gene. Park2 rat knockout lines show reduced expression of dopaminergic receptors, such as trace-amine receptor TAAR-1 and post-synaptic dopamine D2 receptor, in the striatum [ 51 ]. Dopaminergic signaling is paramount in the establishment of incentive salience in the Pavlovian conditioned approach paradigm [ 38 , 52 ], making this association an attractive target for further investigation. The chromosome 9 associated SNPs were within a broader peak of elevated association (< 1x10 -3 ) that spanned approximately 3Mb. Although this association is of uncertain significance given the number of tests performed in our study, and gene nearest the most associated SNP is not necessarily the causal gene, Efna5 is the only RefSeq annotated gene within this 3Mb region, located ~500kb from the genome-wide significant SNPs. The gene Efna5 codes for an ephrin-A ligand for EphA receptor tyrosine kinases and is expressed in the brain. Ephrin-A5 has long been appreciated for its role in guidance of retinal, cortical, and hippocampal axons during development [ 48 – 50 ], synaptic plasticity of the mature hippocampus, and long-term potentiation [ 49 ]. The titles above each of the eight Manhattan plots indicate the population, number of SNPs, and sample size for each depicted GWAS. The y-axis is the -log 10 of the p-value from the likelihood-ratio test performed by GEMMA. The x-axis is the genomic coordinate of the variants within each chromosome, in ascending order of chromosome. One noteworthy finding was a locus identified in both the Charles River-specific and all-subgroup meta-analyses. The two significant SNPs are located on chromosome 9 and are associated with the training day 5 PavCA index score ( Fig 4 ), which measures a rat’s overall propensity of a rat to attribute motivational value to a conditioned stimulus, and the day 5 probability difference, which directly measures the observed skew in responding to a conditioned stimulus vs. a reward delivery cup ( S4 File , page 35). Interestingly, the directions of effect for the associated alleles were contrasting between the two vendors. In Charles River subgroups, the associated allele had a negative effect on the trait, whereas in Harlan subgroups, the effect was positive. This observation emphasizes the necessity of using a meta-analysis approach that accounts for heterogenous allelic effects, as the association likely would not have replicated in the broader subgroup meta-analysis with standard meta-analysis approaches. Presumably the effects are opposite because the relationship between the SNP and the causal allele is reversed in SD rats from Harlan relative to Charles River. To increase our power and improve our ability to identify true signals of association, we meta-analyzed the results for all seven subgroups together, as well as in smaller sets of subgroups derived from either Charles River or Harlan. The meta-analyses were run with MR-MEGA [ 47 ], a program originally designed to accomplish trans-ethnic meta-analyses in populations with potentially heterogeneous allelic effects, which is conceptually similar to this dataset. Genome-wide significant results from the meta-analyses are presented in S1 File , Manhattan plots in S4 File , and Q-Q plots in S5 File . There were five genome-wide significant loci identified in the Harlan-specific meta-analyses, two in the Charles River-specific meta-analyses, and five in the Harlan plus Charles River meta-analyses. Due to the sparsity of SNP genotypes assayed by the ddGBS approach, these loci are generally composed of only a handful of SNPs and their boundaries are not well defined. Notably, of the seven loci identified in the vendor-specific meta-analyses, three were replicated in the Harlan plus Charles River meta-analyses. The four loci that did not replicate were composed of SNPs that were unique to either Charles River or Harlan and therefore could not have been replicated in the Harlan plus Charles River meta-analyses. For each of the seven subgroups, all 11 PavCA metrics were examined for all five days of training. Although PavCA data are often analyzed such that rats are categorized as sign-trackers, goal-trackers, and intermediate responders based on their PavCA performance on days 4 and 5 [ 37 ], we chose to analyze behavior on all 5 days since those results could yield insights into loci involved in learning this complex behavior. Significance thresholds were determined using MultiTrans, a parametric bootstrapping resampling approach [ 45 ]. Genome-wide significant results from the individual subgroup GWAS are available in S1 File , and the Manhattan plots for each analysis can be viewed in S2 File . The sample sizes for the subgroup analyses were small and the GWAS yielded a marginal number of genome-wide significant results. In total, we discovered six independent genome-wide significant hits across five of the seven subgroups, representing PavCA metrics from days 3, 4, and 5 of training. All six significant hits were unique to the subgroup they were identified in. This could be due to incomplete power, highly variable allele frequencies between the groups, or the existence of unique modifier loci. Alternatively, these hits may be artifactual and occurred by chance due to the number of analyses that were run. Importantly though, we did not observe any significant inflation in the Q-Q plots for the GWAS in these 7 subgroups ( S3 File ). To further assess how well we controlled population structure, we used LD score regression [ 46 ]. We found that the average deviation from the expected intercept was only 0.027 for the subgroups ( S6 Table ), indicating that dividing the samples into seven subgroups was sufficient to control population structure. Next, we performed GWAS for various metrics derived from Pavlovian conditioned approach by using GEMMA to fit a linear mixed model that allowed us to account for population structure with a relatedness matrix. Initially, we planned on performing the GWAS separately for rats from Charles River vs. Harlan to maximize sample size; however, we observed marked inflation in our Q-Q plots for a number of PavCA metrics ( S5 Fig ). This pattern of inflation could reflect true signal that is amplified by the large LD-blocks characteristic of small breeding populations==if a locus shows significant association with a trait, all SNPs co-occurring in the same LD-block will have inflated test statistics, thereby artificially enriching the Q-Q plot with mid-range p-values. However, inflation in a Q-Q plot might also reflect a failure to account for population structure. Given the degree of the observed inflation, we suspected that the LMM might have been insufficient to account for the residual subgroup structure within each vendor. Therefore, we chose to perform individual GWAS on each subgroup followed by meta-analysis of the results. Although evidence from selective breeding studies has suggested that behavior in the Pavlovian conditioned approach procedure is heritable [ 36 ], we are not aware of any specific heritability estimates for PavCA. We used GCTA to calculate the proportion of the variance in the base and composite PavCA metrics that could be explained by the set of SNPs unique to each subgroup and the within-vendor SNP sets ( S5 Table ). Due to the low subgroup sample sizes, the heritability estimates generally only reached significance when considering the vendor as a whole. The SNP heritability estimates for all PavCA metrics in Harlan ranged from ~4–11%, whereas they were ~4–21% for Charles River; on average the estimated heritability was about ~1.9-fold greater in Charles River. Importantly, some of the highest heritabilities were for metrics used to designate sign-trackers vs. goal-trackers, such as the average of day 4 and day 5 response bias, probability difference, and PavCA index score. However, even the heritability estimates from Charles River were lower than SNP heritability estimates for many other behavioral traits [ 12 , 15 , 16 , 44 ]. We speculated that some of the rats in our sample might also share close genetic relationships with one another. We used plink 1.9 [ 42 , 43 ] to estimate the pairwise proportions of identity-by-descent (IBD; S4A and S4C Fig ), which showed that while most rats were only distantly related, a subset shared closer familial relationships. We removed several rats that showed high levels of relatedness with many other samples (presumably due to technical error), as well as any with unreasonably high levels of IBD ( S4 Table and S4B and S4D Fig ). The fixation index (F ST ) is a statistic widely employed by population geneticists to measure the level of structure in populations [ 40 ]. It is calculated using the variance in allele frequencies among populations; values closer to 0 indicate genetic homogeneity, and values closer to 1 indicate genetic differentiation. We calculated the pairwise F ST between all seven subgroups ( Table 1 ). F ST values between vendors were ~0.423, which is over 2-fold greater than corresponding estimates for major human lineages [ 41 ], whereas the values for different subgroups within a vendor were substantially lower. We observed a large difference in the MAF distributions for subgroups of Charles River vs Harlan origin ( Fig 3A ), with Charles River rats harboring more variation and a far greater proportion of SNPs with high MAF (>0.05). This observation could reflect the fact that Charles River has adhered to their International Genetics Standardization Protocol for 25+ years, whereas Harlan appears to have focused on maintaining diversity within breeding colonies and may have allowed for a moderate degree of drift between them. Because our ddGBS approach focused on the subset of SNPs that are near to particular restriction sites, we are not able to accurately quantify the total number of variants nor nucleotide diversity in these populations [ 39 ]. We went on to examine the levels of linkage disequilibrium (LD) in the subgroups by constructing LD decay curves ( Fig 3B ). These curves show the rate at which LD between two genetic loci dissipates as a function of the distance between the loci. Harlan rats had more extensive LD compared to Charles River. For contrast, we included the decay curve for Swiss Webster (CFW) mice, a commercially available outbred mouse population that has been successfully used for GWAS in the past [ 15 , 16 ]. Due to the stark clustering of samples within and across certain barriers ( Fig 2C and 2D ), we opted to divide the sample set up based on PCA clustering for all subsequent analyses, including the GWAS for PavCA. This heuristic sample division resulted in seven subgroups, each containing genetically-similar samples: SD rats originating from Harlan 202A/202C/208A, Harlan 206, Harlan 217, Charles River R09/P03/P07/P10/C71/K92, Charles River R04, Charles River P09, and Charles River C72. Variant filtering steps were reapplied to each subgroup individually to reach final sets of SNPs ( S3 Table ). A more lenient MAF threshold was applied to permit more variants to pass filtering. (A) Map of the nine vendor breeding locations and the number of SD rats obtained from each location. Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL. (B) A summary of the genetic data from all 4,061 SD rats based on principal components 1 and 2 from PCA. Each point represents a sample. The left cluster is composed of samples from Charles River and the right clusters are composed of samples from Harlan. (C-D) Repeated PCA analyses on subsets of the samples from Harlan and Charles River, colored by barrier facility of origin. To further investigate potential genetic divergence between Charles River and Harlan, we performed principal component analysis (PCA) on a set of 4,502 LD-pruned (r 2 < 0.5) SNPs with minor allele frequency (MAF) > 0.05 across both vendor populations ( Fig 2B ). The first PC corresponded to vendor (Charles River or Harlan) and accounted for ~33.7% of the variance. The second PC accounted for just ~0.9% of the variance and reflected population structure within Harlan SD rats. To investigate the within-vendor structure further ( Fig 2A ), we performed PCA on the samples from each vendor separately using the same set of SNPs. Fig 2C and 2D show evidence of substructure at both the level of breeding location (i.e. the city) and barrier facility (i.e., the segregated breeding areas within the building). Interestingly, rats from some barrier facilities showed greater divergence from barrier facilities within the same breeding location than barriers at other locations. After combining the two SNP sets, we identified a total of 234,887 unique, high-quality, bi-allelic SNPs. Using 381 samples with two replicates of library preparation and sequencing, we were able to evaluate our genotyping accuracy by looking at the discordance between calls made in the replicates. The discordance rate between replicates was calculated to be ~1.7%. However, since discordance putatively reflects random errors in either replicate, it is expected to be approximately twice the true error rate, suggesting that our genotypes were >99% accurate (0.85% error rate; although we cannot exclude the possibility that certain errors were consistent across samples). Variant calling was performed on the full sample of genotyped SD rats. Subsequent quality and minor allele frequency filtering of the variants was initially completed for each vendor separately, under the notion that the observed differences in PavCA performance between vendors may reflect underlying genetic variation. We identified more single nucleotide polymorphisms (SNPs) for the rats from Charles River compared to Harlan (214,309 vs 114,568). Fig 1C compares the distribution of SNPs across each chromosome for each vendor. Fewer SNPs discovered in Harlan resulted in a larger number of ‘SNP deserts,’ which can be seen by the increase in the number and breadth of the troughs in Harlan’s distribution as compared to Charles River (e.g., the end of chromosome 11, beginning of 13, and middle of 14). The observed jaggedness of the distribution is likely due in part to the use of ddGBS ( S1 Text ), which only samples loci adjacent to specific restriction cut sites. However, the peaks and troughs may also represent meaningful variability in the density of polymorphism across the genome and regions of identity-by-descent due to population bottlenecks, or the technical limitations of assaying low-complexity regions via ddGBS and Illumina sequencing. (A) Samples were classified as sign-tracker, goal-tracker, or intermediate responder based on their average day 4 and 5 PavCA index score. PavCA index scores for each day of training (1–5) were averaged across all sign-trackers, goal-trackers, and intermediate responders. Bars around the points represent the standard deviation of the averages. (B) Density curves of average day 4 and 5 PavCA index score in Harlan (n = 2,281) vs. Charles River (n = 1,780) SD rats. (C) Filtered SNP density per 1Mb window in Charles River vs. Harlan samples for all 20 autosomes. Charles River and Harlan rats had significantly different average PavCA index score distributions ( Fig 1B ; two-sample Kolmogorov-Smirnov test, p-value < 2.2x10 -16 ), with Charles River rats biased towards goal-tracking and Harlan rats biased towards sign-tracking. We divided the samples further into the breeding locations that the rats originated from to investigate intra-vendor behavioral variance ( S1 Table ). The differences in PavCA index score between breeding locations within vendor were smaller, but still significant ( S3 Fig ) [ 6 ]. The final quality controlled dataset consisted of 4,061 genotyped male SD rats that were also phenotyped for Pavlovian conditioned approach; 2,281 from Harlan and 1,780 from Charles River, from 5 and 4 different breeding locations, respectively. As noted previously [ 37 ], we found that the metrics used to describe performance in PavCA are highly correlated ( S1 Fig ). Additionally, several of the base and composite PavCA metrics had tail-heavy distributions due to biased responding in sign- and goal-tracking from the animals during the testing time periods ( S2 Fig ). For this reason, we chose to quantile normalize all measurements prior to mapping. The PavCA index score [ 37 ] is the composite measure used to categorize rats into sign-trackers, goal-trackers, and intermediate responders based on their performance in the paradigm. It has previously been shown that the behavior, and therefore the PavCA index score for a given animal, will stabilize after 4–5 days of training, allowing rats to be binned into these groups based on their performance [ 38 ]. We observed the previously reported divergence and stabilization for sign-trackers and goal-trackers over the five days of testing ( Fig 1A ) [ 37 ]. Our study is the first to use SD rats for GWAS of a complex trait and is the largest rodent GWAS ever undertaken. We have exposed extensive population structure among SD rats that has important implications for genetic and non-genetic uses of SD rats. Although our GWAS was hampered by population structure, our results have identified several genome-wide significant loci that provide the first insights into the genetic basis for PavCA, a trait that is relevant to understating motivational processes, especially in the context of substance abuse. Our data also suggest that Charles River SD rats have greater genetic diversity, finer scale LD and more favorable MAFs than those from Harlan. More important perhaps is that our ddGBS approach was able to accurately genotype 291,438 SNPs from an in silico approximated 3.31% of the genome [ 65 ]. If we extrapolate this to the whole genome, it would imply that over 8.8 million SNPs may exist within Sprague Dawleys from these two vendors. This estimate exceeds the empirically observed 7.2 million SNPs in the rat heterogeneous stock and the 3.6 million in the HXB/BXH recombinant inbred panel and is on par with the 9.18 million SNPs present across over 40 rat inbred lines [ 12 , 66 , 67 ]. Further validation of this claim will require much denser genotypes of SD rats with deep, whole-genome sequencing; however, with the appropriate study design, this high level of genetic diversity make the SDan attractive choice for future GWAS using outbred, commercially available rats. Regardless of whether differences in SD rats obtained from different vendors are due to genetic or environmental differences, our results demonstrate the need for much greater care in the use of SD rats. Even when the primary research question is non-genetic, researchers should carefully consider whether to use rats from a single vendor and/or breeding facility. Ignoring the genetic background of test subjects can easily confound experimental results. The exact sources (vendor, breeding facility, barrier within each breeding facility) should also be carefully documented and clearly reporting in the results section. Failure to do so can lead to problems with replication, since observations made in SD rats from one source may not extend to SD rats from other sources. In addition to non-replication, a more subtle consequence of using SD rats from multiple vendors or breeding facilities is that spurious correlations can occur. For example, if SD rats from Charles River were higher for traits A and B, compared to SD rats from Harlan, a heterogeneous cohort of SD rats from both vendors would show a significant positive correlation between traits A and B. Such a correlation may not be due to any shared biological mechanisms but could instead be the result of either genetic population structure or environmental differences between SD rats from the two vendors. Thus, even studies that are not intended to address genetics questions could lead to incorrect inferences because of genetic differences among SD rats obtained from different sources. Because our project used samples from an ongoing, non-genetically focused project, we inherited a design that used Sprague Dawley from many different sources. The full impact of this was not clear until we completed genotyping the rats. The population structure that we encountered was similar to human genetic studies that include individuals from multiple ancestries, and we sought to address it by analyzing each group separately (with an LMM) and then combining the results using meta-analysis (e.g. [ 63 ]). In humans, different ancestry groups (e.g. East Asian vs. European) often do not share the same causal variants. Similarly, our power was likely reduced because causal variants were not shared between the subgroups. Modifiers of causal variants might also be dissimilar between the various subgroups, further hindering our meta-analyses [ 64 ]. Although the majority of identified variants do not have clear functional implications, it is possible that these loci are tightly linked to variants that influence the expression or function of nearby protein-coding genes. Among the 46 significant associations from the full sample set meta-analyses ( S1 File ), we have highlighted two loci. The first locus was discovered on chromosome 9 and was associated with the PavCA index score on the final day of testing, when the rats’ behaviors have typically plateaued ( Fig 1A ). The only gene within the associated genomic interval was Efna5, which codes for an ephrin-A ligand for EphA receptor tyrosine kinases and has known ties learning and memory through its effects on hippocampal development and synaptic plasticity [ 48 – 50 ]. The second was an intronic variant in Park2. In rats, knocking out this gene has been shown to reduce expression of dopaminergic receptors, notably TAAR-1, a post-synaptic dopamine D2 receptor. Seeing as dopamine is essential for the attribution incentive salience to reward-associated cues [ 38 , 52 , 62 ], this result is an enticing one for future follow-up studies. Because ddGBS is inexpensive, we were able to genotype thousands of SD rats. ddGBS relies on the large LD blocks that are found in commercially-available outbred rodent populations ( Fig 3B ). Due to the uneven distribution of SNPs obtained from ddGBS ( Fig 1C ), it was sometimes difficult to exactly localize the loci we discovered. Thus, we identified loci that were typically >1 Mb, which are consistent with the linkage disequilibrium analyses presented in Fig 3B . One question that would be important for designing future GWAS using SD rats is the number of markers that are needed to perform GWAS. While we would always advocate for methods that obtain as many SNPs as possible, our data allow us to estimate the minimum number of markers needed. First, we considered the number of SNPs remaining after pruning at various pairwise r 2 thresholds ( S6 Fig ) along with the size of the rat genome (~2.87Gb). Assuming loci are considered unlinked at an r 2 < 0.2, we estimate that a minimum of 11,500 SNPs would be needed for Charles River and 7,000 for Harlan (MAF > 0.01). This is an underestimate because having only unlinked markers would provide poor detection and resolution of QTLs, and because SNPs obtained are seldom uniformly distributed. Therefore, we recommend future studies to obtain as many markers as possible, with 100K to 1M or more SNPs providing good insurance that even difficult regions will be adequately covered. As this was the first genome-wide association study to examine PavCA, we had to choose from among many possible ways of summarizing the phenotype. Ultimately, we decided that the analyses should be comprehensive. Therefore, we ran GWAS for all possible quantile normalized PavCA component metrics since it was unclear which had the most predictive power with regard to the overarching behavioral paradigm. We also considered applying a case-control approach to a number of the metrics, where sign- and goal-trackers would be the conditions; however, the artificial subdivision of a continuous trait did not prove productive. In the individual subgroup GWAS, which had limited power due to sample size, we identified only six genome-wide significant loci (alpha = 0.05) despite performing 385 total tests (5 training days x 11 PavCA metrics x 7 groups). While 385 tests clearly creates a large multiple testing problem, the primary goal of performing these component GWAS was to use them for a meta-analysis of the 7 groups, meaning that 55 meta-analyses were the primary focus of our study. If we applied a Bonferroni correction for the 55 meta-analyses, none of these results would reach significance. However, a Bonferroni correction is overly conservative because of the strong correlations between the 55 tested PavCA metrics ( S1 Fig ). Interestingly, despite these strong correlations, some of the loci we detected were not found for any other PavCA metric or training day. Furthermore, nearly half of the QTL identified in the subgroup and meta-analysis GWAS occurred on earlier training days (days 1–3), while the literature surrounding PavCA focuses on behavior after training (days 4, 5). Fortunately, we did observe replication of several associations in the meta-analyses across multiple PavCA metrics ( S1 File ). Using genome-wide genotype data, we have provided the first quantitative estimates of the SNP heritability of the component measurements of PavCA. The highest heritabilities (16–20% in Charles River) were seen for measures typically used to assess the propensity to attribute incentive salience to reward cues, including the averages of the day 4 and day 5 PavCA index score, response bias, and latency score. Previous work had shown that SD rats selectively bred for ~15 generations for high or low responses to a novel environment were also highly divergent for behaviors in PavCA [ 36 ]. Those selection studies demonstrate that SD rats have alleles that influence PavCA. The present results are consistent with this conclusion. We obtained lower heritability estimates for SD rats from Harlan compared to Charles River, further emphasizing the differences between SD rats from the two vendors. Although this study is the first to carefully document population structure within SD rats, it is not the first to highlight phenotypic differences among SD from different vendors. In 1973, Prejean et al. reported that the incidence of endocrine tumors varied among SD rats from different vendors [ 1 ]. Then Clark et al. (1991) [ 58 ] reported differences in noradrenergic neural projections among SD rats from different vendors. Subsequently, Turnbull & Rivier (1999) [ 4 ] reported vendor-specific differences in the response to inflammatory stimuli. Then Fuller et al. (2001) [ 59 ] reported vendor-specific differences in hypoxic response among SD rats. Even more recently, there have been additional publications reporting differences for a variety of traits among SD rats obtained from different vendors [ 2 , 5 , 60 , 61 ], and even suggesting that these phenotypic differences may extend to differences among a single vendor’s breeding facility [ 3 ]. Our own studies have previously reported both behavioral and genetic differences among SD rats obtained from different sources [ 6 ], observations that are much more comprehensively explored in the present dataset. Specifically, in addition to wide-spread genetic differences, we have also shown that the SD rats obtained from Harlan for these studies show a much higher proclivity to become sign-trackers compared to SD rats from Charles River. However, neither these prior publications nor the current one can differentiate between two possibilities: that the observed behavioral differences are the result of the different environment in which these animals are raised versus the genetic difference that we have clearly demonstrated. This question could be addressed by future studies in which SD rats are bred in the same facility and the offspring tested in the same manner. We found dramatic genetic differences between SD rats obtained from Harlan versus Charles River. F ST estimates show that SD rats from Harlan and Charles River are more diverged than the major human ancestry groups [ 41 ] and nearly as diverged as some subspecies of mice [ 57 ]. There was also strong evidence of population structure among the various breeding locations and barrier facilities for each vendor. We found that SD rats from both Harlan and Charles River both showed a rapid decay in LD. However, at the time this study was performed, SD from Charles River appear to have more polymorphisms and more favorable MAF and LD profiles, suggesting that they would be preferable for future GWAS. Numerous behavioral and physiological studies use outbred Sprague Dawley rats obtained from commercial breeders [ 53 ]. The vast majority of these studies are not designed with genetics in mind. While SD and other outbred rats have been used for genetic selection [ 54 – 56 ], ours is the first to explore the use of SD for GWAS. We adapted our genotyping methods to SD and used them to densely genotype more than 4,000 SD rats. These data allowed us to characterize the genetic background of SD rats and to perform GWAS for Pavlovian conditioned approach behavior. This represents the largest rodent GWAS ever undertaken, and the first performed using a commercially available outbred rat population. Methods Sprague Dawley samples Tissue samples from 5,206 male Sprague Dawley rats were obtained, predominantly from Charles River and Harlan, with a few samples from Taconic. A subset of 4,625 of these rats went on to be genotyped by ddGBS and/or WGS. After sample filtering, a final set of 4,061 genotyped SD rats were used for the population genetic and association analyses. S1 Table lists the number of samples that came from each vendor, breeding location, and barrier facility. Detailed information about these 4,061 rats is available in S6 File. Behavioral testing was performed between February 2012 and August 2015 as part of work for multiple studies [27,68–80]. All experiments were approved by The University of Michigan Committee on Use and Care of Animals (UCUCA; Approval Number PRO00006685). Housing, feeding, lighting and other relevant environmental conditions have been previously described. Following sacrifice at the University of Michigan, tissue samples were shipped to the University of Chicago; subsequent processing of those samples is described in the following sections. Pavlovian conditioned approach Pavlovian conditioned approach procedures have been thoroughly described previously [81,82] as a means to assess the tendency to attribute incentive motivational value or incentive salience to a cue that has been repeatedly paired with a noncontingent reward. Briefly, rats are placed into a testing chamber in which an illuminated lever (conditioned stimulus; CS) enters the chamber and after 8 seconds the lever-CS retracts and a food pellet (unconditioned stimulus; US) is immediately delivered into an adjacent food cup. Rats are scored for their three possible responses to the lever-CS entering the cage: approach and interact with the lever, approach and interact with the food receptacle (magazine), or make neither approach. Conditioned responses are captured during the 8-second period during which the lever-CS has entered the chamber, but before the food reward enters the magazine. The following measures are obtained in automated fashion: the number of lever contacts as measured by lever depressions, number of magazine entries as measured by infrared sensor in the food receptacle, and the latency to both during the 8-second lever-CS presentation. The rats are tested in this manner with 25 trials per session, and one session is conducted per day for 5 consecutive days. For the purposes of this project, the number of lever contacts and magazine entries are summed across all 25 trials within a given session, and the latencies are averaged across 25 trials within a session. Along with response counts and latencies, three additional measurements are recorded: 1) the proportion of trials in a session during which a rat made a lever contact (“probability” of lever press), 2) the proportion of trials during which they made a magazine entry (“probability” of magazine entry), and 3) the number of non-CS (NCS) magazine entries that occurred outside of the 8 second trials (when the cue was not present during the intertrial interval). We also calculated composite scores to categorize rats as sign-trackers (ST; defined as rats that preferentially interacted with the lever-CS), goal-trackers (GT; defined as rats that preferentially interacted with the food magazine), and intermediate responders (IR; rats that vacillated between sign- and goal-tracking behavior) [37]. These scores include: response bias ([lever presses–magazine entries]/[lever presses + magazine entries]), latency score ([average magazine entry latency–average lever press latency]/8), and probability difference ([lever press probability–magazine entry probability]). The PavCA index score is the average of the response bias, latency score, and probability difference. A value of [-1, -0.5] for the PavCA index score indicates a goal-tracker, (-0.5, 0.5) indicates an intermediate responder, and [0.5, 1] indicates a sign-tracker. We performed a two-sample Kolmogorov-Smirnov (K-S) test to show that the PavCA index score distributions differed significantly between Charles River and Harlan SD rats (D = 0.28418, p-value < 2.2x10-16). In summary, 11 PavCA metrics were available for analysis, each of which we measured on days 1, 2, 3, 4, and 5 (S7 Table). Phenotypic data is available in S6 File. Double digest genotype-by-sequencing (ddGBS) To obtain genotypes, we used ddGBS, a genotyping method that reduces the complexity of the genome by only sequencing regions proximal to restriction enzyme cut sites [6,83]. We have recently described the technical aspects of this protocol in detail [65]. The ddGBS protocol used in this paper is a synthesis of the GBS approach described in Graboski et al. [84] and used more recently by Parker et al. [15] and Gonzales et al. [44], and an analogous approach known as double digest restriction associated DNA sequencing (ddRADseq) [85]. The full protocol is available in S1 Text. DNA was extracted from rat tails using the PureLink Genomic DNA kit. DNA purity was assayed using a Nanodrop 8000 (260/280 ≥ 1.8) and DNA integrity by gel electrophoresis (minimal smearing). Genomic DNA was then digested using two restriction enzymes: PstI (6-bp recognition site) and NlaIII (4-bp recognition site). Adapter oligos were annealed to overhangs left by Pst1 and NlaIII. The PstI adapters contained 48 unique 4–8 bp in-line indexes [15,44,84]. A Y-adapter was annealed to the NlaIII cut sites, which controlled the direction of the first round of PCR amplification and thus ensured that the library was primarily composed of fragments with one of each of the adapters. Post-annealing, sets of 24 individual sample libraries were quantified and pooled. Pooled libraries were PCR amplified for 12 cycles, size-selected for 300-450bp using the Pippin Prep, and quality checked by Agilent Bioanalyzer (peak range ~ 300-500bp and conc. ≥ 20nM). Sequences for the 48 barcoded adapters, Y-adapter, and PCR primers are provided in S2 Text. Sequencing of pooled libraries was performed by Beckman Coulter Genomics (now GENEWIZ). Sequencing was carried out on the Illumina HiSeq 2500 using v4 chemistry and 125-bp single-end reads. Each lane consisted of a pool of 24 samples, resulting in an average of 8.9 million reads per sample. A total of 4,608 unique ddGBS sample libraries were sequenced. Of these samples, 381 were sequenced twice, resulting in two sets of sequencing data for each sample from the same library prep that were used for to check genotype concordance (S2 Table). Light whole-genome sequencing (WGS) To discover new variants and support imputation, we performed low-coverage whole-genome sequencing of 80 SD rats from this same cohort. The rats were selected to represent sign-trackers, goal-trackers, and intermediate responders from each of the major barriers within the 6 major subgroups of Harlan and Charles River. Sample libraries were prepared using the Illumina TruSeq PCR-Free Library Prep kit and quality checked using an Agilent Bioanalyzer and qPCR on an Applied Biosystems StepOne Real-Time PCR System to ensure they met Illumina quality standards. Sample pooling (10 samples per pool) was performed by Beckman Coulter Genomics. Each pooled library was sequenced on two lanes of an Illumina 2500 flow cell with 125-bp single-end reads, resulting in an average of 51.6 million reads per sample per two lanes. Assuming that the rat genome (rn6) is ~2.87Gb, this provided about 180x coverage of the rat genome, or about 2.25x coverage per rat. ddGBS Sequence data processing We have recently described the bioinformatic steps that we use for ddGBS in detail [65]. We follow an analogous approach in this paper, though we deviate at the imputation step due to our use of SD instead of HS rats. Briefly, raw reads from ddGBS were demultiplexed using FASTX Barcode Splitter v0.0.13 [86], allowing for 1 mismatch. After demultiplexing, barcodes were trimmed by cutadapt v1.9.1 [87]. Any reads not matching a sample’s barcode within 1-bp were filtered out. We removed 316 samples for which there were less than 4 million reads, leaving 4,292 samples with ddGBS data. We also used cutadapt to trim low-quality base pairs (phred quality score < 20) at the ends of the reads, and to remove 3’ adapter sequences. Reads trimmed to less than 25-bp were discarded. Next, all reads were aligned to the rat reference genome assembly (rn6) using bwa v0.7.5a [88]. All ddGBS reads were realigned at known indel sites by GATK’s IndelRealigner v3.5 [89]. Because of a lack of SD-specific variant data, we used variant data from 42 whole-genome sequenced rat strains and substrains [67] as the reference set for indels. We then used GATK to perform base quality score recalibration (BQSR) on the BAM files, using data (SNP & indel) from the 42 rat genomes as the “known” set of variants. For the ddGBS samples that were sequenced twice (381 remaining after filtering for read count), we performed all quality control and variant calling steps in parallel, since our goal was to compare calls made in these samples as a means of estimating the genotyping error rate. Light WGS data processing Raw reads from WGS were processed in an analogous manner to the ddGBS data (detailed above) through the alignment step. After alignment, duplicates reads were removed using picard [90]. Reads were then realigned and underwent BQSR. The final WGS BAM files from each lane (2 files per sample) were merged. The WGS BAM files for the 63 samples that had undergone both ddGBS and WGS were then merged. Variant discovery and imputation–ANGSD/Beagle We have previously compared GATK’s HaplotypeCaller tool [89] to other approaches and found that for ddGBS data, the Samtools variant calling model [91] as implemented by ANGSD v0.911 [92] is a better method for estimating genotype likelihoods from the mapped ddGBS reads [65]. Genotype likelihoods convey a layer of uncertainty in the called genotype based on the underlying aligned reads, which can be useful for downstream imputation by allowing for more accurate probabilistic modeling. This information is typically lost when strictly using “hard” genotype calls. Likelihoods were obtained in 10Mb chunks of the genome, which were subsequently concatenated. Major and minor alleles were inferred from the data based on allele frequency estimates made from the genotype likelihoods. The likelihoods were only estimated at sites where at least 100 samples had reads. ddGBS data results in low call rates at many loci. However, we retained these loci because we anticipated they would be useful for imputation. Next, we used the ANGSD genotype likelihoods to impute missing genotypes (that is for SNPs where only a portion of the rats had genotyping information) using Beagle 4.1 [93,94], which produced a VCF file containing hard genotype calls (0,1, or 2), dosages ([0,2]), and posterior probabilities for each genotype ([0,1]) for 2,274,118 biallelic SNPs in 4,309 rats (ddGBS+WGS). This is the unfiltered set of SNPs and samples we moved forward with for all subsequent steps. We elected not to pursue variant quality score recalibration using the GATK VariantRecalibrator algorithm [95], because we did not have the required “truth” SNP set. Due to the poorly understood population history of SD rats, it was unclear whether the variation present in the 42 rat genomes would be representative of the variation present in our sample. Using the 42 genomes as a reference for the VariantRecalibrator may also have negatively impacted the calling of novel SD variants. Genotype concordance check Whereas some of our past projects that used GBS were able to determine the accuracy of GBS genotypes by comparing them to genotypes obtained from SNP microarrays, we did not have microarray-based genotypes for this cohort. Instead, we relied on the remaining 381 duplicate samples whose genotypes were called in parallel. To estimate genotyping accuracy, we compared the rate of concordance of hard genotype calls among the duplicate samples. We first filtered variants by dosage r2 (DR2), a measure of the accuracy of the genotype imputation performed by Beagle. We tested three different DR2 thresholds (≥0.7, ≥0.8, ≥0.9). We then removed variants with MAFs < 1% or that violated Hardy-Weinberg equilibrium at a threshold of 1x10-7 in either vendor population. Concordance rates were checked by two methods: 1) by using hard calls in the RAW format from plink 1.9 and dividing the number of times a genotype call matched between duplicate samples by the total number of SNPs and 2) by taking the mean Pearson correlation of the dosages of the duplicate samples. The results are presented in S2 Table. Similar error estimates were obtained by the hard call and dosage approaches. We chose to move forward with the DR2 threshold of 0.9 for all subsequent analyses, which yielded an error rate of ~0.85%. Post-genotyping sample filtering We removed female rats (n = 77) and rats from Taconic Farms (n = 4) since they made up a very small fraction of the total sample. We also removed rats that showed poor clustering in the PCA analysis, described below. We filtered out individuals with unusually high or low rates of heterozygosity and high degrees of relatedness as detailed below. Lastly, we excluded a small set of duplicate samples (n = 7) and samples missing phenotype data for mapping (n = 10). All filters and sample numbers are listed in S4 Table. After these steps, 4,061 unique male SD rats from Charles River (n = 1,780) and Harlan (n = 2,281) remained. Principal component analysis, identity-by-descent, and heterozygosity We performed principal component analysis (PCA) on the cohort of 4,228 samples filtered for low read count, having been received from Taconic, and females. PCA was performed on genotype calls in R using the prcomp function in R [96] on a set of variants pruned for SNPs with MAF ≤ 0.05 in the combined sample set, SNPs in pairwise LD with r2 > 0.5, and SNPs violating HWE at a p-value < 1x10-7 in either Charles River or Harlan. The first PC clearly separated animals from Harlan and Charles River; however, there was a set of 54 rats that did not visually cluster as expected at the level of vendor (S7 Fig). These animals were removed from all subsequent analyses (we assumed they reflected inaccurate records, sample mix-ups, or some other technical problems). With this further reduced set of 4,174 rats, we continued on to assess the genetic relationships among the rats in our sample. SD rats were ordered in multiple batches over several years, and we suspected some of these rats would be closely related (siblings, cousins, etc.). We reapplied the variant filters used for PCA and utilized the==genome function in plink 1.9 on the resulting SNP set to estimate (the proportion of genotypes predicted to be identical by descent), for all pairs of samples. S4A and S4C Fig show that while most of the animals were unrelated, there were a significant subset of closely related pairs, as well as some pairs with exceedingly high IBD1 rates in Harlan. We used the plink 1.9 function==het to examine possible inflation or deflation of heterozygosity rates in our samples. S8A and S8C Fig show that a handful of samples in both populations with uncharacteristically high (> 0.25) or low (< 0.25) rates of heterozygosity. We filtered out 34 such samples, as we found they drove the majority of the anomalous signal in our pre-filtering plots in S4 Fig. We also removed 12 samples with more than 30 close relations (defined as ≥ 0.1875) and 32 samples that had a ≥ 0.6 with another sample (likely sample contamination/mix-up). After applying these sample filters, we were left with S4B and S4D and S8 Figs. After reaching our final set of 4,061 rats, we reapplied the SNP filters on the reduced sample set, resulting in 4,502 SNPs that were polymorphic in both populations. We ran PCA on these SNPs as described above. We then repeated PCA on the samples from Charles River and Harlan separately to examine substructure within each vendor population. Using the results from the vendor-level PCA analysis, new groupings of barrier facilities were determined based on genetic similarity. Barrier facilities that clustered most highly in the PCA were grouped together for subsequent analysis. Final variant filtering and minor allele frequency spectrum After establishing the final cohort of 4,061 rats, we sought to establish a final set of SNPs to be used for the association analyses. Starting from the initial full set of 2,274,118 SNPs, we first removed sites with DR2 < 0.9. Then, based on the PCA results, we separated the samples into seven subgroups, each containing genetically-similar samples that clustered highly: SD rats originating from Harlan 202A/202C/208A (Har 202A), Harlan 206, Harlan 217, Charles River R09/P03/P07/P10/C71/K92 (Charles River R09), Charles River R04, Charles River P09, and Charles River C72. These seven groupings were used for all subsequent variant filtering performed in plink 1.9, which converts the posterior probabilities we received from Beagle to hard genotype calls, so long as the probability is ≥ 0.9. In each cluster, we removed SNPs with MAF < 0.0001 using the–maf option in plink 1.9. Lastly, we used the==hardy function in plink 1.9 [97] to perform tests for SNP Hardy-Weinberg equilibrium and removed all SNPs with an HWE p-value < 1x10-7 in any of the seven groups. The final counts of samples and SNPs in each barrier grouping are available in S3 Table. We used the==freq option in plink 1.9 to estimate the MAFs for these final filtered sets SNPs. The distributions are show in (Fig 3A). Taking the union of these seven sets of SNPs, there were a total of 291,438 polymorphic sites passing our filters across all subgroups. Our final set of filtered SNPs contained 75,405 novel SNPs that had not be previously reported by Hermsen et al. [67]. We also compared our final set of SNPs to the most recent dbSNP release for rats (Build 149, November 7, 2016) and found that 178,545 of the 291,438 SNPs we discovered were not present in the current dbSNP build. See the Data Availability section for more information on accessing this data. Fixation Index (F ST ) To quantify the population divergence between SD rats from Harlan and Charles River, we computed F ST for each SNP in the union set using smartpca within the EIGENSOFT package [41,98,99]. Due to the substructure within vendor and vendor location that we saw in our PCA analyses, we chose to estimate the pairwise F ST between all seven sample subgroups defined above. Linkage disequilibrium We plotted the decay of linkage disequilibrium (LD) using the==r2 utility in plink 1.9 and the procedures described in Parker et. al 2016 [15]. Briefly, each population’s curve included all SNPs with MAF > 20% and pairwise LD comparisons were restricted to SNPs with allele frequencies within 5% of each other. An average r2 estimate was obtained using 10,000 randomly selected SNP pairs from each 100kb interval for the distance between two SNPs, starting with 0-100kb and end at 9.9-10Mb. The procedure was repeated for cluster grouping. The CFW mice curve was downloaded from a repository for Parker et al. (http://datadryad.org/resource/doi:10.5061/dryad.2rs41) [15] and used for comparison as another commercially available, outbred rodent stock. LMM covariates and phenotype data pre-processing To select covariates for the GWAS, we performed univariate linear regression for each potential covariate for each PavCA metric. This was done separately for rats from each genetically-similar cluster defined previously. Any covariates that accounted for 1% or more of the variance for at least one PavCA metric were passed into multivariate model selection with the R package leaps [100]. Model selection with leaps was performed for all metrics for all days of testing, as well as the average of days 4 and 5. Out of the 66 models, all covariates that had surpassed the 1% threshold to reach this step were ultimately selected in at least 40% of the leaps models. The covariates included in the downstream GWAS analyses were rat’s age at testing, housing (binary–single or multiple), light cycle (binary–standard or reverse), and a set of binary ‘indicator’ variables to model the effects of different experimenters/technicians responsible for the phenotypic testing (S8 Table). All covariates were included in the LMMs for association testing by GEMMA, rather than being regressed out prior to GWAS. Many of the PavCA metrics were exceedingly non-normally distributed. In most cases, this was expected due to how the behaviors were measured and defined. For example, the rats only had a window of 8 seconds in each trial during which to contact the lever and/or make a nose poke into the food magazine. All values for “average latency to lever press” or “average latency to magazine entry” were therefore necessarily between 0 and 8 seconds. As is typical for latency traits, many of the values were near 0 or exactly 8. Similarly, the “probability” of lever press or magazine entry were very skewed towards the limits 0 and 1, especially after conditioned responding had been established on the later test days of training. Given these unusual distributions, we chose to quantile normalize all metrics within each subgroup prior to association testing, accepting a possible loss in power since samples with identical values are ranked randomly during the quantile normalization procedure. SNP-based heritabilities Heritabilities were estimated separately for each vendor and subgroup using their sets of filtered SNPs. We used the SNP sets to construct genetic relationship matrices (GRMs) for each cluster using GCTA [101]. We then used the restricted maximum likelihood (REML) approach within GCTA on the GRMs, covariates, and quantile normalized PavCA data to calculate the SNP-based heritabilities for each metric on each testing day. GWAS We used GEMMA v0.97 [102], which implements an LMM for GWAS analysis. We included a GRM as a random term to account for relatedness and population structure. Though beneficial for preventing false positive associations, GRMs can also reduce power to detect QTLs in populations with greater levels of LD; this is due to proximal contamination [103,104]. To avoid this reduction in power, we used the leave-one-chromosome-out (LOCO) approach when constructing the GRMs for each cluster of samples [44,105,106]. As described above, we selected covariates that were included as fixed effects in our model (S8 Table). Any samples missing measurements for a given metric were removed from that analysis. For all GWAS, genotypes were represented as dosages (continuous [0,2]) in lieu of ‘hard’ genotype calls [0, 1, 2] to account for uncertainty in the genotype calls. Reported p-values come from the likelihood-ratio test (LRT) performed by GEMMA. Results were plotted using a custom R script. GWAS meta-analysis We performed meta-analyses of the GWAS results across all seven subgroups, across only the four Charles River subgroups, and across only the three Harlan subgroups. Due to the significant genetic differentiation between certain subgroups in this meta-analysis, it is possible that tested loci would have differing effect sizes and directions of effect. To account for this and maximize our power, we utilized the software MR-MEGA ver.0.2, originally designed to perform trans-ethnic meta-regression in humans [47]. A large number of the SNPs called in each subgroup were either unique to that singular group or a subset of the groups, making them of low utility in a meta-analysis of all seven subgroups. Therefore, the seven subgroup meta-analyses were performed on the intersection SNPs set for all seven clusters, amounting to 64,442 SNPs. We repeated the meta-analyses for the only the subgroups derived from Charles River (R04, R09, P09, C72) and only those from Harlan (202A, 206, 217) to allow for more SNPs to pass through to the meta-analysis. There were a total of 198,817 shared SNPs between the Charles River subgroups, but only 83,120 shared between the Harlan subgroups. Genome-wide significant results from these meta-analyses are presented in S1 File. For all significant hits, we also report the directionality and strength of effect of the SNPs in all the separate subgroups GWAS. P-value significance thresholds In human GWAS, 5x10-8 is widely used as a significance threshold [107]. However, model organisms have vastly different levels of LD, meaning that the effective number of independent tests differs between studies. Therefore, many prior studies have used permutation testing [108,109], where phenotype data is shuffled with respect to fixed genotypes. The computational load of such methods becomes intractable for studies with large sample sizes and several traits being mapped [103]. Thus, we used the sequence of SLIDE [110,111] and MultiTrans [45] to obtain significance thresholds. We used separate thresholds for each subgroup because the number of SNPs, the LD structure, and the marker-based heritabilities were different, all of which affect the estimated number of independent tests performed as part of the GWAS. An advantage of this approach is that only one threshold was needed per cluster for all normalized metrics. We used a sliding window of 1000 SNPs and sampled from the multivariate normal 10 million times to obtain a 0.05 significance threshold. The subgroup-level -log 10 (p) thresholds for Charles River R04, R09, P09, and C72 were determined to be 6.14, 6.13, 6.11, and 6.11, respectively. Similarly, the Harlan 202A, 206, and 217 thresholds were 6.09, 6.00, and 5.89, respectively. Since there is no clear extension of this approach for a meta-analysis with largely varying sets of SNPs and LD patterns, we ultimately obtained the meta-analysis thresholds by taking the subgroup from each vendor with the lowest level of LD (Charles River R04 & Harlan 202A) and filtering their SNP set to contain only the SNPs included in each meta-analysis. We then reran MultiTrans using these pruned sets of SNPs, providing us with a -log 10 threshold of 5.574 for the over-arching meta-analyses, 5.574 for the within-Harlan meta-analyses, and 5.993 for within-Charles River. Each set of meta-analyses (Charles River, Harlan, or overall) consisted of 55 PavCA metric x Training Day combinations. A standard Bonferroni correction applied to the meta-analysis thresholds stated above yields revised thresholds of 7.314 for the over-arching and Harlan meta-analyses and 7.73 for the Charles River meta-analysis. [END] --- [1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010234 Published and (C) by PLOS One Content appears here under this condition or license: Creative Commons - Attribution BY 4.0. via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/