(C) PLOS One This story was originally published by PLOS One and is unaltered. . . . . . . . . . . Multivariate phenotype analysis enables genome-wide inference of mammalian gene function [1] ['George Nicholson', 'University Of Oxford', 'Oxford', 'United Kingdom', 'Mrc Harwell Institute', 'Harwell', 'Hugh Morgan', 'Habib Ganjgahi', 'Steve D. M. Brown', 'Ann-Marie Mallon'] Date: 2022-08 A factor analysis of the MV model’s fitted covariance structure identifies 20 clusters of phenotypes, with each cluster tending to be perturbed collectively. These factors cumulatively explain 75% of the KO-induced variation in the data and facilitate biological interpretation of perturbations. We also demonstrate that the MV approach strengthens the correspondence between IMPC phenotypes and existing gene annotation databases. Analysis of a subset of KO lines measured in replicate across multiple laboratories confirms that the MV model increases power with high replicability. There are 4,256 (1.4% of 302,997 observed data measurements) hits called by the univariate (UV) model analysing each phenotype separately, compared to 31,843 (10.5%) hits in the observed data results of the MV model, corresponding to an estimated 7.5-fold increase in power of the MV model relative to the UV model. One key property of the data set is its 55.0% rate of missingness, resulting from quality control filters and incomplete measurement of some KO lines. This raises the question of whether it is possible to infer perturbations at phenotype–gene pairs at which data are not available, i.e., to infer some in vivo effects using statistical analysis rather than experimentation. We demonstrate that, even at missing phenotypes, the MV model can detect perturbations with power comparable to the single-phenotype analysis, thereby filling in the complete gene–phenotype map with good sensitivity. The function of the majority of genes in the human and mouse genomes is unknown. Investigating and illuminating this dark genome is a major challenge for the biomedical sciences. The International Mouse Phenotyping Consortium (IMPC) is addressing this through the generation and broad-based phenotyping of a knockout (KO) mouse line for every protein-coding gene, producing a multidimensional data set that underlies a genome-wide annotation map from genes to phenotypes. Here, we develop a multivariate (MV) statistical approach and apply it to IMPC data comprising 148 phenotypes measured across 4,548 KO lines. Funding: This work was supported by Medical Research Council ( https://mrc.ukri.org/ ) Programme grant MC_UP_A390_1107 (G.N., H.G. and C.H.) and National Institutes of Health ( https://www.nih.gov/ ) grant U54 HG006370 (H.M., S.D.M.B., A.-M.M.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. The data and code used to generate the results in the paper are available at https://github.com/georgenicholson/multivariate_phenotype_data_and_code and https://zenodo.org/record/6787112 . Copyright: © 2022 Nicholson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. We validate our MV method in a number of ways. We evaluate the efficacy of inference in the presence of missing data by artificially masking data and comparing the masked data results to the fully observed data results. We independently assess the MV hit-calling method by examining the replicability of hits called on the same KO lines measured across multiple laboratories. We also perform a number of additional checks, around the biological reasonableness of the results, as well as assessing quantitative measures of model robustness and fit. Our checks indicate that the MV approach can substantially increase hit rates in the IMPC, while retaining error rate control and replicability, even when calling hits in cases of missing phenotype data. The development of a sensitive, replicable, and comprehensive gene–phenotype map will ensure that the number of animals used in follow-up experiments to the IMPC is minimised, in alignment with the 3Rs of replacement, reduction, and refinement [ 24 ]. A major goal of the IMPC is to create a comprehensive gene–phenotype annotation map. From a statistical perspective, this involves testing the null hypothesis that there is no phenotypic perturbation. Alongside the MV model, we design a permutation-based approach to hypothesis testing aimed at powerful inference under careful control and monitoring of false positive rates. Our approach is based on the generation of synthetic null KO lines by structured random resampling from control animals (details in Methods–Control of error rates). By analysing synthetic null lines alongside true KO lines, we are able to select significance thresholds for effective error rate control. We adopt a composable approach to MV modelling that is computationally attractive while effectively capturing the important variation in the IMPC data set. First, we fit a UV multilevel model [ 11 ] for each phenotype separately. Second, we take the UV model’s outputted effect estimates and standard errors and fit an MV model to these, building methodologically on the work of [ 22 , 23 ]. We contextualise and compare performance of our methods against the background of this existing work in Methods–Comparison with existing methods. So far, the literature on high-throughput phenotyping has focused exclusively on calling hits (testing for a perturbation) at one phenotype at a time, using so-called univariate (UV) models [ 11 , 12 ]. However, initial results from the IMPC have revealed strong correlation between perturbations at different phenotypes. Multivariate (MV) association methods have already proven successful in many genetic applications, such as genome-wide association studies [ 15 , 16 , 17 ] and multi-tissue eQTL studies [ 18 , 19 , 20 , 21 , 22 ]. This points to an opportunity for improving inference in the IMPC by sharing information across phenotypes using MV methods. In particular, when sample size is severely limited on ethical and financial grounds, the hope is that MV methods can computationally increase the information extracted from the data that are gathered. Further, in our IMPC data set, not all phenotypes are available on each KO line. This raises the question of whether it is possible to infer perturbations at (phenotype, KO line) pairs at which data are not available, i.e., to infer some in vivo effects using statistical analysis rather than experimentation. We set out to implement an MV model that can effectively perform this type of inference when some data are missing. Each point corresponds to 1 animal, with data from 2 KO lines—labelled g and —displayed alongside contemporaneous data from a large number of control (wild-type, or WT) animals in grey (see legend). Panels (a) and (b) show data from phenotypes p (Triglycerides) and (Body fat percentage), respectively. Our goal is to quantify the underlying expected perturbations of the red/blue coloured points from the rolling WT baseline (illustrated with a smooth black curve), in the presence of structured experimental noise. Annotated on the plot, to the right of the red/blue measurement data of each gene–phenotype pair, are the posterior mean estimates from the UV and MV models, with empty squares and with filled squares (see legend), along with error bars denoting ±2 posterior SDs. In the current paper, we combine the information in across multiple related phenotypes, such as p and , thereby generating improved estimators . The relative means and SDs of the UV and MV estimators shown in the plot are illustrative of their general properties—MV posterior means are shrunken towards zero (towards the black curve here) relative to the UV posterior means in 90.2% of cases (phenotype–gene pairs); while MV posterior SDs are smaller than UV posterior SDs in > 99.9% of cases. The data and code used to generate this figure are available at [ 13 , 14 ]. IMPC, International Mouse Phenotyping Consortium; KO, knockout; MV, multivariate; UV, univariate; WT, wild-type. In the IMPC adult phenotyping pipeline, a sequence of standardised measurements is performed on single-gene KO and control mice aged between 9 and 16 weeks. We refer to the measurements as phenotypes with these being measured in groups called procedures; all phenotypes within a given procedure are measured in a specific week of age ( S1 Fig ). The scientific purpose, experimental design, and detailed description for each procedure are presented at the IMPC website [ 9 ]. The primary scientific goal is to identify statistically significant KO-induced phenotypic perturbations, also referred to as phenotypic hits or positive annotations. The experimental design of the IMPC measures on average 14 animals (7 of each sex) from each KO line, contemporaneously with the rolling baseline of control animals. This is visualised for a pair of phenotypes at one of the phenotyping centres, MRC Harwell, in Fig 1 . The statistical goal is to estimate and test for a difference in phenotypic mean between each KO line and the shared set of control animals. Conceptually, an unpaired t test between KOs and controls is the basic statistical idea, but in practice, multilevel modelling is necessary due to the complex experimental structure. For example, litters and other experimental strata are occasionally confounded with the gene-KO effects of interest, necessitating the use of hierarchical models to identify effects of interest [ 10 , 11 , 12 ]. By March 2022, approximately 10,000 KO mouse lines, many for poorly understood genes, have so far been generated, and 8,623 of those lines have been phenotyped using standardised procedures for a wide variety of disease systems. In this paper, we analyse a partial IMPC data set comprising 4,548 KO lines with phenotype data from some of 148 quantitative phenotypes as of 26 March 2018. In excess of 300 measurements are conducted on each animal, ranging from clinical blood chemistry, through calorimetry and body composition, to behavioural phenotypes [ 2 ]. By inference from the multidimensional data sets produced, the IMPC is compiling a genome-wide annotation map from genes to phenotypes that is already providing unique insights into mammalian gene function and the genome landscape of diverse diseases [ 3 , 4 , 5 , 6 , 7 , 8 ]. The function of the majority of genes in the human and mouse genomes is unknown. Investigating and illuminating this dark genome is a major challenge for the biomedical sciences [ 1 ]. Developing a comprehensive catalogue of mammalian gene function will be a vital underpinning to studies of rare and common disease and advances in precision medicine. The International Mouse Phenotyping Consortium (IMPC) is a collaboration between 21 research institutions worldwide aimed at addressing the challenge of the dark genome through the generation and broad-based phenotyping of a knockout (KO) mouse line for every protein-coding gene ( www.mousephenotype.org ). Results We have previously designed a UV Bayes linear multilevel model targeting the phenotypic perturbation of gene KO animals relative to wild-type (WT) animals [11]. We fit this model to each (phenotype, centre) combination separately, yielding an estimate (and SE) of the phenotypic perturbation, (and ), for each (phenotype p = 1,…,P, gene g = 1,…,G) pair at which measurements are available. Example data and estimates of are illustrated in Fig 1. In this paper, we develop an MV modelling framework, building on the methodological work of [22,23]. The method takes as input the UV results, ( ), and outputs MV estimates ( ) across all (phenotype p, gene g) combinations, including those pairs at which data are unavailable. The MV model is based on a covariance structure Σ allowing perturbations to be correlated across different phenotypes, as illustrated between Triglycerides and Body fat percentage in Fig 1. The method also incorporates a correlation matrix, R, to account for structure in experimental noise across phenotypes. A practically useful property of this 2-stage model is its composability, whereby results can be transferred efficiently between 2 different analyses or computational tools—here from an arbitrarily complex UV model to a highly structured MV model. We lay out the results in 3 conceptual stages. First, we provide high-level technical descriptions of the UV and MV models. Second, we characterise the IMPC hit calling results, contrasting the UV and MV models, and with a focus on demonstrating statistical power and replicability. Finally, we look to applications to demonstrate how the MV approach can illuminate relationships between phenotypic perturbations and underlying biological mechanisms, and do this relatively effectively compared to its UV counterpart. These examples provide extra evidence of the MV method’s validity and replicability by illustrating how its results make intuitive sense and are aligned with existing scientific knowledge. Univariate model The parameter of interest throughout is denoted by θ pg and represents the expected perturbation of the pth phenotype in the gth gene KO, relative to WT animals (Fig 1). This UV model, fitted only to data from KO line g accompanied by data from the entire rolling baseline of WT animals, takes the form of a linear multilevel model (or mixed-effects model): (1) where y i is the Box–Cox transformed [25] measurement of the pth phenotype on the ith mouse. The parameters in β adjust additively for sex, sex–genotype interaction, strain, investigator, and other experimental metadata, while day and litter effects are modelled hierarchically via α day and α litter with variance components and . In this paper, we focus on estimation of θ pg , the main effect of genotype g on phenotype p. In cases where genotype effects differ between sexes [5], θ pg is interpretable as the average of those sex-specific effects. Longitudinal changes in the measurement baseline are modelled using a penalised spline which features in both fixed and random components [26]. Noninformative priors are specified for θ pg , β and the σ r , with the model being fitted via Markov chain Monte Carlo (MCMC) and outputting samples from the marginal posterior distribution p(θ pg |y) (for further details, see S1 Note and [11]). The UV inference outputs an estimate and standard error for each θ pg , i.e., the posterior mean and posterior SD , respectively. We perform careful quality control of the UV results, conservatively filtering out (from downstream MV analysis) any centre–procedure combinations that exhibit anomalous longitudinal patterns in UV results; such patterns can be indicative of unmodelled experimental artefacts rather than the biological effects (S2 Fig). Next, to ensure that there are sufficient data at each phenotype, we apply a post-QC heuristic filter whereby we retain only those phenotypes with UV effect estimates for at least 500 KO lines. After QC and filtering, the UV estimates (and SEs) are scaled so that the have unit SD for each phenotype within each phenotyping centre and are then taken forward as input for the MV model. Multivariate model In collecting together the results of the UV multilevel model, we obtain unbiased estimates (and SEs ) for θ ·g that are affected by MV experimental noise, having the covariance structure . Further, the latent P-dimensional MV perturbations θ ·g tend to exhibit strong P×P covariance structure. These aspects of the data suggest a model following the form of [22,23]: (2) (3) where the parameters Σ s represent the covariance of θ ·g , i.e., of the expected phenotypic perturbation for a KO line, and the hyperparameter R models the correlation structure of the experimental noise. The are known diagonal matrices of standard errors outputted by the UV model. The density of the latent perturbations, p(θ ·g |Σ, π 1:M ), is an MV Gaussian mixture model with mixing probabilities π 1:M;1:S over a specified ladder of scales given by constants ω 1:M [22,27], and S≥1 covariance matrices Σ 1:S to be learned [23]. We relate our approach to [22,23] in more detail and compare performance in Methods–Comparison with existing methods. We constrain Σ s to factor-model form (see, e.g., [28]): (4) where W s is a P×K matrix, and Ψ s a diagonal P×P matrix having positive diagonal elements. We performed the full analysis for fixed K∈{15,20,30,40}. The results presented in the manuscript are for a choice of K = 20 and S = 1, selected with reference to false discovery rate–controlled (Fdr-controlled) hit rates [29,30], and cross-validated likelihood measures of model fit (see Methods–Comparison with existing methods). We take an empirical Bayes approach to inference in the MV model specified at (2)–(4). The experimental correlation hyperparameter, R, is estimated from synthetic null data and fixed at in advance [22]. The expectation–maximisation (EM) algorithm is used to obtain maximum a posteriori (MAP) estimates of hyperparameters Σ 1:S and π under flat priors (a derivation and further details of the EM algorithm are in S2 Note). Conditional on the MAP estimates , the posterior for θ ·g is available in closed form (see Methods–MV model when data are missing). Inference when data are missing Even for gene–phenotype pairs at which no data are measured, referred to here as missing data, the MV model can be used to infer gene KO effects via the correlation structure that exists between unmeasured and measured phenotypes. The MV model identifies perturbations in 4,819 (1.3% of 370,107 missing data cases), which compares favourably with the UV model’s hit rate of 1.4% on observed data. When missing data results are combined with the observed data results, the MV model detects a total of 36,662 perturbations, an 8.6-fold increase compared to 4,256 detected by the UV method. It is important to note that estimation of θ ·g when is only partially observed can be performed coherently provided the statistical model is well specified with respect to the underlying data generating mechanism, and the unobserved data are missing at random (MAR) [23,33,34]. While there is a large proportion of missing data, it is clear from Fig 2 that the bulk of data is missing in obvious blocks and is a result of certain measurements/procedures not being performed in some centres. In this context of certain centres systematically not performing a subset of measurements, the MAR assumption is reasonable, in that the missing data mechanism, “given the missing data and the value of the observed data, is the same for all possible values of the missing data.” [33]. In spite of this reassuring observation, there is naturally still going to be some relatively small proportion of data that violate the MAR assumption in such a large and complex data set as this. We therefore perform additional checks on how practically reasonable the MAR assumption is. These are described in Results–Validating replicability (with reference to Fig 6C and 6D), and in Methods–Predicting masked data. Our recommendation to practitioners is carefully to examine the appropriateness of the MAR assumption in their particular context in the light of the work of Rubin and colleagues [33,34]. If there are any doubts about the MAR assumption’s validity, we recommend further empirical checks. In particular, the cross-validated mask and predict approach described in Methods–Predicting masked data can be implemented in a wide variety of MV datasets with missing data, and we recommend this as a tool for checking accuracy post hoc when the rate of missingness is high. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 6. Replicability validation scatterplots comparing results across phenotyping laboratories. Each panel examines a different type of comparison of a pair of replication results: (a) UV model vs. UV model, (b) MV model (measured) vs. MV model (measured), (c) MV model (missing) vs. MV model (missing), (d) MV model (measured) vs. MV model (missing). We examine the interlaboratory agreement for the KO reference lines by scatterplotting scaled z-statistics, , for the same KO line but measured in different laboratories. Significant perturbations correspond to , as delimited on the graphs with dashed red lines. Each point in the plots corresponds to 2 different laboratories measuring the same phenotype on the same KO line. The most informative cases for estimating the false sign rate (Fsr) occur when both laboratories detect significant perturbations, which correspond to points lying in the blue/red shaded regions on the scatterplot. The laboratories agree in the blue shaded regions but disagree in the red shaded regions. estimates (95% CIs) are shown at the top of each panel and are based on the level of agreement/disagreement observed in the shaded regions (Methods–Replicability and false sign rates). Counts (%) for each significance combination are superimposed; while the axes extend to [–3, 3], the counts apply to all data, including those beyond the plot’s scale. The data and code used to generate this figure are available at [13,14]. Fsr, false sign rate; KO, knockout; MV, multivariate; UV, univariate. https://doi.org/10.1371/journal.pbio.3001723.g006 Validating replicability We validate the UV and MV results by leveraging the multilaboratory nature of the experimental data. As part of the IMPC, a small number of KO lines have been measured multiple times across several labs, blind to their special status, i.e., the same gene KO, phenotyped in multiple labs; we refer to these as the reference lines. We analyse them under the UV and MV models while ensuring the models are blind to their correspondence to one another as replicated samples. After analysis, we reveal the reference lines and examine the replicability of findings on the same reference line across multiple phenotyping centres. S5 Fig plots the annotation results for the reference lines under the UV and MV models. This visually reinforces the impact of the MV model: It strongly increases the hit rate (denoted by a higher density of crosses) and does so in a replicable way. The directionality of pair-wise reference line hits of the MV model is concordant in 295 cases and discordant in 7 cases. Observed levels of replicability can be usefully interpreted in terms of a corresponding false sign rate (Fsr), described in Methods–Replicability and false sign rates and estimated using the IMPC reference lines. We attain a reassuringly low global estimate of = 1.2% (95% CI: 0.6% to 2.4%) for the MV model. Fig 6 provides further insights into the degree of replicability across laboratories in the reference line replicates. The blue/red shaded regions in each panel contain instances where results respectively agree/disagree across laboratories. The MV model (Fig 6B, 6C and 6D) identifies more perturbations than the UV model (Fig 6A) and does so with a high level of replicability, as measured by the small number of points in the red shaded regions and quantified by Fsr estimates ( ) shown at top of each panel corresponding to the results shown in that panel. Importantly, the level of agreement across laboratories is good regardless of whether the data were missing (Fig 6C) or measured (Fig 6B) or were measured in one laboratory but not in the other (Fig 6D). Comparison with IMPC database We compare the signed phenotype calls of our UV and MV models to the existing calls in the IMPC database, which are based on a different UV method [12]. The hit rate in the relevant subset of the IMPC database is 1.9%, while our UV model hit rate is 1.4% and our MV model hit rate on measured data is 10.5%. It is not straightforward to make direct comparisons with the IMPC database hit rate, owing to differences in error rate control (nominal p<10−4 in the IMPC database versus Fdr < 5% for our UV and MV models). However, when we inspect the concordance of our methods with the existing database, we see good agreement (Table 1), pointing to effective error rate control in all cases. Our UV model agrees with the IMPC database in all cases where both call a significant phenotype hit. Our MV model disagrees with the IMPC database in only 3 cases (0.1% of instances where they both call a hit). We examine these disagreements in more detail in Methods, where we conclude there to be little evidence among these 3 cases of either model outperforming the other. PPT PowerPoint slide PNG larger image TIFF original image Download: Table 1. Comparison of signed hits with the existing IMPC database. (a) UV model; and (b) MV model. Each model is compared to the corresponding hits called in the existing IMPC database (top). We represent calls by a number in {−1,0,1}, with 1 and −1 denoting significant positive and negative phenotypic perturbations, respectively, and 0 denoting a lack of statistical significance. https://doi.org/10.1371/journal.pbio.3001723.t001 Heterozygotes versus homozygotes For some genes in the IMPC, both the heterozygote and homozygote KO lines are measured. It is biologically reasonable that heterozygote and homozygote phenotypic perturbations, should they exist, are likely to act in the same direction. We can therefore compare the heterozygote/homozygote pairs for concordance in results (S6 Fig). As is expected biologically, homozygote lines are called as hits more frequently by the MV model (7.6%) than the corresponding heterozygote lines (2.3%). In cases where both the heterozygote and homozygote lines for a gene are called as hits, we observe directional concordance in 594 cases and discordance in only 46 cases. Under the assumption that all heterozygote/homozygote pairs truly perturb the phenotype in the same direction, then this level of discordance is consistent with of 3.7% (95% CI: 2.8% to 5.0%). This low Fsr estimate contributes further evidence that our control of false positive rates in hit calling is effective, adding to the evidence provided by the reference lines replicability analysis. In reality, there may be exceptions whereby the heterozygotes and homozygotes actually perturb the phenotype in different directions, in which case this zygosity-based estimate may still be usefully interpreted as an upper bound on the actual Fsr replicate . [END] --- [1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001723 Published and (C) by PLOS One Content appears here under this condition or license: Creative Commons - Attribution BY 4.0. via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/