(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright
------------
GPRuler: Metabolic gene-protein-reaction rules automatic reconstruction
['Marzia Di Filippo', 'Department Of Statistics', 'Quantitative Methods', 'University Of Milan-Bicocca', 'Milan', 'Sysbio Centre Of Systems Biology', 'Chiara Damiani', 'Department Of Biotechnology', 'Biosciences', 'Dario Pescini']
Date: 2022-01

The core ability of GPRuler is to determine the GPR rule associated with each reaction. It follows that determining the right relationships among the involved genes takes priority in validating the tool. To assess the degree of confidence of the reconstructed GPR rules, and hence the accuracy of GPRuler, we used curated metabolic models as ground truth.

According to the type of relationships established among the genes involved in a given reaction, GPR rules can be categorized into five classes. In the simplest cases, a reaction is associated with an empty rule (here labelled “No gene”) when no gene is involved in its catalysis, or with a single-gene rule (here labelled “One gene”) when a unique gene is responsible for its catalysis. These cases require minimal effort from GPRuler, because the final GPR rule is simply an empty string or the unique responsible gene. Our approach plays an active role in the more complex situations in which multiple genes are involved in reaction catalysis (here labelled “Multi gene” rules), where the right relationships among the involved genes need to be determined.
In particular, “Multi gene” GPR rules can be characterized exclusively by OR operators among genes (here labelled “OR” rules), exclusively by AND operators (here labelled “AND” rules), or, in the most intricate scenarios, by both AND and OR operators (here labelled “Mixed” rules).

We executed GPRuler on a workstation equipped with Intel Xeon 2.60 GHz CPUs.

GPRuler performance evaluation via a comparison with ground truth GPR rules

Following the generation of the rules by GPRuler, we evaluated their accuracy by comparison with ground truth GPRs. We considered as ground truth the GPRs of four extensively curated models of well-known organisms, namely HMRcore [36], Recon3D [37], Yeast 7 (version 7.6.0) [38] and Yeast 8 (version 8.3.4) [39].

To evaluate whether a rule reconstructed by GPRuler coincides with the corresponding ground truth rule, we compared their truth tables. A truth table consists of one column for each of the N Boolean input variables and 2^N rows, one for each possible combination of their values. The final column stores the value of the GPR evaluated for each row. We considered a pair of GPRs as identical only if their truth tables are identical, and labelled these cases as “Perfect match”. It is worth noting that the comparison of rules exceeding 20 genes was performed manually because of the computational cost.

In the opposite case, when at least one row of the table differs (“negative match”), the discrepancy can be attributed to differences in the set of genes involved in the two GPRs under comparison, to different Boolean operators connecting them, or to both. The first scenario can be ruled out by means of the Jaccard index. The Jaccard index is defined as the cardinality of the intersection of the two gene sets divided by the cardinality of their union, and quantifies the gene coverage of each reconstructed GPR rule as compared to the ground truth. It ranges between 0, if no overlap is observed, and 1, when the two sets coincide.
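The comparison described above can be illustrated with a minimal Python sketch. It is not GPRuler's own code: the function names, the rule syntax (gene identifiers without whitespace or parentheses, joined by `and`/`or`) and the convention that an empty rule evaluates to False are assumptions made for illustration. The sketch builds the truth tables of two rules over the union of their genes, tests them for identity, and computes the Jaccard index of the gene sets. It also computes the normalized Hamming distance that the text applies to negative matches whose gene sets coincide.

```python
import itertools
import re

OPERATORS = {"and", "or"}
PARENS = {"(", ")"}
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def tokenize(rule):
    """Split a rule such as '(G1 and G2) or G3' into parentheses, operators and genes."""
    return TOKEN.findall(rule)


def genes_of(rule):
    """Return the set of gene identifiers mentioned in a rule."""
    return {t for t in tokenize(rule) if t not in PARENS and t.lower() not in OPERATORS}


def evaluate(rule, state):
    """Evaluate a rule for one assignment of gene states (dict: gene -> bool)."""
    tokens = tokenize(rule)
    if not tokens:
        return False  # assumption: an empty ("No gene") rule evaluates to False
    expr = " ".join(
        t if t in PARENS else t.lower() if t.lower() in OPERATORS else str(bool(state[t]))
        for t in tokens
    )
    # The rebuilt expression contains only parentheses, and/or, True and False.
    return eval(expr, {"__builtins__": {}}, {})


def truth_table(rule, genes):
    """Output column of the truth table: one value per each of the 2^N rows."""
    return [
        evaluate(rule, dict(zip(genes, values)))
        for values in itertools.product([False, True], repeat=len(genes))
    ]


def is_perfect_match(rule_a, rule_b):
    """True when both rules give the same output on every row."""
    genes = sorted(genes_of(rule_a) | genes_of(rule_b))
    return truth_table(rule_a, genes) == truth_table(rule_b, genes)


def jaccard(rule_a, rule_b):
    """|intersection| / |union| of the two gene sets (1.0 if both are empty, by assumption)."""
    a, b = genes_of(rule_a), genes_of(rule_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def normalized_hamming(rule_a, rule_b):
    """Fraction of truth-table rows on which the two rules disagree."""
    genes = sorted(genes_of(rule_a) | genes_of(rule_b))
    ta, tb = truth_table(rule_a, genes), truth_table(rule_b, genes)
    return sum(x != y for x, y in zip(ta, tb)) / len(ta)


print(is_perfect_match("(G1 and G2) or G3", "G3 or (G2 and G1)"))
print(jaccard("G1 and G2", "G2 or G3"))
print(normalized_hamming("G1 and G2", "G1 or G2"))
```

Because the table has 2^N rows, exhaustive enumeration quickly becomes expensive as the number of genes grows, which is consistent with the manual handling of rules exceeding 20 genes mentioned in the text.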
For reactions having a Jaccard index of 1, we used the normalized Hamming distance between the two truth tables to quantify the extent of the mismatch between their GPRs in terms of the included operators. The normalized Hamming distance is the number of positions at which the two vectors differ, divided by the length of the vectors. It ranges between 0, if the two truth tables are identical, and 1, if the two truth tables are opposite.

In Fig 4, we report the Jaccard index computed for all the retrieved negative matches. As revealed by the distribution of the Jaccard index in all four ground truth models, we obtained mostly low index values and very few cases with a Jaccard index of 1. This implies that the discrepancy between the two GPRs of a negative match is almost always due to differences in the set of involved genes. The complete lists of Jaccard indexes and Hamming distances obtained for each tested model are included in the S2–S5 Files.

Together with the reaction-specific scores, we also computed a global Jaccard index to quantify the overall extent to which the expected genes are present across all reactions. The previously discussed prevalence of low Jaccard index values in the negative matches is reflected in a low global Jaccard index for the four ground truth models. In detail, we obtained a global Jaccard index of 0.36 in HMRcore, 0.55 in Recon3D, 0.63 in Yeast 7 and 0.65 in Yeast 8.

Fig 4. Evaluation of GPRuler performance in the obtained mismatches when compared with ground truth GPRs. The blue histograms on the left of each panel show the relative frequency distribution of the Jaccard indexes computed for the retrieved negative matches in all four ground truth models. Specifically: HMRcore in Panel A; Recon3D in Panel B; Yeast 7 in Panel C; Yeast 8 in Panel D.
The green histograms on the right of each panel report the normalized Hamming distance between the two truth tables of negative matches having a Jaccard index of 1. https://doi.org/10.1371/journal.pcbi.1009550.g004

For all negative matches, we manually checked the annotations of the involved gene products available in the Uniprot database and in organism-specific databases to determine whether the differences observed in the GPRuler outputs are more or less in line with the underlying biology than the GPRs used as ground truth. In detail, we used information from the GeneCards [40] and HGNC [41] databases for the HMRcore and Recon3D models, and from the Saccharomyces Genome Database (SGD) [42] for the Yeast 7 and Yeast 8 models.

When all the genes and their interactions in the ground truth GPR were consistent with biological knowledge but were not properly accounted for by GPRuler, we labelled the GPR as “Not automatically reconstructed by GPRuler”. Conversely, when some of the genes in the ground truth rule were wrongly associated with the reaction, or improper Boolean operators were used between genes, whereas the GPRuler rule was in line with the underlying biology, we labelled the GPR as “Corrected by GPRuler”. Finally, we considered as automatically reconstructed by GPRuler (labelled “Automatic”) both the GPRs labelled as “Perfect match” and those labelled as “Corrected by GPRuler”. The rules that cannot be correctly reconstructed by GPRuler without subsequent curation by the user are instead referred to as “Not automatic”. The proportions of Automatic and Not automatic rules are shown in Fig 5A as yellow and blue bars, respectively. In the following, we detail the results obtained for the four ground truth models.

Fig 5. Assessment of GPRuler performance in reconstructing GPRs of ground truth models.
Panel A shows a summary of GPRuler performance, highlighting the percentage of automatically reconstructed rules in each ground truth model (labelled “Automatic” and coloured in dark yellow) against those that cannot be correctly reconstructed without subsequent curation by the user (labelled “Not automatic” and coloured in light blue). The mosaic plots below show, for B) HMRcore, C) Recon3D, D) Yeast 7 and E) Yeast 8, the frequency of “Automatic” GPR rules as proportional to the size of the internal rectangles. The rectangle portions having low transparency correspond to the “Not automatic” rules. GPRs are classified according to the type of relationships established among the genes involved in reaction catalysis as “No gene”, “One gene” and “Multi gene”. In the “Multi gene” class, the three subclasses “OR”, “AND” and “Mixed” are also represented. On the horizontal side of each mosaic plot, the proportion of “No gene”, “One gene” and “Multi gene” Automatic GPR rules in each model is reported. On the vertical side of each mosaic plot, the same information is reported for the three subclasses “OR”, “AND” and “Mixed” as a percentage of the Automatic Multi gene rules. https://doi.org/10.1371/journal.pcbi.1009550.g005

Performance evaluation of GPRuler on a manually curated core model

We first assessed the performance of GPRuler in reconstructing the GPR rules of the HMRcore model, a core model of central carbon metabolism that we extracted from the genome-wide HMR metabolic model [43] and subsequently introduced and curated in [36, 44–46]. We chose this model as ground truth because the GPR rules associated with its reactions are manually curated. Moreover, given the huge size of genome-scale models and the difficulty of checking all the included GPRs, starting from a core model allowed tighter control of the output.
The HMRcore model consists of 324 reactions. The largest share of them (43.8%) is associated with single-gene rules, while the remaining rules are classified as “No gene” (23.8%) and “Multi gene” (32.4%). Applying GPRuler to the reactions included in HMRcore and comparing the resulting rules with those stored in the original model produced a perfect match for 158 of the 324 examined reactions, corresponding to 48.8% of the total. Considering the category of each rule, we correctly inferred 77.9% of the “No gene” rules (60 out of a total of 77), 47.2% of the “One gene” rules (67 out of a total of 142), 15.4% of the “AND” rules (2 out of a total of 13), 37.7% of the “OR” rules (29 out of a total of 77), and none of the 15 “Mixed” rules.

Of the 166 wrongly obtained GPR rules (representing 51.2% of the HMRcore reactions), 119 false negative rules turned out to be replaceable with those generated through our methodology, because the latter are in line with the information stored in the exploited databases. Following manual curation of the ground truth model, the percentage of correctly predicted GPR rules of HMRcore increased from 48.8% to 85.5% (as shown in Fig 5A), reaching 100% of the GPRs that can be automatically reconstructed. In particular, we curated 17 rules belonging to the “No gene” category, increasing the corresponding coverage to 100% (77 out of a total of 77); 68 rules belonging to the “One gene” category, increasing the corresponding coverage to 95.1% (135 out of a total of 142); 1 rule belonging to the “AND” category, increasing the corresponding coverage to 23.1% (3 out of a total of 13); 33 rules belonging to the “OR” category, increasing the corresponding coverage to 80.5% (62 out of a total of 77); and none of the 15 “Mixed” rules.
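The per-category figures reported for HMRcore can be cross-checked with a few lines of Python. This is only a consistency check on the numbers quoted in the text, not part of GPRuler; the variable and function names are illustrative. The “OR” class is taken as 29 of 77, which is the reading consistent with the reported 37.7% and with the 324-reaction total.

```python
# category: (total rules, perfect matches, rules curated in the ground truth)
counts = {
    "No gene": (77, 60, 17),
    "One gene": (142, 67, 68),
    "AND": (13, 2, 1),
    "OR": (77, 29, 33),
    "Mixed": (15, 0, 0),
}


def pct(part, whole):
    """Percentage rounded to one decimal, as reported in the text."""
    return round(100 * part / whole, 1)


total = sum(t for t, _, _ in counts.values())
perfect = sum(p for _, p, _ in counts.values())
automatic = sum(p + c for _, p, c in counts.values())

# (coverage before curation, coverage after curation) for each category
per_category = {name: (pct(p, t), pct(p + c, t)) for name, (t, p, c) in counts.items()}

print(total, perfect, automatic)
print(pct(perfect, total), pct(automatic, total), pct(total - automatic, total))
for name, (before, after) in per_category.items():
    print(f"{name}: {before}% -> {after}%")
```

The totals reproduce the 158 perfect matches (48.8%), the 119 curated rules, and the 85.5% of Automatic rules, leaving 14.5% of rules that are not automatically reconstructed.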
The remaining 14.5% of the rules produced by GPRuler, depicted in Fig 5A as a blue bar, corresponds to the cases labelled as “Not automatically reconstructed by GPRuler” and represents the true mismatches obtained. In Fig 5B, the frequency of Automatic rules is detailed for each class of GPRs. The execution time of GPRuler on HMRcore was 2 hours and 24 minutes. The complete list of GPR rules generated by GPRuler for the HMRcore model is available in the S2 File.

[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009550