(C) PLOS One. This story was originally published by PLOS One and is unaltered.
HypDB: A functionally annotated web-based database of the proline hydroxylation proteome [1]
Yao Gong; Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota at Twin Cities, Minneapolis, Minnesota, United States of America; Bioinformatics and Computational Biology Program
Date: 2022-08
Proline hydroxylation (Hyp) regulates protein structure, stability, and protein–protein interaction. It is widely involved in diverse metabolic and physiological pathways in cells and diseases. To reveal functional features of the Hyp proteome, we integrated various data sources for deep proteome profiling of the Hyp proteome in humans and developed HypDB ( https://www.HypDB.site ), an annotated database and web server for the Hyp proteome. HypDB provides site-specific evidence of modification based on extensive LC-MS analysis and literature mining with 14,413 nonredundant Hyp sites on 5,165 human proteins including 3,383 Class I and 4,335 Class II sites. Annotation analysis revealed significant enrichment of Hyp on key functional domains and tissue-specific distribution of Hyp abundance across 26 types of human organs and fluids and 6 cell lines. The network connectivity analysis further revealed a critical role of Hyp in mediating protein–protein interactions. Moreover, the spectral library generated by HypDB enabled data-independent analysis (DIA) of clinical tissues and the identification of novel Hyp biomarkers in lung cancer and kidney cancer. Taken together, our integrated analysis of the human proteome with the publicly accessible HypDB revealed the functional diversity of Hyp substrates and provides a quantitative data source to characterize Hyp in pathways and diseases.
Proline hydroxylation (Hyp), first discovered in 1902, is an important protein posttranslational modification (PTM) pathway in cellular physiology and metabolism [1–4]. As a simple addition of a hydroxyl group to the imino side chain of a proline residue, the modification is evolutionarily conserved from bacteria to humans. In mammalian cells, Hyp is largely mediated through the enzymatic activities of 2 major families of prolyl hydroxylases: collagen prolyl 4-hydroxylases (P4HAs) [5–7] and hypoxia-inducible factor (HIF) prolyl hydroxylase domain (PHD) proteins [8–12]. No enzymes capable of removing protein-bound Hyp are known yet. Since the activity of prolyl hydroxylases depends on the cellular collaboration of multiple co-factors, including oxygen and iron, as well as several metabolites, such as alpha-ketoglutarate, succinate, and ascorbate, the Hyp pathway is an important metabolic-sensing mechanism in cells and tissues.
The most well-characterized Hyp targets are collagen proteins and the HIFα family of transcription factors. Hyp on collagens mediated by P4Hs is critical to maintaining the triple-helical structure of the collagen polymer and enabling proper protein folding after translation. Indeed, adding an electronegative oxygen at the proline 4R position promotes the trans-conformation and stabilizes the secondary structure of collagen [1]. Inhibition of collagen Hyp destabilizes collagen and prevents its export from the ER, thereby inducing cell stress and death [13–15]. HIFα transcription factors are essential mediators of the hypoxia response in mammalian cells [16–18]. Under normoxic conditions, Hyp of HIFα proteins mediated by PHD proteins is recognized by pVHL in the Cullin 2 E3 ligase complex, which leads to rapid ubiquitination and degradation of HIFα proteins [19,20]. Hypoxia inhibits HIFα Hyp and degradation, enabling the transcriptional activation of over 100 hypoxia-responsive genes [21–23].
In the past 2 decades, numerous studies driven by advances in mass spectrometry-based proteomics technology have reported the identification and characterization of diverse new Hyp targets and the important roles of the modification in physiological functions [24–29]. Hyp has been well known to affect protein homeostasis, and the classic example is the PHD-HIF-pVHL regulatory axis. A similar mechanism also regulates the turnover of diverse key transcriptional, metabolic, and signaling proteins, including β2AR, NDRG3, ACC2, EPOR, G9a, and SFMBT1 [30–34]. In addition to pVHL-mediated protein degradation, Hyp also regulates substrate degradation by affecting its interaction with deubiquitinases. For example, the hydroxylation of Foxo3a promotes substrate degradation by inhibiting the interaction with the deubiquitinase Usp9x, and hydroxylation of p53 enhances its interaction with the deubiquitinases Usp7/Usp10 to prevent its rapid degradation [35,36]. P4H-mediated Hyp has also been known to regulate the stability of diverse substrates including AGO2 and Carabin [37,38]. In addition to protein degradation, Hyp can also affect protein–protein interaction to regulate signaling and transcriptional activities. For example, PKM2 hydroxylation promotes its binding with HIF1A for transcriptional activation, Hyp of AKT enhances the interaction with pVHL to inhibit the kinase activity of AKT, and PHD1-mediated hydroxylation of Rpb1 is necessary for its translocation and phosphorylation [39–42]. More recently, TBK1 hydroxylation was identified and found to induce pVHL and phosphatase binding, which decreases its phosphorylation and enzyme activity, while the loss of pVHL hyperactivates TBK1 and promotes tumor development in clear cell renal cell carcinoma (ccRCC) [27,43].
Despite these advances, an integrated and annotated knowledgebase dedicated to Hyp is lacking, and the functional diversity and physiological significance of this evolutionarily conserved metabolic-sensing PTM pathway remain underappreciated. To fill the knowledge gap, we developed a publicly accessible Hyp database, HypDB ( http://www.HypDB.site ) ( S1 Fig ). HypDB provides 3 main features: first, a classification-based algorithm for confident identification of Hyp substrates; second, integrated resources based on exhaustive manual literature mining, large-scale LC-MS analysis, and a curated public database; and third, a large spectral library for LC-MS-based site-specific identification from a variety of cell lines and tissues. Furthermore, stoichiometry-based quantification of Hyp sites allows quantitative comparison of site abundance across various proteins and tissues, and the extensively annotated Hyp proteome enables deep bioinformatic analysis, including network connectivity, structural domain enrichment, and tissue-specific distribution studies. The online database system allows community-driven submission of LC-MS datasets for inclusion in HypDB annotation and the direct export of precursor and fragmentation information with the spectral library, which enables the development of targeted quantitative proteomics and data-independent analysis workflows. We hope that HypDB will provide critical insights into the functional diversity and network of the Hyp proteome and aid further mechanistic studies on the physiological roles of this metabolic-sensing PTM pathway in cells and diseases.
2. Results
2.1. Database construction and analysis workflow
To construct a bioinformatic resource for metabolic-sensing Hyp targets, we developed HypDB, a MySQL-based relational database on a publicly accessible web server (Figs 1 and S2). It was constructed based on 3 main resources to comprehensively annotate the human Hyp proteome (Fig 1).
First, manual curation of the literature through PubMed (search term: “proline hydroxylation”; publication dates between 2000 and 2021) was performed by 2 independent curators and yielded 1,287 research journal articles. Site identifications were extracted from each journal article, and the corresponding protein was mapped to a UniProt protein ID where possible. Manual curation of the research articles focused on sites that were biochemically investigated with multiple lines of evidence, including mass spectrometry, mutagenesis, western blotting, and in vitro or in vivo enzymatic assays. The analyzed Hyp site identifications were then matched against the existing data in the database to reduce redundancy.
Second, the database included extensive LC-MS-based direct evidence of Hyp site identifications based on the integrated analysis of over 100 LC-MS datasets from various human cell lines and tissues (see Experimental methods). The datasets were either downloaded from publicly accessible servers or produced in-house. Each dataset was analyzed through a standardized workflow using the MaxQuant search engine, and the Hyp site identifications were filtered and imported into HypDB with a streamlined bioinformatic analysis pipeline detailed below. Our collection of MS-based evidence of Hyp identifications from cell lines and tissues likely captured a significant portion of the Hyp sites that can be identified by deep proteomic analysis, as evidenced by our observation that the rate of unique Hyp site addition from each dataset decreased significantly as more datasets were added to the database (S2B Fig).
Third, HypDB also integrated Hyp identifications annotated in the public UniProt database. For clarity, the database records indicate whether a site was reported only by the UniProt database or by both UniProt annotation and evidence from large-scale LC-MS analysis.
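The merging of the three resources into nonredundant site records, with a flag for UniProt-only versus UniProt-plus-LC-MS support, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the source labels, and the placeholder protein identifiers are hypothetical and do not come from the HypDB code base or its MySQL schema.

```python
from collections import defaultdict


def merge_site_sources(lcms_sites, uniprot_sites, literature_sites):
    """Merge site lists into one nonredundant mapping: site -> set of source labels.

    Each input is an iterable of (protein_id, proline_position) tuples.
    A site reported several times, or by several sources, is stored once.
    """
    merged = defaultdict(set)
    labeled = (
        ("LC-MS", lcms_sites),
        ("UniProt", uniprot_sites),
        ("literature", literature_sites),
    )
    for label, sites in labeled:
        for protein_id, position in sites:
            merged[(protein_id, position)].add(label)
    return dict(merged)


def provenance(sources):
    """Summarize whether a site is supported by UniProt alone or also by LC-MS."""
    if sources == {"UniProt"}:
        return "UniProt only"
    if {"UniProt", "LC-MS"} <= sources:
        return "UniProt and LC-MS"
    return "other"


# Hypothetical placeholder identifiers, not real HypDB records.
merged = merge_site_sources(
    lcms_sites=[("PROT_A", 10), ("PROT_B", 25), ("PROT_A", 10)],
    uniprot_sites=[("PROT_A", 10), ("PROT_C", 7)],
    literature_sites=[("PROT_B", 25)],
)
for site, sources in sorted(merged.items()):
    print(site, sorted(sources), provenance(sources))
```

Keying the records on the (protein, position) pair is what removes redundancy in this sketch; the actual database may use a different key or additional matching rules.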
Fig 1. Workflow of establishing the HypDB database and web server. HypDB was constructed through deep proteome profiling analysis of human tissues and cell lines, manual literature mining, and integration with the UniProt data source. A classification-based algorithm was applied to extract confident identifications, and site-specific bioinformatic analysis with stoichiometry-based quantification revealed the biochemical pathways involving the human Hyp proteome. The MS-based Hyp library further enabled DIA-MS quantification of the Hyp proteome in cells and tissues. DIA, data-independent acquisition; Hyp, proline hydroxylation.
https://doi.org/10.1371/journal.pbio.3001757.g001
We implemented stringent criteria for data importing and classification from LC-MS-based identifications. Before import into HypDB, each LC-MS-based Hyp site identification from database search analysis was first analyzed by a classification-based algorithm to determine the confidence of Hyp site identification and localization (Fig 2A). The classification was performed using the best-scoring MS/MS spectrum of a Hyp site in each dataset analysis. The algorithm classified Hyp identifications that can be exclusively localized to a proline residue based on consecutive b- or y-ions as Class I sites. Hyp identifications that cannot be exclusively localized based on MS/MS spectrum analysis, but can be distinguished from 5 common types of oxidation artifacts (on methionine, tryptophan, tyrosine, histidine, and phenylalanine) mainly induced during sample preparation, were classified as Class II sites. Other Hyp identifications that were reported by the MaxQuant database search software (with a 1% false-discovery rate at the site level and a minimum Andromeda score of 40) were grouped as Class III sites.
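The three-class decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the "consecutive ions" check is one plausible reading (both fragment ions flanking the proline are observed in the b- or y-series), the test for distinguishing oxidation artifacts is passed in as a precomputed boolean, and the Andromeda score cutoff of 40 is assumed to apply to every class.

```python
from enum import Enum


class HypClass(Enum):
    CLASS_I = "Class I"
    CLASS_II = "Class II"
    CLASS_III = "Class III"


def flanked_by_consecutive_ions(site, peptide_length, b_ions, y_ions):
    """Return True if the residue at `site` (1-based) is bracketed by consecutive ions.

    b(site-1) and b(site), or y(n-site) and y(n-site+1), differ by exactly the
    residue at `site`, so observing either pair pins the mass shift to it.
    This reading of "consecutive b- or y-ions" is an assumption.
    """
    b_pair = {site - 1, site} <= set(b_ions)
    y_pair = {peptide_length - site, peptide_length - site + 1} <= set(y_ions)
    return b_pair or y_pair


def classify_site(localized_to_proline, distinct_from_artifacts,
                  andromeda_score, min_score=40.0):
    """Assign a confidence class, or None if the score cutoff is not met."""
    if andromeda_score < min_score:
        return None  # assumed: the search-engine cutoff applies to all classes
    if localized_to_proline:
        return HypClass.CLASS_I
    if distinct_from_artifacts:
        return HypClass.CLASS_II
    return HypClass.CLASS_III


# Hypothetical example: an 8-residue peptide with a candidate Hyp at position 4.
localized = flanked_by_consecutive_ions(site=4, peptide_length=8,
                                        b_ions={2, 3, 4}, y_ions={1, 2})
print(classify_site(localized, distinct_from_artifacts=False, andromeda_score=85.0))
```

Because each dataset is classified separately in the workflow described here, a caller would apply `classify_site` per dataset and keep the best class observed for each site.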
We further developed a site-localization score using the relative intensities of key fragment ions to index the level of confidence in site localization with MS/MS spectrum analysis for Class I and Class II sites (Experimental methods). Each dataset was analyzed by the classification algorithm separately, and the best classification evidence for each Hyp site was selected and reported on the HypDB website to indicate the confidence of site localization. The classification-based algorithm provides the specificity and reliability required for an accurately annotated database while maintaining all possible identifications as searchable records. The distributions of the localization credit score for Class I and Class II sites are shown in S2C and S2D Fig.
Fig 2. Substrate diversity of the human Hyp proteome. (A) Illustration of the classification-based algorithm to identify confident Hyp sites. (B) Venn diagram of Class I, II, and III Hyp sites identified from MS analysis and manually curated UniProt sites. (C) PTM regulatory enzymes identified as Hyp substrates. (D) Kinase tree classification showing the distributions of kinases as Hyp substrates in different kinase families, including AGC (named after the PKA, PKG, and PKC families), CAMK (led by calcium/calmodulin-dependent protein kinases), CK1 (cell kinase 1), CMGC (named after the CDK, MAPK, GSK, and CLK families), STE (homologs of the yeast STE counterparts), TK (tyrosine kinases), and TKL (tyrosine kinase-like). (E) Hydroxyproline proteins that interact with EGLN1. (F) Hydroxyproline proteins that interact with P4HA2. Refer to Sheet A in S2 Table and Sheets A–G in S3 Table for the underlying data of Fig 2B–2F. Hyp, proline hydroxylation; PTM, posttranslational modification.
https://doi.org/10.1371/journal.pbio.3001757.g002
To evaluate the site-specific prevalence of Hyp, a stoichiometry-based quantification strategy was integrated into the analysis workflow using previously established principles [27,44]. Briefly, the Hyp stoichiometry was calculated by dividing the summed intensities of the peptides containing the Hyp site identification by the total intensities of the peptides containing the same proline site in the dataset. HypDB recorded all available site-specific Hyp stoichiometry analyses from various cell lines and tissues, which allowed site-specific quantitative analysis of modification abundance across cell and tissue types. The median of all stoichiometry measurements for each site was calculated and reported on the HypDB website.
To further explore the functional association of the Hyp proteome, several bioinformatic annotation strategies were integrated into the analysis workflow as part of the data importing process. These stand-alone workflows include evolutionary conservation analysis, solvent accessibility analysis, and protein–protein interface analysis. Evolutionary conservation analysis compared the conservation of Hyp sites with that of other proline sites on the same protein and performed a statistical test to determine whether the Hyp site is more evolutionarily conserved than non-Hyp sites. Solvent accessibility analysis analyzed the sequence of the substrate protein with the DSSP package and calculated the likelihood of solvent accessibility for each Hyp site. Protein–protein interaction interface analysis extracted the domain interaction residues from the 3DID database based on PDB structure analysis and matched them against the Hyp sites in the database to identify Hyp sites that are localized in an interface and are more likely to interfere with protein–protein interaction. All information above was integrated into several tables linked through foreign keys, following the schema in S2A Fig.
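The stoichiometry calculation described in this subsection (Hyp-containing peptide intensity divided by the total intensity of peptides covering the same proline, followed by a median across measurements) can be sketched in Python as follows. The function names and the input layout are assumptions made for illustration, and the intensities are made-up numbers rather than HypDB data.

```python
from statistics import median


def site_stoichiometry(peptides):
    """Compute Hyp stoichiometry for one proline site in one dataset.

    `peptides` is an iterable of (intensity, is_hydroxylated) pairs, one per
    peptide covering the site. Returns None if no intensity was measured.
    """
    peptides = list(peptides)
    total = sum(intensity for intensity, _ in peptides)
    if total <= 0:
        return None
    modified = sum(intensity for intensity, is_hyp in peptides if is_hyp)
    return modified / total


def median_stoichiometry(measurements):
    """Median over all available per-dataset stoichiometry values for a site."""
    values = [value for value in measurements if value is not None]
    return median(values) if values else None


# Made-up intensities for one site measured in three hypothetical datasets.
datasets = [
    [(30.0, True), (70.0, False)],
    [(10.0, True), (90.0, False)],
    [(50.0, True), (25.0, False), (25.0, False)],
]
per_dataset = [site_stoichiometry(peptides) for peptides in datasets]
print(per_dataset, median_stoichiometry(per_dataset))
```

Reporting the median rather than the mean keeps a single dataset with an extreme ratio from dominating the site-level summary.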
Complete information on all Hyp sites was organized in 2 major tables: a redundant site table (S1 Table), which stored all Hyp sites identified in different tissues and cell lines, including annotated MS/MS spectra, site-specific abundance, and sample source information, and a nonredundant site table (Sheet A in S2 Table), which merged the LC-MS-based evidence from different sources at the site-specific level and also integrated the sites collected from UniProt and manual curation of the literature.
2.2. Validation of the Hyp site classification strategy
To validate our classification-based strategy for confident Hyp site identification, we performed a comparative analysis of Hyp site identifications from each class with manually curated UniProt Hyp identifications. Our analysis showed that the Class I sites alone covered over 60% of the sites annotated in UniProt, and a combination of Class I and II sites covered about 63% of the UniProt sites, while very few UniProt-annotated sites overlapped with the Class III sites (Fig 2B). This suggests that our Hyp site localization and classification algorithm allowed the collection of highly confident Hyp identifications and significantly improved the reliability of LC-MS-based Hyp site analysis. To further probe the current state of the Hyp proteome, we performed extensive bioinformatic analysis for functional annotation of the Hyp proteome based on the more confident Hyp site identifications in HypDB, which excluded Class III-only Hyp sites whose LC-MS evidence cannot distinguish them from potential oxidation artifacts.
[END]
---
[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001757
Published and (C) by PLOS One. Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/