(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright ------------ Network-based protein-protein interaction prediction method maps perturbations of cancer interactome ['Jiajun Qiu', 'Shanghai Children S Hospital', 'Shanghai Institute Of Medical Genetics', 'Shanghai Jiao Tong University', 'Shanghai', 'Nhc Key Laboratory Of Medical Embryogenesis', 'Developmental Molecular Biology', 'Shanghai Key Laboratory Of Embryo', 'Reproduction Engineering', 'Kui Chen'] Date: 2022-01 Abstract The perturbations of protein-protein interactions (PPIs) were found to be the main cause of cancer. Previous PPI prediction methods which were trained with non-disease general PPI data were not compatible to map the PPI network in cancer. Therefore, we established a novel cancer specific PPI prediction method dubbed NECARE, which was based on relational graph convolutional network (R-GCN) with knowledge-based features. It achieved the best performance with a Matthews correlation coefficient (MCC) = 0.84±0.03 and an F1 = 91±2% compared with other methods. With NECARE, we mapped the cancer interactome atlas and revealed that the perturbations of PPIs were enriched on 1362 genes, which were named cancer hub genes. Those genes were found to over-represent with mutations occurring at protein-macromolecules binding interfaces. Furthermore, over 56% of cancer treatment-related genes belonged to hub genes and they were significantly related to the prognosis of 32 types of cancers. Finally, by coimmunoprecipitation, we confirmed that the NECARE prediction method was highly reliable with a 90% accuracy. Overall, we provided the novel network-based cancer protein-protein interaction prediction method and mapped the perturbation of cancer interactome. NECARE is available at: https://github.com/JiajunQiu/NECARE. Author summary Protein-protein interaction (PPI) network is the biological foundation for the normal function of cells, while the perturbation of this network can result in the pathological state, such as cancer. Notably, the perturbation of PPI network in cancer not only involves in the destruction of old PPI, but also the reconstruction of new PPIs. However, due to the limit of tools, instead of the real physical interaction between proteins, previous cancer network researches only focus on the co-expression relationships. Now, with the development of computational biology, we established a novel cancer specific physical PPI prediction method dubbed NECARE, which was based on relational graph convolutional network (R-GCN) with knowledge-based features. It can infer the PPI in cancer from a general network. And we reveal the cancer PPI interactome by doing high-throughput analysis with NECARE. Also, many cancer hub genes were identified during the analysis, which were enriched for cancer network perturbations. Future studies can benefit from both our method itself and the results of our analysis. Citation: Qiu J, Chen K, Zhong C, Zhu S, Ma X (2021) Network-based protein-protein interaction prediction method maps perturbations of cancer interactome. PLoS Genet 17(11): e1009869. https://doi.org/10.1371/journal.pgen.1009869 Editor: Jianfeng Pei, Peking University, CHINA Received: February 22, 2021; Accepted: October 9, 2021; Published: November 2, 2021 Copyright: © 2021 Qiu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All relevant data are within the manuscript and its Supporting Information files. Funding: J.Q. was Sponsored by Shanghai Sailing Program (21YF1424200) and was supported by the the program of the China Scholarships Council (CSC201606230244, https://www.csc.edu.cn/) and by the grants from the National Key Research and Development Program of China (2019YFA0801402), the National Natural Science Foundation of China (81971421), Shanghai key clinical specialty project (shslczdzk05705). C.Z was supported by a grant from the Outstanding Leaders Training Program of Pudong Health Bureau of Shanghai (PWR12018-07, http://www.pudong.gov.cn/wjw/) and the Key Discipline Construction Project of Pudong Health Bureau of Shanghai (PWZxk2017-23, PWYgf2018-05, http://www.pudong.gov.cn/wjw/). X.M. was supported by the the program of the China Scholarships Council (CSC201708080070, https://www.csc.edu.cn/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist. Introduction Cells are biological systems that employ a large number of genes and signaling pathways to coordinate multiple functions [1]. Therefore, instead of acting in isolation, genes interact with each other and work as part of complex networks [2]. The completeness of these networks is the foundation of the normal biological systems, while perturbation of them can result in the pathological state. Recent studies have already found network perturbation is the cause of cancers, rather than the dysregulation of single proteins [2]. Protein network in cancer is perturbed by many factors, one of which could be mutations. Disease-causing mutations can not only produce a mutated gene and thus a mutated protein, but also disturb the interactions between the mutated protein and its normal molecular partners [3]. Additionally, distinct mutations will cause different molecular defects in proteins, and they may lead to distinct perturbations of protein networks, giving rise to distinct phenotypic outcomes [4]. Nonsense mutations that grossly destabilize a protein structure can be modeled as removing a protein node from the network (Fig 1A). Alternatively, missense mutations may give rise to partially functional protein products with specific changes in distinct biophysical or biochemical interactions (Fig 1B) [4]. Furthermore, studies have already found that missense mutations in cancer are more likely to occur on the interaction interface of proteins. Thus, network perturbation, instead of single protein dysregulation, has been found to be the reason for human diseases, especially cancers [5]. For example, in cancer, TP53, a well-known tumor suppressor protein (Fig 1C), loses many interactions with other important proteins, such as PTEN and MDM2 [6]. However, new proteins, such as CDK4, have been discovered to interact with TP53. In the normal network, the cross-talk line from TP53 to CDKN2A is TP53-MDM2-CDKN2A, but in cancer, the cross-talk line is TP53-CDK4-STK11-CDKN2A [7]. Therefore, in cancer, mutations lead to reconstruction of the protein network rather than the simple destruction, making the protein network in cancer tissues very different from that in normal tissues. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 1. Illustration of the perturbation of the protein relationship network and NECARE algorithm. Panel A-C introduce the concept of protein network perturbation. (A) Each node represents a protein. Mutations such as nonsense mutations could cause the node to be totally inactive or absent (red) and lose all the edges connected to this node (gray dashed edges). (B) Each node represents a protein. Mutations such as missense mutations could cause the gain or loss of specific edges (purple edges mean the new gained edges due to the mutations; gray dashed edge means lost interaction), while the center node is not totally inactive. (C) This is an example of the perturbation of the protein relationship network in cancer. The example is based on the KEGG database (6). Gray dashed edges are the interactions that are lost in cancer, and purple edges are the new interactions in which genes are involved in cancer. Panel D is a simple example to show how we represent the gene (red node) by NECARE with R-GCN. Nodes a-e and the red node represent different genes, and the red node is set as the target gene. Nodes a-e are all in contact with the red node, and different colored edges represent different types of interactions. First, each node is represented by a feature vector that contains three parts: (tan: OPA2Vec; salmon: TCGA-based expression feature; and taupe: TCGA-based mutation feature). Then, to represent the red node, the feature vectors are gathered and transformed for each relation type individually (for both in- and out-edges; also, a self-loop is included). The resulted representation (vertical rectangles with different colours for different relationship types) is summed up and passed to an activation function (ReLU). https://doi.org/10.1371/journal.pgen.1009869.g001 There have been some studies about cancer network perturbations [2,8–11]. For example, James West et al. tried to identify genes with network perturbations by calculating the network entropy [10]. Maxim Grechkin et al. also identified perturbed genes through inferred gene regulators and their expression [2]. As these studies were based on only the coexpression of genes, their network was more likely to reflect the relationships (expression and repression) between transcriptional factors and their targets. However, these studies failed to consider physical relationships such as protein-protein interactions (PPIs), which are significantly different from coexpression networks based on topological comparisons [12]. As to PPIs, there has already existed different kinds of PPIs prediction methods, but they are only for non-disease situation. Generally, they fall into three categories: 1) Structure-based methods, which are based on the 3D structure of proteins and limited to proteins with PDB structures [13–16]. Structure-based methods are better at predicting physical interactions. 2) Sequence-based prediction methods, which attempt to predict interactions by the sequences of two candidate proteins [17–20]. 3) Network-based methods that predict interactions based on the known network. Unlike other methods which only consider two candidate proteins, network-based methods also consider their known neighbors [21–27]. In our study, we established a novel cancer PPI prediction method, dubbed NECARE (network-based cancer PPI prediction), to investigate the whole cancer PPI map. Here we applied a relational graph convolutional network (R-GCN) with knowledge-based features. One crucial novelty of this work is that, unlike previous network-based node relationship prediction algorithms, NECARE considers the type and direction of gene links at the input space, so that NECARE is able to infer the possible PPIs through gene relationships such as activation, expression, and phosphorylation. And NECARE was found to outperform the other algorithms (both network- and sequence-based algorithms) in predicting cancer PPIs. Thus, our tool can help other researchers to determine the possible upstream and downstream molecular partners of their target proteins in cancer. Furthermore, we mapped the cancer interactome and analyzed the perturbations of PPIs in cancer with NECARE. We found that the PPI perturbations were enriched in some specific genes that were defined as cancer hub genes in our study. These hub genes were significantly related to the prognosis of 32 types of cancers. Many of these hub genes have already been well studied in previous cancer studies or served as drug targets. These findings indicated that our results can potentially provide the targets for future cancer studies. Finally, we selected 20 pairs of PPIs and verified the interaction of 18 pairs by coimmunoprecipitation, which demonstrated that NECARE prediction method was highly reliable with a 90% accuracy. Discussion Previous studies have already found that somatic missense mutations were significantly enriched in PPI interfaces compared to non-interfaces and those mutations would have “edgetic” effect to alter the PPIs [37,38]. Meanwhile, some other study confirmed several co-expression network perturbations in cancer [2]. All these results indicated that the PPI network in cancer might be different from that in non-disease situations. In our study, we used R-GCN to establish a PPI prediction method, NECARE, which is specific for cancer. In the biological cell system, instead of isolation, genes act as a complex network. Genes may be regulated by others, control the expression of many other genes, or function together with other genes. Our model simulated this biological system by using a R-GCN, which uses the gene network information containing directions and types to predict the PPIs in cancer. Then, we compared our method with other two kinds of algorithms: 1) sequence-based methods and 2) network-based methods. Our system outperformed all other algorithms in the task of predicting PPIs in cancer. Sequence-based, state-of-the-art methods, such as PPI-Detect and PIPR [19,20], achieved good performance in PPI prediction of non-disease condition but failed in our cancer-specific task. Since proteins were acting as a network complex, the disorder information would be broadcasted among the network. And the interaction between two proteins may also be affected by their neighbors in the network. Therefore, sequence-based methods which only considered the input proteins themselves may not be very specific for cancer PPI prediction. This is also the reason why we used network-based algorithm combined with knowledge-based features such as OPA2Vec. Our system with R-GCN can use the information of types and directions of gene relationship to predict PPIs in cancer, while other network-based algorithms are not able to do so. Thus, our method is currently the best solution for cancer PPIs prediction. With the help of NECARE, we identified 1293 cancer hub genes that were enriched with network perturbations in cancer. As gene network perturbation was already found to be the main reason for cancer, these cancer hub genes should be the focus of the pathological mechanisms and treatment targets. Indeed, we found that a high mutation score of hub genes was significantly related to a poor prognosis of 32 different types of cancers. Almost half of the cancer treatment-related genes in the database TARGET were hub genes in our study. Thus, these hub genes we identified have a high potential to be the drug design targets for cancer treatment and the other clinical research. In addition, as mentioned before, we classified the hub genes into three types: Type 1 (gained links), Type 2 (lost links), and Type 3 (both gained and lost links). Unexpectedly, a lot of famous cancer genes were Type1 hub genes, and previous clinical studies also focused more on these hub genes. This phenomenon may be corresponding to the fact that cancer cells have their special characteristics, like limitless replicative potential, sustained angiogenesis and tissue invasion and metastasis. Gained links of genes in the network will lead to the new functions of the whole cellular system, which can in some extent explain the behavioral characters of cancer cells. This can also explain why previous clinical studies also focused more on these hub genes. Targeting the newly established PPI in cancer cells may inhibit the new functions obtained by them, which can further block the uncontrolled proliferation, migration and invasiveness of cancer cells. Actually, there are also some famous cancer related genes, which not only get a lot of new interactions but also lose some links with other genes in cancer network. These results are corresponding to the previous studies that, instead of the simple destruction, cancer mutations lead the reconstruction of the PPI network and those mutations located in PPI interfaces are highly correlated with patient survival [7,37]. So, as a new perspective of cancer research which may lead to a better understanding of the pathological mechanism of cancer, we should also focus on how the cancer genes reprogram the PPI network with both the links they lose and the new interaction they get. Maybe this will provide a treatment strategy for those intractable cancers. Overall, in our study, we established the first cancer-specific PPI prediction method. With the help of our new method, we analyzed PPI network perturbations in cancer and identified cancer hub genes. Our method provides a powerful tool for biology researchers and clinicians to find possible interacting partners of their input proteins in cancer. They can also choose to focus their research on the cancer hub genes identified by our method to develop new targets for cancer treatment. Materials and methods General gene relationship data To predict cancer PPIs with R-GCN, we need to build a knowledge graph which contained information of the relationship between genes (Fig 2). In order to build the knowledge graph, we extracted the general gene network data from the following three databases:1) STRING [39], a famous database for known protein-protein associations, from which we extracted data about the experimental annotated human protein-protein associations; 2) Kyoto Encyclopedia of Genes and Genomes (KEGG) [6], a well-known publicly accessible pathway database, from which we extracted human non-disease pathway; and 3) HIPPIE [40], which contains experimentally detected PPIs from IntAct [41], MINT [42], BioGRID [43], HPRD [44], DIP [45], BIND [46] and MIPS [47]. Overall, our general gene relationship data contained 551850 pairs of interactions (S3 Table). The whole dataset is available from (github.com/JiajunQiu/NECARE/dataset/NECARE.graph). Cancer protein-protein interaction data Cancer protein-protein interaction data served as the training data for the R-GCN (Fig 2). We obtained cancer PPI data from the KEGG and Reactome databases [6,48], which served as the positive training set. We also included the OncoPPI database [7], which is an experiment-based cancer-specific PPI database, in our positive training set. The negative training data were the pairs of relationships with “disassociation/missing interaction” or other negative annotations in the KEGG cancer related pathways. Overall, we have 933 positive interactions (links) and 1308 negative interactions (links). The whole dataset is available from (github.com/JiajunQiu/NECARE/dataset/NECARE_TrainingData.txt). The 5-fold cross-validation We applied a 5-fold cross-validation approach for the training process (Figs 2 and S2). Technically, we divided the training set into five parts. In each rotation, we used three of the five parts for training, one for cross-training (optimize hyperparameters, including number of hidden units in neural network, early stop, etc.), and one for testing. Overall, we train the models with different hyperparameters and features on training set, and we picked the combination with best performance on the cross-training set (S4 Table). Finally, we evaluated the final performance on the testing set. The testing set was never used in the hyperparameter optimization and feature selection. Relational graph convolutional networks Graph convolutional networks (GCNs) can be understood as special cases of a simply differentiable message-passing framework. Information can be obtained from the neighbors of each node in the GCN. The R-GCN is an extension of the GCN [49]. It accounts for the edge type and direction and can calculate the forward-pass update of an entity or node denoted in relational (directed and labeled) multigraphs [49] (Fig 1D). (1) In Eq 1, if we define the directed and labeled multigraphs as with the nodes defined as , labeled edges as , and edge type as , then is the hidden state of node v i in the i-th layer of the neural network. denotes the set of neighbor indices of node v i under the relation C i,r is a normalization constant, which is defined as the degree of the target node of an edge. is a form of weight sharing among different relation types, and is a weight matrix for the linear message transformation. The incoming messages from neighbors are accumulated and then passed through an activation function σ such as ReLU [49]. Therefore, in our study, instead of only considering the gene itself, information about each gene was obtained from other genes that contacted it. Regarding to the feature we used to train the model, it was a combination of two parts. Part one was the OPA2Vec vector of each gene, which was a knowledge-based feature [50]. OPA2Vec is a tool that can be used to produce feature vectors for biological entities from ontology. OPA2Vec used mainly metadata from the ontology in the form of annotation properties as the main source of data. In this study, we used the OPA2Vec pretrained model based on PubMed data, and the annotation file was downloaded from http://purl.obolibrary.org/obo/go.owl. Part two was the cancer-specific feature based on The Cancer Genome Atlas (TCGA), including the expression profile of each gene in 32 different types of cancer and the mutation rate among patients for each type of cancer. Performance evaluation We evaluated the performance of the prediction via a variety of measures. For simplicity, we used the following standard annotations: true positives (TP) were the correctly predicted gene relationships in cancer, while false positives (FP) were the gene pairs that had no links in cancer and were incorrectly predicted to have interactions. True negatives (TN) were the correctly predicted noninteractions, and false negatives (FN) were the gene pairs that had interactions but were not correctly predicted. (2) We also calculated the Matthews correlation coefficient (MCC) and area under the curve (AUC): (3) Error estimates Error rates for the evaluation measures were estimated by bootstrapping (without replacement to render more conservative estimates), i.e., by resampling the set of samples used for the evaluation 1000 times and calculating the standard deviation of those 1000 different results. Each of these sample sets contained 50% of the original samples (picked randomly again, without replacement). Comparison with other methods and the independent data set The comparison with other methods were conducted on both training and independent dataset. The independent dataset was created based on literature-curated experiment results, which contains overall 229 cancer PPI annotations (github.com/JiajunQiu/NECARE/dataset/NECARE_IndependentData.txt). And we compared two different kinds of PPI prediction methods and fed them with related inputs: 1) sequence-based methods. Sequence-based methods took the sequences of two proteins as input and used the features such as chemical-physical properties of amino acids (Method: PPI-Detect) to predict the interaction between proteins. 2) Network-based methods. Network-based methods took the mapped interaction network as input and exploited the patterns characterizing the network to identify the interaction among the nodes. For example, method L3 predicted the interaction between two nodes by using paths of length 3 which connects two nodes in the input network. Cancer hub gene identification Cancer hub genes were defined as those genes that significantly lost (or gained) links in the cancer network, compared with the general network. Thus, to identify the cancer hub genes, we need two different networks: cancer PPI network and non-disease general network. Cancer PPI network was predicted by NECARE, while the general PPI network was defined by two parts:1) first, we extracted the literature-based general PPI network from the general gene network which was used in the training process of NECARE; 2) Literature curated interactomes of PPIs, which have excellent replicability, but are impacted by selection biases. To solve such problem, according to the previous publication [22], we also consider interactomes emerging from systematic screens, that lack such biases [51–54]. We used the cancer gene links connecting with an equal likelihood at the genes in the network as a null model. We assumed that, for a particular gene (node) to be called a putative hub gene, more links (gained or lost) must connect to that gene than expected by chance if the links were randomly connected to the genes in the network. Randomly, the frequency of links connected to any particular residues followed a binomial distribution: (4) where n is the 2x total number of links, k is the number of links connecting to a particular node, p is the probability of any individual link connecting at a particular node, and P (m = k) is precisely the probability of observed k links at a single node. Since our null model assumes an equal likelihood of links at any node, we used p = 2/L, where L is the overall number of unique nodes in the network. Thus, to assign a probability to the observation of k links connecting at a particular node by chance (i.e., a P-value), we calculated the probability of at least k links connecting at a particular node from our null model: (5) To correct for and test multiple hypotheses, the p-values for all considered hub genes were adjusted using the Bonferroni correction method. Eigenvector centrality was a measure of the influence of a node in a network. The regular eigenvector centrality of each gene in the network was the eigenvector of the adjacency matrix with the largest unique eigenvalue. Here, in our study, we applied a variant of eigenvector centrality [55]. The final centrality values followed the SoftMax probability: any node that you randomly picked up would reach a certain node in the network. Clinically related cancer genes Cancer genes related to clinical treatment were downloaded from the Tumor Alterations Relevant for GEnomics-driven Therapy (TARGET) database (https://software.broadinstitute.org/cancer/cga/target). TARGET (tumor alterations relevant for genomics-driven therapy) is a database of genes that, when somatically altered in cancer, are directly linked to a clinical action. TARGET genes are associated with response or resistance to a therapy, diagnosis, and/or prognosis. Survival analysis of hub genes To assess the association of hub genes with survival outcomes, we obtained the mutation and clinical prognosis data of 32 different types of cancers from the TCGA (S5 Table). For each cancer, we first calculated hazard ratios (HRs) and P-values (log-rank test) for each involved gene by Cox proportional hazards regression analysis using the coxph function of the R survival package (v. 2.37.2). Then, for each cancer, we integrated the hub genes with a significant P-value (cutoff: 0.05) into a combined mutation score (MS): (6) where M j is whether gene j is mutated in the tumor sample of the patient (1 for mutated and 0 for nonmutated) and W j is set to 1 or -1, depending on the HR of each gene (1 for HR ≥ 1 and -1 for HR<1). The median value (50%) or the automatically selected best cutoff value of the MS was used to divide the corresponding patients into high- and low-MS groups for Kaplan–Meier analysis of their association with the 10-year survival. Acknowledgments We want to thank Prof. Jingbin Yan and Dr. Luksa Popovic for helping revise the manuscript. [END] [1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009869 (C) Plos One. "Accelerating the publication of peer-reviewed science." Licensed under Creative Commons Attribution (CC BY 4.0) URL: https://creativecommons.org/licenses/by/4.0/ via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/