(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------

From complete cross-docking to partners identification and binding sites predictions

['Chloé Dequeker', 'Sorbonne Université', 'Cnrs', 'Ibps', 'Laboratoire De Biologie Computationnelle Et Quantitative', 'Lcqb', 'Paris', 'Yasser Mohseni Behbahani', 'Laurent David', 'Elodie Laine']

Date: 2022-02

Proteins ensure their biological functions by interacting with each other. Hence, characterising protein interactions is fundamental for our understanding of the cellular machinery, and for improving medicine and bioengineering. Over the past years, a large body of experimental data has been accumulated on who interacts with whom and in what manner. However, these data are highly heterogeneous and sometimes contradictory, noisy, and biased. Ab initio methods provide a means to a “blind” protein-protein interaction network reconstruction. Here, we report on a molecular cross-docking-based approach for the identification of protein partners. The docking algorithm uses a coarse-grained representation of the protein structures and treats them as rigid bodies. We applied the approach to a few hundred proteins, in their unbound conformations, and we systematically investigated the influence of several key ingredients, such as the size and quality of the interfaces, and the scoring function. We achieved a significant improvement over previous work, and a very high discriminative power on some specific functional classes. We provide a readout of the contributions of shape and physico-chemical complementarity, interface matching, and specificity, in the predictions. In addition, we assessed the ability of the approach to account for protein surface multiple usages, and we compared it with a sequence-based deep learning method.
This work may contribute to guiding the exploitation of the large number of protein structural models now available toward the discovery of unexpected partners and their complex structure characterisation.

Proteins do not act alone, but perform their biological functions by interacting with each other. However, it is difficult to observe them directly in action, and to collect unbiased clear-cut data on their association. Here, we propose to exploit the protein 3D structures and models accessible nowadays to discover new interactions and alternative binding modes. We simulate the binding of many thousands of protein pairs, and estimate the interaction strength of each pair based on their geometric, physico-chemical and evolutionary properties. We measure proteins’ “sociability”, and identify a set of putative partners for each protein. We give some guidance for choosing the parameters, and we provide a readout of the predictions. Our approach can complement experimental data, and also predictions produced by machine learning methods relying on protein sequences.

Here, we present a general approach for the identification of protein partners and their discrimination from non-interactors based on molecular docking. Like our previous efforts [ 50 , 53 , 54 ], this work aims at handling large ensembles of proteins with very different functional activities and cellular localisations. Although these classes of proteins appear to have different behaviours, we approach the problem of partner identification from a global perspective. We report on the analysis of data generated by CC-D simulations of hundreds of proteins. We combine physics-based energy, interface matching and protein sociability, three ingredients we previously showed to be relevant to partner identification and discrimination [ 50 , 53 , 54 ]. We move forward by investigating what other types of information may be needed to improve the discrimination.
To this end, we systematically explore the space of parameters contributing to partner identification. These include the scoring function(s) used to evaluate the docking conformations, the strategy used to predict interacting patches and the size of the docked interfaces. We show that our approach, CCD2PI (for “CC-D to Partner Identification”), reaches a significantly higher discriminative power compared to a previous study addressing the same problem [ 53 ]. We demonstrate that this result holds true overall and also for individual protein functional classes. Our results emphasise the importance of the docking-inferred residue binding propensities to drive interface prediction, and the positive contribution of a statistical pair potential to filter docking conformations. We define a set of default parameter values, with minimal variations between the different classes, for practical application to any set of proteins. Importantly, we place ourselves in a context where we do not know the experimental interfaces and use predictions instead. To evaluate CCD2PI predictions, we consider structurally characterised interactions coming from the Protein Data Bank (PDB) [ 55 ] as our gold standard. We primarily consider the docking benchmark annotations [ 56 ], and we extend them by transferring knowledge from complex structures involving the same or very similar proteins. This strategy is supported by the observation that functional interfaces are conserved across closely related homologs [ 57 ]. Moreover, previous works from us and others have emphasised its biological pertinence and usefulness to evaluate protein-protein/DNA/RNA interface prediction methods [ 23 , 58 ]. We show that the protein interaction strengths computed by CCD2PI are in good agreement with available structural data. We discuss the implications of these strengths for protein functions. 
This work paves the way to the automated ab initio reconstruction of protein-protein interaction networks with structural information at the residue resolution. Since the reconstruction is based on docking calculations, it is not biased by specific targets or by the limitations of experimental techniques.

In principle, the estimation of systemic properties such as residue binding propensity and protein sociability should become more accurate as more proteins are considered in the experiment. But the problem of discriminating partners from non-interactors will also become harder. When dealing with several hundred proteins, the correct identification of the cognate partners requires very high accuracy, as they represent only a small fraction of the possible solutions. For instance, a set of 200 proteins for which 100 binary interaction pairs are known will lead to the evaluation of 40 000 possible pairs, and for each pair at least several hundred thousand candidate conformations will have to be generated and ranked.

In a large-scale docking experiment, hundreds or thousands of proteins are either docked to each other (complete cross-docking, CC-D) or to some arbitrarily chosen proteins. The generated data can be straightforwardly exploited to predict protein interfaces [ 23 , 44 – 47 ]. Indeed, randomly chosen proteins tend to dock to localised preferred regions at protein surfaces [ 48 ]. In this respect, the information gathered in the docking experiment can complement sequence- and structure-based signals detected within monomeric protein surfaces [ 23 ]. Beyond interface and 3D structure prediction, very few studies have addressed the question of partner identification. The latter has traditionally been regarded as beyond the scope of docking approaches.
However, an early low-resolution docking experiment highlighted notable differences between interacting and non-interacting proteins [ 49 ], and we and others [ 50 – 53 ] have shown that it is possible to discriminate cognate partners from non-interactors through large-scale CC-D experiments. An important finding of these studies, already stated in an earlier experiment involving 12 proteins [ 54 ], is that relying on the energy function of the docking algorithm is not sufficient to reach high accuracy. This holds true for shape complementarity-based energy functions [ 50 ], and also for those based on a physical account of interacting forces [ 53 , 54 ]. Nevertheless, combining the docking energy with a score reflecting how well the docked interfaces match experimentally known interfaces makes it possible to reach a very high discriminative power [ 53 ]. Moreover, the knowledge of the global social behaviour of a protein can help to single out its cognate partner [ 50 , 53 ]. That is, by accounting for the fact that two proteins are more or less sociable, we can lower or raise their interaction strength, and this procedure tends to unveil the true interacting partners [ 50 ]. This notion of sociability also proved useful to reveal evolutionary constraints exerted on proteins coming from the same functional class, toward avoiding non-functional interactions [ 50 ].

A related problem is the prediction of the 3D arrangement formed between two or more protein partners. This implies generating a set of candidate complex conformations and correctly ranking them to select those resembling the native structure. Properties reflecting the strength of the association include shape complementarity, electrostatics, desolvation and conformational entropy [ 28 ]. Experimental data and evolutionary information (conservation or coevolution signals) may help to improve the selection of candidate conformations [ 29 – 31 ].
To address this problem, molecular docking algorithms have been developed and improved over the past twenty years, stimulated by the CAPRI competition [ 32 – 36 ]. Nevertheless, a number of challenges remain, including the modelling of large conformational rearrangements associated with binding [ 32 , 37 , 38 ]. Moreover, homology-based modelling often leads to better results than free docking when high-quality experimental data is available.

In the past years, a lot of effort has been dedicated to describing the way in which proteins interact and, in particular, to characterising their interfaces. Depending on the type and function of the interaction, these may be evolutionarily conserved, display peculiar physico-chemical properties or adopt an archetypal geometry [ 10 – 20 ]. For example, DNA-binding sites are systematically enriched in positively charged residues [ 10 ] and antigens are recognized by highly protruding loops [ 12 ]. Such properties can be efficiently exploited toward an accurate detection of protein interfaces [ 10 – 12 , 21 – 27 ]. However, the large-scale assessment of predicted interfaces is problematic as our knowledge of protein surface usage by multiple partners is still very limited [ 23 ].

The vast majority of biological processes are ensured and regulated by protein interactions. Hence, the question of who interacts with whom in the cell and in what manner is of paramount importance for our understanding of living organisms, drug development and protein design. While proteins constantly encounter each other in the densely packed cellular environment, they are able to selectively recognise some partners and associate with them to perform specific biological functions. Discriminating between functional and non-functional protein interactions is a very challenging problem. Many factors may reshape protein-protein interaction networks, such as point mutations, alternative splicing events and post-translational modifications [ 1 – 5 ].
Conformational rearrangements occurring upon binding, and the prevalence of intrinsically disordered regions in interfaces further increase the complexity of the problem [ 6 – 9 ]. Ideally, one would like to fully account for this highly variable setting in an accurate and computationally tractable way.

Results

Computational framework

The workflow of CCD2PI is depicted in Fig 1. We exploit data generated by CC-D experiments performed on hundreds of proteins. In the present work, the CC-D was performed using the rigid-body docking tool MAXDo [54]. The proteins are represented by a coarse-grained model and the interactions between pseudo-atoms are evaluated using Lennard-Jones and Coulombic terms [42]. For each protein pair, MAXDo generated several hundred thousand candidate complex conformations (Fig 1, top left panel). Each of these conformations is evaluated by computing the product of the overlap between the docked interface (DI) and some reference interface (RI), a docking energy (either from MAXDo or another one, see Materials and methods), and a statistical pair potential [59] (optional). By formulating the score as a product, we effectively use the interface overlap, the docking energy and the pair potential as successive filters to select the best conformation. The rationale is that ideally, the selected conformation should meet all three criteria: match the expected interface, be energetically favourable, and reflect the amino-acid pairing preferences found in experimental complexes. For instance, let us consider a conformation displaying a perfect interface overlap, but with the interacting surface of the ligand rotated by 180° with respect to that of the receptor. It would have a very low fraction of native contacts, and we expect it to be correctly filtered out by the docking energy and/or the pair potential. We detect the DIs based on interatomic distances using our efficient algorithm INTBuilder [60].
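The two geometric steps described above, detecting a docked interface from interatomic distances and measuring how much of it falls inside a reference interface, can be illustrated with a minimal Python sketch. This is not INTBuilder, whose algorithm is not described in this section: it is a naive all-pairs scan. The function names, the data layout, the 6 Å cutoff and the choice of multiplying the receptor-side and ligand-side fractions are all assumptions made for illustration only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def docked_interface(rec_atoms, lig_atoms, cutoff=6.0):
    """Return the (receptor, ligand) residue sets in contact.

    Each atom is a (residue_id, (x, y, z)) tuple. A residue is counted in
    the docked interface when one of its pseudo-atoms lies within `cutoff`
    angstroms of a pseudo-atom of the other protein. The cutoff value is
    illustrative, not a recommended setting.
    """
    rec_iface, lig_iface = set(), set()
    for rec_res, rec_xyz in rec_atoms:
        for lig_res, lig_xyz in lig_atoms:
            if dist(rec_xyz, lig_xyz) <= cutoff:
                rec_iface.add(rec_res)
                lig_iface.add(lig_res)
    return rec_iface, lig_iface


def fraction_in_reference(docked, reference):
    """Fraction of docked-interface residues that belong to the reference."""
    return len(docked & reference) / len(docked) if docked else 0.0


def interface_overlap(rec_di, lig_di, rec_ri, lig_ri):
    """Overlap term for one conformation.

    ASSUMPTION: the two per-protein fractions are combined by a product.
    """
    return fraction_in_reference(rec_di, rec_ri) * fraction_in_reference(lig_di, lig_ri)


# Toy conformation with hypothetical residue ids and coordinates.
receptor = [("R1", (0.0, 0.0, 0.0)), ("R2", (10.0, 0.0, 0.0)), ("R3", (30.0, 0.0, 0.0))]
ligand = [("L1", (0.0, 0.0, 4.0)), ("L2", (12.0, 0.0, 3.0)), ("L3", (50.0, 0.0, 0.0))]

rec_di, lig_di = docked_interface(receptor, ligand)
overlap = interface_overlap(rec_di, lig_di, rec_ri={"R1", "R3"}, lig_ri={"L1", "L2"})
```

In this toy case half of the receptor's docked interface and all of the ligand's docked interface lie in their reference interfaces, so the overlap is 0.5. A production tool would replace the quadratic scan with a spatial index.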
To place ourselves in a realistic scenario, we predict the RIs, instead of extracting them from the known complex structures. Our predictive algorithm relies on sequence- and structure-based properties of single proteins [12], as well as a systemic property, namely residue binding propensities inferred from the CC-D [23] (see Materials and methods). Formally, given two proteins $P_1$ and $P_2$, we estimate the interaction index of $P_1$ with respect to $P_2$ as

$$II_{P_1,P_2} = \min\left(FIR_{P_1,P_2} \times E_{P_1,P_2} \times PP_{P_1,P_2}\right) \qquad (1)$$

where $FIR_{P_1,P_2}$ (Fraction of Interface Residues) is the fraction of the DIs composed of residues belonging to the (predicted) RIs for the two proteins, $E_{P_1,P_2}$ is the docking energy (negative value) and $PP_{P_1,P_2}$ is a pair potential score which may or may not be included in the formula. The latter evaluates the likelihood of the observed residue-residue interactions and might bring complementary information with respect to the docking energy. We use CIPS [59], a high-throughput software designed to swiftly reduce the search space of possible native conformations with a high precision. The minimum is computed over the whole set or a pre-filtered subset of docking conformations (see Materials and methods). One should note that in the general case, $II_{P_1,P_2}$ and $II_{P_2,P_1}$ come from two different docking runs and are not necessarily equal. This is because the receptor and ligand surfaces are not explored in an equivalent manner by the docking algorithm (see Materials and methods).

Fig 1. Principle of the method. We start from an all-to-all docking experiment (top left panel). Each protein is docked to all proteins in the set. By convention, in each docking calculation, we define a receptor and a ligand. The red patches on the protein surfaces correspond to predicted interfaces. For a given protein pair $P_1 P_2$, we generate a pool of conformations associated with energies (top middle panel).
Here, both the predicted interfaces and the docked interfaces are highlighted by patches, in red and purple respectively. One can readily see whether they overlap or not. The extent of this overlap (Fraction of Interface Residues) is multiplied by the docking energy to evaluate each docking conformation (bottom left panel). Optionally, we also consider a statistical pair potential in the formula. The best score is computed over all docking conformations and assigned to the protein pair. By doing the same operation for all pairs we compute a matrix of interaction indices (bottom right panel, the darker the higher). If the receptor and the ligand play equivalent roles in the docking calculations, then the matrix will be symmetrical. Otherwise, two different docking calculations are performed for each protein pair $P_1 P_2$ and the matrix will be asymmetrical, as shown here. These indices are then normalised to account for proteins’ global social behaviour, hopefully allowing for singling out the cognate partners (top right panel). In the example here, the cognate pairs are ordered on the diagonal. https://doi.org/10.1371/journal.pcbi.1009825.g001

The computed interaction indices (Fig 1, matrix at the bottom right) are then normalised to account for the protein global social behaviour. Formally, the II values are weighted using the sociability index (S-index) [50], defined as

$$S_{P_i} = \frac{1}{2|\mathcal{P}|} \sum_{P_j \in \mathcal{P}} \left(II_{P_i,P_j} + II_{P_j,P_i}\right) \qquad (2)$$

where $\mathcal{P}$ is the ensemble of proteins, including $P_i$. The normalised interaction index NII between $P_1$ and $P_2$ is computed as a symmetrised ratio of interaction indices (see Materials and methods). Finally, the NII values are scaled between 0 and 1, and $NII_{P_1,P_2} = 1$ when $P_2$ is the protein predicted as interacting the most strongly with $P_1$ (Fig 1, matrix on the top right).
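The scoring chain described above (a product score minimised over conformations, a sociability index, then a normalisation) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CCD2PI implementation. The function names and toy numbers are hypothetical, the pair potential is assumed to be a positive multiplicative factor, and the S-index is taken to be the mean symmetrised interaction index over the ensemble. The exact normalised form is given in Materials and methods and is not reproduced in this section, so the squared symmetrised index divided by the product of the two S-indices, followed by a row-wise rescaling, is only a placeholder.

```python
def interaction_index(conformations, use_pair_potential=True):
    """Best (most negative) product score over docking conformations.

    `conformations` is an iterable of (fir, energy, pair_potential) tuples,
    with fir in [0, 1] and energy negative. ASSUMPTION: the pair potential
    is a positive factor where larger means more native-like.
    """
    scores = []
    for fir, energy, pair_potential in conformations:
        score = fir * energy
        if use_pair_potential:
            score *= pair_potential
        scores.append(score)
    return min(scores)


def sociability_index(ii):
    """Mean symmetrised interaction index of each protein (assumed form)."""
    n = len(ii)
    return [sum(ii[i][j] + ii[j][i] for j in range(n)) / (2 * n) for i in range(n)]


def normalised_interaction_index(ii):
    """Placeholder normalisation: NOT the formula from Materials and methods.

    Symmetrised II squared, divided by S_i * S_j, then each row is divided
    by its maximum so that the top-ranked partner of each protein gets 1.
    Assumes no S-index is zero.
    """
    n = len(ii)
    s = sociability_index(ii)
    raw = [[((ii[i][j] + ii[j][i]) / 2) ** 2 / (s[i] * s[j]) for j in range(n)]
           for i in range(n)]
    return [[value / max(row) for value in row] for row in raw]


# Hypothetical conformations for one pair: (overlap, energy, pair potential).
confs = [(0.9, -10.0, 1.2), (0.2, -30.0, 1.0), (1.0, -5.0, 0.5)]
best = interaction_index(confs)

# Hypothetical 3-protein II matrix: P0 and P1 bind each other specifically,
# P2 binds everything moderately well (a highly sociable protein).
ii = [[-1.0, -8.0, -6.0],
      [-7.0, -1.0, -6.0],
      [-6.0, -6.0, -6.0]]
nii = normalised_interaction_index(ii)
```

In the toy matrix the second conformation has the lowest energy but a poor interface overlap, so the first conformation wins. After normalisation the sociable protein P2 is down-weighted and P0 and P1 select each other as top partner.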
CCD2PI accurately singles out cognate partners within specific functional classes

We assessed the discriminative power of CCD2PI on a set of 168 proteins forming 84 experimentally determined binary complexes (Protein-Protein Docking Benchmark v2, PPDBv2, see Materials and methods). Here, we place ourselves in a context where we seek to identify one “true” partner, annotated in the PPDBv2, for each protein from the benchmark. Over all 28 224 possible pairs, the cognate partners were singled out with an Area Under the Curve (AUC) of 0.67 (Fig 2A). On average, 3 putative partners were predicted with an NII score above 0.8, and about 10 above 0.6, for each given protein (Fig 2C and S1 Fig). Hence, CCD2PI assigns high interaction strengths to a relatively small number of pairs, compared to the enormous number of potential pairs. In this respect, the contribution of the normalisation stands out as instrumental (S2A and S2B Fig, compare the number of dark spots between the II and NII matrices). By lowering the interaction strengths computed for highly sociable proteins, it eliminates most of the “incorrect” partners. Given a protein, only the putative partners binding favourably to it, with a high II score, and in a specific manner, as indicated by a low S-index, stand out after the normalisation. This effect is illustrated by S3 Fig on the human GTPase-activating protein p120GAP and gonadotrophin.

Fig 2. Predictive performance on the PPDBv2. (A) AUC values computed for the whole dataset and for the different functional classes. For each protein, we consider one “true” cognate partner, defined from the PPDBv2 annotations. The results obtained with CCD2PI are indicated by the blue curve. For comparison, we also show the results reported in [53] in purple.
The areas in grey tones give the discriminative power reached when exploiting the knowledge of the experimental interfaces, using either our default parameters (in light grey) or parameters optimized for such interfaces (in dark grey, see also Materials and methods). The number of proteins in each subset is indicated in parentheses. (B) Proportion of proteins with at least one known partner found in the top 20% of CCD2PI predictions, for each subset. The known partners are defined from the PPDBv2 annotations (in blue) or are inferred from complex PDB structures involving the proteins from the set or their close homologs, sharing more than 90% (in dark red) or 70% (in orange) sequence identity. The grey bars give baseline expected values based on the number of known partners (see Materials and methods). (C) NII matrices computed by CCD2PI. The proteins are ordered on the x-axis such that the receptors (e.g. antibodies) appear first, and then the ligands (e.g. antigens). They are ordered on the y-axis such that the cognate pairs annotated in PPDBv2 are located on the diagonal. The orange tones highlight the experimentally known interacting pairs (annotated in the PPDBv2 and transferred by homology). AA: antibody-antigen. ABA: bound antibody-antigen. EI: enzyme-inhibitor. ER: enzyme with regulatory or accessory chain. ES: enzyme-substrate. OG: other-with-G-proteins. OR: other-with-receptor. OX: others. https://doi.org/10.1371/journal.pcbi.1009825.g002

The docking energy and the pair potential in Eq 1 (II formula) will favour the protein pairs whose RIs have a high physico-chemical and shape complementarity. Consistently, we observed that the RIs of the proteins predicted as plausible partners for a given protein share some common 3D physico-chemical patterns.
For instance, we can clearly identify a pattern of positively charged residues common to the RIs of the “incorrect” top 5 predicted partners for the human GTPase-activating protein p120GAP (1WQ1_l) and the RI of its cognate partner H-RAS, ranked at the 6th position (S3A Fig). In the case of the human gonadotrophin (1QFW_l), the RI of its cognate antibody, ranked 13th, displays an enrichment in negatively charged and aromatic residues, also observed for the RIs of the “incorrect” top 5 predicted partners (S3B Fig). We further assessed CCD2PI’s ability to identify the PPDBv2 cognate partners among proteins coming from the same functional class (Fig 2A, blue curve). The partnerships between bound antibodies and their antigens (ABA), between enzymes and their inhibitors, substrates, or regulators (EI, ES, ER) and between the other proteins and their receptors (OR) are particularly well detected (AUC>0.75). By contrast, the subset regrouping everything that could not be classified elsewhere (others, OX) is the most difficult to deal with. This subset likely contains proteins involved in signalling pathways and establishing transient interactions through modified sites, such as phosphorylated sites. As a consequence, correctly predicting their interfaces may be particularly challenging. Conformational changes occurring upon binding seem to play a role as the antibody-antigen cognate pairs are better detected when the antibodies are bound (Fig 2A, compare AA and ABA). The AUC values achieved by CCD2PI are systematically and significantly better than those computed with our previous pipeline (Fig 2A, compare the blue and purple curves), or similar in the case of the other-with-G-protein class (OG). Replacing the predicted RIs by the interfaces extracted from the PDB complex structures, which can be seen as perfect predictions, leads to increased AUC values for almost all classes (Fig 2A, areas in grey tones, and S2C and S2D Fig). 
This suggests that proteins competing for the same region at the protein surface do not target exactly the same set of residues. Knowing exactly which residues are involved in an interaction greatly helps in the identification of the partner. Of course, this perfect knowledge is generally inaccessible in a fully predictive context. In fact, the predicted interfaces might give a more realistic view on protein surface usage since they tend to better match interacting regions [23], defined from several experimental structures and representing the interface variability induced by molecular flexibility and multi-partner binding. Noticeably, the advantage of experimental over predicted RIs reduces or even cancels out for the small subsets (<15 proteins, ER, ES and OR). This suggests that approximations in the definition of the interfaces do not influence partner identification when few proteins are considered.

The interaction strengths predicted by CCD2PI reveal the multiplicity of protein interactions

To estimate the agreement between the interaction strengths predicted by CCD2PI and experimental data, we extended the set of “true” partners by homology transfer. Specifically, we looked in the PDB for 3D structures of complexes involving the proteins from PPDBv2 or their close homologs (see Materials and methods). We considered that a structurally characterised interaction found for $P'_1$ and $P'_2$, sharing a high sequence similarity with $P_1$ and $P_2$, respectively, was a strong indicator of the possibility for $P_1$ and $P_2$ to interact with each other. We identified 585 interacting pairs from homologs sharing more than 90% sequence identity with the proteins from PPDBv2, and 1 834 at the 70% sequence identity level (Fig 2C, cells colored in orange). These high levels of sequence similarity ensure a high confidence in the newly detected interactions, although homology transfer per se does not guarantee they are functional in the cell.
We observed the biggest increase in the number of partners for the antibodies (Fig 2C, S4A, S4B and S4C Fig). Some of the homology-transferred partners are direct competitors of the cognate partners annotated in PPDBv2 as they target the same region at the protein surface. Depending on the approximations in the predicted RIs, the former may be more favoured than the latter by CCD2PI. A few examples of homology-transferred partners better ranked than the PPDBv2-annotated partners are shown in S5 Fig. Overall, the probability of finding at least one “true” partner in the top 20% predictions is almost systematically increased when extending the set of positives (Fig 2B). For instance, 71% (27 out of 38) of the proteins from the EI subset have at least one partner inferred at more than 70% sequence identity ranked in the top 7. Moreover, the homology-transferred interactions tend to populate the regions of the matrices displaying high interaction strengths (Fig 2C and S4D Fig). For instance, CCD2PI predictions suggest that antigens tend to avoid each other much more than antibodies, and indeed many more homology-transferred interactions are found among antibodies than among antigens (AA and ABA). A similar trend is also observed for the enzyme-regulator (ER), enzyme-substrate (ES) and other-with-G-protein (OG) subsets (Fig 2C and S4D Fig). We observe more predicted and experimental regulator-regulator and substrate-substrate interactions than enzyme-enzyme interactions, and more other-other interactions than interactions among G proteins.

Small approximations in the reference interfaces may significantly impact partner identification

We further characterised the relationship between the ability of singling out cognate partners and the resemblance between the predicted and the experimental interfaces. The average F1-scores of the predicted interfaces range between 0.37 and 0.58 (Fig 3E).
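For readers less familiar with the interface quality measures used here, the following minimal Python sketch computes the positive predictive value (PPV), the sensitivity and the F1-score of a predicted interface against an experimental one, both treated as sets of residue identifiers. The function name and the residue numbers are hypothetical, and the standard definitions are assumed: PPV is the fraction of predicted residues that are correct, sensitivity is the fraction of experimental residues recovered, and F1 is their harmonic mean.

```python
def interface_quality(predicted, experimental):
    """Return (ppv, sensitivity, f1) for two sets of residue identifiers."""
    true_positives = len(predicted & experimental)
    ppv = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(experimental) if experimental else 0.0
    if true_positives == 0:
        # No shared residue: the harmonic mean is defined as zero.
        return ppv, sensitivity, 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return ppv, sensitivity, f1


# Hypothetical residue numbers: 3 of the 5 predicted residues are correct,
# and 3 of the 6 experimental residues are recovered.
predicted_ri = {10, 11, 12, 13, 14}
experimental_ri = {12, 13, 14, 15, 16, 17}
ppv, sensitivity, f1 = interface_quality(predicted_ri, experimental_ri)
```

Here the PPV is 0.6, the sensitivity is 0.5 and the F1-score is 6/11, about 0.55. An interface shifted away from the experimental one loses on both PPV and sensitivity, which is why the F1-score is a convenient single number for tracking such deviations.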
The strategy leading to the best AUC values for partner discrimination, namely SC-dockSeed-mix, gives the most accurate predicted interfaces overall (Fig 3E, 3F and 3G, ALL). It is also significantly more precise than the other strategies in the detection of the antibody-antigen interfaces (Fig 3E, 3F and 3G, AA and ABA). Looking across the different classes, a direct correlation between the quality of the predicted interfaces and the discriminative power of the approach is not obvious a priori. In particular, the three subsets (ER, ES and OR) for which predicted RIs lead to AUCs as good as those obtained with experimental RIs (Fig 2A) do not stand out for the quality of their predicted interfaces (Fig 3E, 3F and 3G). This confirms that when dealing with few proteins (<15), working with approximate interfaces does not hamper the identification of the cognate partners. However, if we disregard these subsets, then we find that the ability to detect the cognate pairs is highly correlated with the F1-score and the precision of the predicted interfaces (S8 Fig). The Pearson correlation coefficient is 0.86 (resp. 0.90) between the AUC values and the F1-scores (resp. positive predictive values, PPV) computed for SC-dockSeed-mix. Focusing on the 16 proteins for which the F1-score is very low (<0.2), we found that replacing the predicted interfaces by the experimental ones largely improves the ability to single out the cognate partner in half of the cases (S9 Fig). Nevertheless, in the remaining half, improving interface quality brings little gain to partner identification, or even has a deleterious impact. In five cases, the cognate partner is even identified in the top 20% despite the low quality of the predicted RI. These results reveal the existence of protein surface regions onto which cognate partners bind more favourably than non-interactors, although they have not been experimentally characterised as directly involved in the interaction.
We hypothesise that these regions might correspond to alternative binding modes with the cognate partner.

To investigate more precisely the sensitivity of partner discrimination with respect to approximations in the RIs, we generated shifted decoys from the experimental interfaces. For each interface in the dataset, we moved between 10 and 100% of its residues, by increments of 10% (see Materials and methods). This allowed us to control the deviation of our RIs with respect to the experimentally known interfaces of the cognate interactions. We observed that the AUC computed for partner identification decreases as the shifted decoys share fewer and fewer residues with the experimental interfaces (Fig 4). The only notable exception is the smallest class, namely ER, which displays a chaotic behaviour. The two other smallest classes, ES and OR, also show some chaotic variations, to a lesser extent. On the whole dataset, the AUC drops by 0.12 when the interfaces are shifted by 10%, corresponding to an F1-score of 0.9. A similar or even bigger gap is observed for all subsets comprising more than 15 proteins, except the enzyme-inhibitor subset (EI). For the whole dataset, the two antibody-antigen subsets (AA and ABA) and the others subset (OX), we identify cognate partners with an AUC lower than 0.75 when using shifted decoys that still match the experimental interfaces very well (F1-score >0.8). This shows that many competing proteins are able to bind favourably to almost the same protein surface region as the cognate partner. Compared to the shifted interfaces, our predicted interfaces allow reaching a similar or better partner discrimination for all classes but ER.

Fig 4. Sensitivity of partner identification to approximations in the reference interfaces. The RIs were obtained by gradually shifting the experimental interfaces (see Materials and methods).
On each plot, we show 10 boxes corresponding to 10 different shift magnitudes. Each box comprises 10 AUC values obtained from 10 random generations of shifts in interfaces at a given amplitude. The values on the x-axis give the average F1-scores computed for these shifted interfaces. The red dot and the blue triangle indicate the performance achieved using the experimental interfaces and the interfaces predicted by SC-dockSeed-mix as RIs, respectively. To compute the AUCs, we used the parameters identified as the best ones when using the experimental interfaces as RIs, namely a distance threshold of 6 Å, the MAXDo docking energy, and without CIPS. https://doi.org/10.1371/journal.pcbi.1009825.g004

Accounting for protein surface multiple usage

Next, we assessed CCD2PI on an independent set of 62 proteins for which we defined some interacting regions accounting for the multiple usage of a protein surface by several partners and for molecular flexibility [23]. More precisely, we obtained each interacting region by merging overlapping interacting sites detected in the biological assemblies (from the PDB) involving the protein itself or a close homolog, as described in [23]. These regions can be seen as binding “platforms” for potentially very different partners. In this experiment, we used predicted interfaces as RIs, and all of them match well the experimentally known interacting regions (F1-score>0.6). CCD2PI identifies at least one known partner in the top 3 predictions (3/62 ≈ 5%) for about a third of the proteins (Fig 5A, inset). For instance, CCD2PI identifies the Bcl-2-like protein 11 (2nl9:B), known partner of the Mcl-1 protein (2nl9:A), at the second position. It ranks first a tropomyosin construct (2z5h:B) that folds into an α-helical shape similar to that of the known partner. For trypsin-3 (2r9p:A), five proteins are predicted as better binders than its known inhibitor (2r9p:E).
An extreme example is given by the heme oxygenase (1iw0:A), whose interaction with itself is very poorly ranked. This may be explained by the fact that the homodimer is asymmetrical, with two different interaction sites for the two copies, one of them not being taken into account by CCD2PI.

Fig 5. Assessment of CCD2PI on an independent dataset, and comparison with a sequence-based deep learning method. (A) The main barplot gives the rank(s) determined by CCD2PI for the known partner(s) of each protein from the independent dataset. The partners are inferred from the complex PDB structures involving the proteins from the set or their close homologs, sharing more than 90% sequence identity (see S10 Fig for the 70% sequence identity level). There are up to 4 partners for each protein, and they can be distinguished by the blue tones. The experimental structures of 3 cognate complexes are depicted as cartoons, with the query protein in dark grey and the best-ranked known partner in dark blue. For 2nl9:A and 2r9p:A, we also show, in other colors, the “incorrect” partners that obtained better ranks than the best-ranked known partner. For the complex made of two copies of 1iw0:A, the position and orientation of the copies were taken from the PDB structure 1wzg. The surfaces represent the RIs. The barplot in inset gives the proportion of proteins with at least one known partner in the top x% predictions. The grey bars give baseline values expected based on the number of known partners (see Materials and methods). (B) Comparison with DPPI. Best known partner ranks obtained from CCD2PI (on top) and DPPI (at the bottom). We focus on the subset of proteins for which the ranks provided by CCD2PI are better (see S11 Fig for the full distributions). https://doi.org/10.1371/journal.pcbi.1009825.g005

[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009825
Licensed under Creative Commons Attribution (CC BY 4.0). URL: https://creativecommons.org/licenses/by/4.0/