(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright ------------ Global diversity and balancing selection of 23 leading Plasmodium falciparum candidate vaccine antigens ['Myo T. Naung', 'Population Health', 'Immunity Division', 'Walter', 'Eliza Hall Institute Of Medical Research', 'Parkville', 'Victoria', 'Department Of Medical Biology', 'University Of Melbourne', 'Carlton'] Date: 2022-02 WGS data from a total of 2661 P. falciparum isolates from 16 countries from the Asia-Pacific and Africa were processed to obtain gene sequences for population genetic analysis. The pipeline filters the dataset to remove high complexity infections as determined by F ws [24]. This step aimed to ensure the correct assignment of variants to a single major clone within the sample (S2 Fig). We found a large proportion (greater than 30%) of field isolates from African regions predicted to have more than two clones, especially in Malawi and Ghana, likely reflecting high transmission intensity in those areas. However, the number of genomes in the final dataset (n = 1499) remained high, even after removing these samples. From these genomes, gene sequences for 23 antigens (Table 1) were obtained. This included a mean of 1374 (1079–1499) sequences per antigen, and 107 (26–433) sequences per country (Table 1, Table A in S1 Text). Nigeria was included in overall genetic diversity analyses but removed from further analyses due to very low sample size (n = 4). The use of full-length or full-domain Tajima’s D values can mask variable patterns within a gene when there are discrete stretches of positive D values linked by regions with neutral or highly negative D values [ 27 ]. Further analyses were therefore performed on these antigens for full-length or their functionally important domains. Genetic diversity and evidence of diversifying selection on the antigens was not related to their subcellular localization [ 26 ] ( Fig 1 ). However, the diversity of some antigens (e.g. CSP) varies by geographic origin indicating that transmission intensity or other location-specific factors may play a role as previously shown in Barry et al. 2009 [ 15 ]. The lines from ridgeline plots indicate the range and distribution of respective diversity parameters for each antigen across different parasite populations (countries). Tramp, and cyrpa were conserved across all populations, whilst trap, ama1, eba175, and celtos were diverse and showed evidence of diversifying selection across all parasite populations. Overall genetic diversity analyses based on Nei’s nucleotide diversity (π), haplotype diversity and Tajima’s D analyses indicated high proportions of SNPs that were non-synonymous (dN) and regions of balancing selection in most antigens [ 25 ] (Table A in S1 Text , Fig 1 ). In addition, the antigen subdomains, csp-C-terminal, eba175 -RII and pfs48/45 -6C were as diverse as the respective full-length gene ( Fig 1 , Table A in S1 Text ). Different patterns of diversity across parasite populations (defined as all samples from a particular country) were observed for the antigens. For instance, trap had high haplotype diversity yet low to moderate nucleotide diversity indicating that many haplotypes were the result of very rare polymorphisms ( Fig 1 , Table A in S1 text ). Whereas csp had high haplotype diversity and high (albeit variable amongst populations) nucleotide diversity, which might be related to differences in transmission intensity in different countries ( Fig 1 ). In addition, celtos (a gametocyte antigen) also had high haplotype diversity but moderate to high (and variable) nucleotide diversity, which may indicate adaptation to mosquito vectors from different geographical regions ( Fig 1 , Table A in S1 Text ). Moderate haplotype diversity, but low nucleotide diversity for rh5 indicates lower diversity among distributed rh5 haplotypes ( Fig 1 , Table A in S1 Text ). Of all analysed antigens, the 3D7 haplotype was dominated only within CyRPA, MSP1-19, MSP3, MSP4, Pfs48/45, RALP1, RH5, RIPR, STARP, TRAMP, and TREP antigens ( Fig 2 , Tables A and B in S1 Text ). For most antigens with multiple balancing selection hotspots such as AMA1, CSP (C-term), EBA175 (RII), CelTOS, the most common haplotypes were only remotely related to 3D7 vaccine haplotype (Table B in S1 Text ). No 3D7 haplotypes were observed in the dataset for full length EBA175, MSP1 and SERA8 (Table A in S1 Text , Fig 2 ). This suggests that a 3D7-based malaria vaccine may be effective against only a small proportion of natural parasite populations. Templeton, Crandall, and Sing (TCS) network summarizing the global diversity of selected antigens using only common haplotypes (> 0.5% of all haplotypes) based on non-synonymous SNPs for full length respective domain of each antigen. Circles represent unique haplotypes, and circles are scaled according to the prevalence of the observed haplotypes. The number of non-synonymous SNP differences between each haplotype was shown by the number of hatch marks on the branches. The vaccine strain 3D7 (arrowed) was included for reference. Haplotypes were often found in many countries (i.e. multiple colours for each node) suggesting minimal geographic variation ( Fig 2 ). However, geographical clustering of haplotypes was observed for CTRP, TRAP (ectodomain), CSP (C-term), GLURP, EBA175 (RII), CelTOS, and Pfs48/45. Haplotypes from the same geographical region (e.g. Africa) were more closely related than those from different regions (e.g. Africa versus Asia, Fig 2 ). Most of these antigens are pre-erythrocytic or gametocyte antigens, consistent with previous observations [ 15 ] and may reflect adaptation of the parasite to local human and mosquito hosts. We identified the relationships amongst non-synonymous SNP haplotypes using network analysis for 17 of the 23 antigens. Six antigens were excluded because of lack of diversity. Many rare haplotypes are present within African populations, and relatively common haplotypes within Asia-Pacific countries, most likely owing to higher transmission in Africa. Generally, for antigens under balancing selection such as CelTOS, TRAP, AMA1, EBA175, MSP1 and GLURP ( Fig 1 ), high haplotype diversity (Hd > 0.97) was ubiquitous amongst countries with no predominant haplotype (Table B in S1 Text ). The majority of these are blood stage antigens suggesting they are dominant targets of natural host immunity. Relative solvent accessibility (RSA) was calculated for all residues for 23 antigens. RSA was calculated using neural network based NetSurfP1.1 program or DSSP program respectively based on the presence of known PDB or homology-modelled structures [ 30 , 31 ]. Polymorphic residues from more than 1000 sequences regardless of minor allele frequency (MAF) were included in the analysis. Box and whisker plots show the median (blue line), and interquartile range (blue box) of RSA values for each residue from respective group. The violin plot (which uses Kernel Density Estimation to compute an empirical probability distribution) shows a smooth distribution of RSA values for most of the calculated group. RSA scores for individual antigens as well as for the combination of all antigens were calculated. Only significant p-values are shown. Relative solvent accessibility (RSA) analysis gives an indication of the regions of a protein that are exposed to the extracellular environment, and thus may be targeted by immune responses [ 16 , 28 , 29 ]. The RSA of all 23 antigens was compared for 792 polymorphic and 14,789 conserved residues. A higher RSA score suggests presence of a particular amino acid on the surface and thus making it able to interact with the host environment. Overall, when combining all antigens, most of the polymorphic residues had significantly higher mean RSA scores of 0.46 than conserved residues with a mean RSA score of 0.38 (p value < 2.0 * 10 −16 , t-test) ( Fig 3 ). When individual antigens were analysed, only AMA1, CSP, EBA175, MSP1, MSP4, SERA8, and TRAP had significantly higher RSA scores than that of their residues without underlying polymorphisms (p < 0.05, t-test). (a) Available domain and epitope information for AMA1, EBA175 (RII), RH5, and CSP (C-terminal). (b) Site specific diversity measure for CelTOS, Pfs48/45 (6C domain), TRAP (ectodomain), CSP (C-terminal), AMA1, EBA175 (RII) and RH5. Normalized Shannon Entropy was calculated per residue for these antigens (unavailable residues were coloured in white). Higher entropy values indicate higher diversity across all populations for a particular residue. Residues from the CSP Th2R epitope and AMA1 C1L loop have the highest entropy values. Very low entropy values across all populations were observed for SERA5, and CyRPA. Shannon Entropy is a measure of the variability of individual amino acid residues within a protein. We calculated the normalized Shannon Entropy using all available sequences for each antigen that had an available 3D protein structure ( Table 2 ) to investigate the potential functional impact of site-specific diversity. Except for CyRPA, SERA5 and MSP1-19, the normalized Shannon Entropy scores were high (>0.2) among the residues situated at host-receptor binding interface, within previously described immunological epitopes or under balancing selection ( Fig 4 , Table 2 ). For instance, residues situated at the surface exposed c1L loop of AMA1, and residues situated at the dimerization interface of EBA175 RII had high normalized Shannon Entropy scores. Mutations at functionally important interfaces may enable parasites to escape from host immune responses. Evidence of balancing selection at functionally important interfaces for TRAP, AMA1, RIPR for all geographic populations but identification of geographically variable balancing selection hotspots for EBA175 (RII), RH5, CSP, CelTOS, and MSP1-19 An ideal malaria vaccine should be immunogenic and effective against naturally circulating strains from worldwide populations, but it is not feasible to include all haplotypes for each antigen we have described above. This is especially the case for highly diverse antigens such as CSP, TRAP, AMA1, EBA175, MSP1 (Table A in S1 Text). Therefore, for each antigen, it is important to identify the most immunologically relevant polymorphisms and gene regions, that are targets of host immunity and could influence vaccine efficacy. Typically, the Tajima’s D statistic has been calculated as a single metric encompassing the entire gene or domain of interest, or more recently, using a sliding window approach to identify hotspots of balancing selection along the length of a gene. Functional antibodies are critical for protective immunity from naturally acquired infection [6,32–37] however, more than 90% of functional antibody epitopes are discontinuous (non-linear) epitopes [38,39]. Thus, distant gene segments may be brought into proximity in three-dimensional (3D) protein structures. Consideration of structural features in 3D space are thus important, especially for antigens with highly surface exposed polymorphic residues such as AMA1, CSP, EBA175, and TRAP. Previous studies have suggested that polymorphic residues located on the surface of protein evolved to escape host immune responses [16,27,40]. We therefore examined selective pressure over the 3D protein structures of vaccine candidate antigens in the following analyses where possible. A spatially derived approach to Tajima’s D (D*) was applied to available 3D structures (predicted or experimentally defined) for a subset of parasite populations covering all major endemic regions, including Southeast Asia, Africa, and Oceania (PNG). Structures included well-studied antigens like CSP (C-terminal), TRAP (ectodomain), AMA1, EBA175 (RII), MSP1, RH5, CyRPA, SERA5, CelTOS, and MSP1-19. We observed moderate to high D* scores (1.0–2.0), high scores (2.1–3.0), and extremely high scores (> 3.1) within most of these antigens, which is an indication of balancing selection focused on specific regions of the protein. Nearly neutral Tajima’s D values were found throughout the 3D structure of CyRPA, and SERA5 (C-terminal) for all parasite populations (S4 Fig). In general, we found a dichotomous pattern of balancing selection on some antigens for countries in the Asia-Pacific versus African populations. The TRAP ectodomain (PDB Code: 4HQF.A) which is comprised of tandem von Willebrand factor A (VWA) and thrombospondin type I repeat (TSR) domains (AA residues: E41 –K240, 3D7 sequence) were included in our analyses [41]. In all populations, the suspected heparin binding interface of TRAP [42] has moderately higher Tajima’s D* scores than the opposite interface of the protein, which acts like the silent face of TRAP (ectodomain) (Fig 5A). Hence, residues Y89 –I100, and I115 –D146 showed moderate spatially derived D* scores (1.0–1.2) in all observed countries. Residues S123, T124, and N125 form a minor protrusion on the surface of TRAP, and residues R130, R141, K142 are thought to be mediated in heparin binding according to previous study [42]. However, within the active face of TRAP, the intensity of balancing selection (D* scores) appears to be geographically variable. Minimal nucleotide diversity variation was observed amongst geographical regions (Fig 1). PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 5. Antigens displaying geographically conserved balancing selection at functionally important interfaces. Antigens were not normalised based on their sizes. a. Spatially derived Tajima’s D (D*) score calculation for TRAP (Ectodomain) with incorporation of protein structural information using a 15Å window. TRAP (ectodomain) (PDB Code: 4HQF.A) was used. The structure was coloured according to D* scores mapped to each residue, and undefined D* are shown in white. Residues S123, R130 and R140 are involved in mediating heparin binding. Sample sizes: Malawi (n = 133), Ghana (n = 238), Cambodia (n = 430), and PNG (n = 112). b. Spatially derived Tajima’s D (D*) calculations for AMA1 with incorporation of protein structural information using 15 Å window. The manually modelled structure of AMA1 was used based on published results [27]. The structure was coloured according to D* scores mapped to each residue with undefined D* scores were shown in white. The DI, DII, DIII, and surface exposed c1L loop are indicated. Sample sizes: Malawi (n = 139), Ghana (n = 243), Cambodia (n = 433), and PNG (n = 112). c. Polymorphism and evidence of selection for ripr. Tajima’s D statistic calculated for disordered regions of RIPR in samples from Cambodia, PNG, Malawi, and Ghana. Tajima’s D is calculated with a sliding window approach (a window size of 50 bp and a step size of 5 bp). Nucleotide positions based on coding region are shown in the x-axis. Sample size for each respective population are as follows: Malawi (n = 137), Ghana (n = 246), Cambodia (n = 428), and PNG (n = 111). https://doi.org/10.1371/journal.pcbi.1009801.g005 High D* scores (2.0–2.4) were found in all populations at the C1-L cluster [41,42] of AMA1 Domain I (AA residue: T194 –D212, 3D7 sequence), known to be associated with immune escape [16,43–45] (Fig 5B). Most of these residues were identified as discontinuous epitopes by the IEDB epitope prediction resource [46]. As expected, the entire AMA1 interface with hydrophobic binding cleft and RON2 interacting sites shows high balancing selection [47]. Similar to previous spatially derived analysis [16], the region on the border of DII and DIII domains (AA residues: P303 –F312, S432 –Y446, I479 –K508, 3D7 sequence) appears to have the highest D* scores (> 3) in all populations. This region was previously predicted as the surface exposed face of DII/DIII and suggests that these residues might be the targets of protective host immune responses within AMA1. A monoclonal antibody (1E10) against AMA1 DIII functions synergistically with antibodies against distant parts of AMA1 to inhibit merozoite growth [48]. This indicates that DIII of AMA1 plays a significant role as a target of functional antibody responses against P. falciparum in the context of natural infections. Consistent with previous findings [16,27,49], the spatially derived Tajima’s D values are unevenly distributed with high D* values mostly exclusive to one face of the AMA1 molecule, and close to zero on the opposing face of AMA1, which has been previously been described as the "silent face" [49,50] except in PNG where moderately high D* scores (1.9) were found at AA residues V524 –Y532. This is in line with the previous hypothesis that silent face of AMA1 has minimal exposure to the immune system [45,49]. The high level of spatial Tajima’s D* and nucleotide diversity was similar amongst populations for AMA1 suggesting that this antigen experiences similar selective pressures across different populations (Figs 5B, and S5). Ripr was analysed in the context of the linear nucleotide sequence due to the poor predicted structure. The N-terminal region of RIPR contains 2 EGF-like domains, while the C-terminal region contains 8 EGF-like domains [51]. We found some degree of balancing selection (moderately high D values = 1.5) and high nucleotide diversity proximal to C-terminal EGF 7–10 region, located away from the CyRPA:RIPR interface [51] (nucleotide residues based on coding region 2900–2980, 3D7 sequence) of ripr at most of the populations (Fig 5C). These results were aligned with the anticipated RIPR-specific monoclonal antibody binding sites identified in a previous study [52]. This suggests the C-terminal region of RIPR is a target of host immunity in most populations. While the crystal structure of EBA-175-RII has been solved [53], we used a 3D7-based ModPipe homology structure for analysis, as this model includes a number of functionally important residues that were not resolved in the experimental structure [16]. For D* analysis on EBA-175-RII, a large region of the F1 domain which predominantly consisted of AA residues E226—L294, I312 –K324, and W377 –I400 was under balancing selection (i.e. high D* scores of 2.8–3.0) in all parasite populations. This cysteine-rich region is also involved in the dimerization interface formation (helix linker and disulphide bridges)[53] between two molecules of EBA-175-RII as it binds to human receptor glycophorin A. During dimerization, this region also makes contact with another cysteine rich F2 domain of the other dimer pair in a ‘handshake’ interaction [53]. However, slight variations between African and Asia-Pacific populations can also be found within the site from F2 domains, which are predominantly comprised of AA residues C476–C488 and Y710 –F722 where high D* scores (1.8–2.0) were observed within most Asia-Pacific countries, but not African countries (D* scores < 0.4) (Fig 6A). The F2 domain makes most of the contact with glycan [54] and residues C476–C488 are a part of F2 β-finger domain that is also a target of inhibitory antibodies, R215 and R217 [54,55]. The linker region was found to be conserved within all populations (Fig 6A). However, the limited nucleotide diversity was observed amongst geographical regions. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 6. Antigens displaying geographically variable balancing selection hotspots. Antigens were not normalised based on their sizes. a. Spatially derived Tajima’s D (D*) for EBA175 with incorporation of protein structural information using 15 Å window. 3D7-based ModPipe model of EBA175 (RII) based on 1ZRO template was used. Structure was coloured according to D* scores mapped to each residue with undefined D* were shown in white. The highlighted region (in circle) shows different D* scores between Asia-Pacific and African countries. Sample sizes: Malawi (n = 136), Ghana (n = 237), Cambodia (n = 432), and PNG (n = 112). b. Spatially derived Tajima’s D (D*) calculation for countries from Asia-Pacific and Africa for RH5 with incorporation of protein structural information using 15Å window. Cryo-EM structure of RH5-CyRPA complex (PDB code: 6MPV.B) was used. Structure was coloured according to D* scores mapped to each residue with undefined D* and CyRPA were shown in white. The circle indicates Basigin-binding sites where different D* scores were observed between Asia-Pacific and African countries. Sample sizes: Malawi (n = 142), Ghana (n = 249), Cambodia (n = 433), and PNG (n = 112). c. Spatially derived Tajima’s D (D*) calculations for populations from Asia-Pacific and African regions for CSP (C-terminal) with incorporation of protein structural information using 15 Å window. The crystal structure of the thrombospondin receptor (TSR) domain of CSP [58] (PDB code: 3VDJ, chain A, AA residues: Y306—H376, 3D7 sequence), which consists of Th2R and Th3R (T-cell epitopes) was used [59]. Structure was coloured according to D* scores mapped to each residue with undefined D* were shown in white. The highlighted region (in circle) shows different D* scores between Asia-Pacific and African countries. Sample sizes: Malawi (n = 135), Ghana (n = 223), Cambodia (n = 431), and PNG (n = 111). d. Tajima’s D (D*) calculation for geographic area or countries from the Asia-Pacific and African regions for MSP1 19 with incorporation of protein structural information using 15 Å window. The structured region of MSP1 (MSP1-19, AA residue: N1607—S1699) is based on the ModPipe homology model using template (PDB code: 1OB1) [60]. The structure was coloured according to D* scores mapped to each residue with undefined D* were shown in white. The circle indicates variable balancing selection hotspots between Asia-Pacific and African countries. Sample sizes: Malawi (n = 101), Ghana (n = 183), Cambodia (n = 270), and PNG (n = 72). https://doi.org/10.1371/journal.pcbi.1009801.g006 The cryo-EM structure of full length RH5 (PDB Code: 6MPV, chain B) was used in our analysis [56]. Evidence of balancing selection was limited within African countries parasite populations based on the spatially derived Tajima’s D* scores. In contrast, within Asia-Pacific countries, moderate to high D* scores (1.5–2.0) were consistently observed at the Basigin (BSG or CD147) binding sites and appear to be under balancing selection (AA residues: S189 –D207, 3D7 sequence) (Fig 6B). Additionally, the residues near CyRPA binding sites which predominantly consist of residues I386 –F421 have moderate evidence of balancing selection (D* score of 1.5) within PNG population (Fig 6B). This suggests the BSG binding site may be a key target of host immunity and aligns with previous findings from Alanine et al. (2019) [57]. Non-synonymous AA polymorphisms at residues Y147 and H148 (nucleotide positions 439, and 442, 3D7 sequence) were under balancing selection in most Asia populations but appear to be under adaptative selection in some African populations according to the linear sliding window analysis with Tajima’s D values around 1.7 (S3 Fig). These residues did not have PDB coordinates (physical proximity to N-terminal disordered region), and therefore cannot be detected via spatially derived 3D analyses. Nucleotide diversity of RH5 was slightly higher in Asia-Pacific populations than African populations (S3 Fig). We observed limited evidence of balancing selection (near neutral) within the thrombospondin receptor domain of CSP [58,59] for Asia-Pacific populations (Cambodia and PNG). However, moderately high D* scores (1.2–1.3) were observed at Th2R residues (AA residues E310 –L327, 3D7 sequence) at Malawi and Ghana populations which suggests some evidence of balancing selection (Fig 6C). Similarly, nucleotide diversity of some of the residues comprising thrombospondin receptor domain of CSP were relatively low in Asia-Pacific populations compared to African populations (S7 Fig). A moderately high to high degree of balancing selection (D* scores of 1.0 and 2.9) for Th3R residues (AA residues G341 –I364) was also observed within African populations (Fig 6C). Moderately high D* scores (1.0–2.0) are found within residues Q1612 –F1625, and E1632 –C1647 from African populations (Malawi and Ghana), but not in Asia-Pacific populations (Fig 6D). Most of these residues were a part of conformational epitopes and involved in interaction with potent monoclonal antibody (G17.12) [60] and inter-domain interactions. In addition, nucleotide diversity of these residues is relatively high in African populations compared to Asia-Pacific parasite populations (S8 Fig). However, different part of MSP1-19 with residues P1651 –C1682 were under balancing selection in most of the Asia-Pacific populations (only shown here for Cambodia and PNG populations) (Fig 6D). Hotspots of balancing selection were found within N-terminal AA residues F87 and D93 –S104 (3D7 sequence) of CelTOS [61] for most of the populations with moderately high to high spatially derived Tajima’s D* scores (1.0–2.2) (S7A Fig). Within the Malawi population, we found high D* scores (1.8–2.0) at AA resides D131 –I138 (3D7 sequence), but limited balancing selection was found for CelTOS within PNG (D* scores < 0.4) (S7A Fig). Nucleotide diversity was also relatively low in PNG populations compared to others (S7B Fig). However, the presence and distribution of D* values were variable amongst geographic locations (S7A Fig). Limited functional information was available for CelTOS therefore the significance of these findings is not able to be postulated. [END] [1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009801 (C) Plos One. "Accelerating the publication of peer-reviewed science." Licensed under Creative Commons Attribution (CC BY 4.0) URL: https://creativecommons.org/licenses/by/4.0/ via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/