(C) PLOS One This story was originally published by PLOS One and is unaltered. . . . . . . . . . . TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile [1] ['Tianqi Yang', 'Department Of Pharmacology', 'Cancer Biology', 'Duke University School Of Medicine', 'Durham', 'North Carolina', 'United States Of America', 'Department Of Cell Biology', 'Ricardo Henao', 'Center For Applied Genomics'] Date: 2022-09 Determining transcriptional factor binding sites (TFBSs) is critical for understanding the molecular mechanisms regulating gene expression in different biological conditions. Biological assays designed to directly mapping TFBSs require large sample size and intensive resources. As an alternative, ATAC-seq assay is simple to conduct and provides genomic cleavage profiles that contain rich information for imputing TFBSs indirectly. Previous footprint-based tools are inheritably limited by the accuracy of their bias correction algorithms and the efficiency of their feature extraction models. Here we introduce TAMC ( T ranscriptional factor binding prediction from A TAC-seq profile at M otif-predicted binding sites using C onvolutional neural networks), a deep-learning approach for predicting motif-centric TF binding activity from paired-end ATAC-seq data. TAMC does not require bias correction during signal processing. By leveraging a one-dimensional convolutional neural network (1D-CNN) model, TAMC make predictions based on both footprint and non-footprint features at binding sites for each TF and outperforms existing footprinting tools in TFBS prediction particularly for ATAC-seq data with limited sequencing depth. Applications of deep learning models are rapidly gaining popularity in recent biological studies because of their efficiency in analyzing non-linear patterns from feature-rich data. In this study, we developed a deep learning method to predict transcription factor binding sites based on chromatin accessibility profiles. Compared to previous methods using scoring functions and classical machine learning algorithms, our method forgoes the need for bias correction during signal processing and significantly increases the efficiency in extracting features at transcription factor binding sites. In addition, we showed that our method outperforms previous methods particularly for chromatin accessibility data with shallow sequencing depth. In this study, we applied our method to prediction of changes in binding sites of a transcription factor, CTCF, during early embryonic development based on bulk chromatin accessibility profiles. We then discussed about the potential application of our method to transcription factor binding site prediction using single-cell chromatin accessibility profiles as well as possible strategies to further improve the performance of our method in the future. (A) Architecture of TAMC framework: three convolutional layers with 3 learnable filters of kernel sizes k = 3, 5, and 7, and ReLU activations, followed by max pooling, concatenation, and prediction of binding probability via a two fully connected layers with sigmoid activation. (B) Default TAMC input signal structure. >1Nr, ATAC-seq read fragments larger than one nucleosome size; ≤1Nr, ATAC-seq read fragments equal or smaller than one nucleosome size. (C) Representative tracks show strategy of labeling bound and unbound CTCF MPBSs in GM12878 cell type at FRMD4B gene locus. MPBS, motif-predicted binding sites. More recently, deep-learning models are rapidly gaining popularity in biological studies because of their efficiency in analyzing complex (non-linear) patterns from feature-rich data. Here, we introduced a new TFBS prediction tool named TAMC ( T ranscriptional factor binding prediction from A TAC-seq profile at M otif-predicted binding sites using C onvolutional neural networks). TAMC takes advantage of signal processing strategies in HINT-ATAC and TOBIAS except for the bias correction step to produce input signals that are then used to extract features of TFBSs using a 1D-convolutional neural network (1D-CNN) model ( Fig 1A ). By evaluating TAMC with different input signal configurations, we showed that TAMC does not require bias correction during signal processing and captures both footprint and non-footprint features of TFBSs efficiently. Importantly, TAMC models pretrained with multiple deeply sequenced ATAC-seq datasets significantly outperform HINT-ATAC and TOBIAS in TFBS prediction especially when using ATAC-seq data with limited sequencing depth. In our study, we have applied TAMC in predicting changes in binding sites of a specific TF, CTCF, during human zygotic genome activation (ZGA) using bulk ATAC-seq data. We also believe that TAMC is a competitive alternative method for TFBS prediction using single-cell ATAC-seq data. Several computational methods have been developed to investigate footprint patterns in chromatin cleavage profiles [ 10 – 19 ]. As chromatin cleavage events generated by cutting enzymes (e.g., DNase-I used in DNase-seq and Tn5 transposase used in ATAC-seq) are biased towards different sequences, previous tools initially designed for DNase-seq data often give poor predictions using ATAC-seq data [ 16 , 20 , 21 ]. By far, HINT-ATAC [ 16 ] and TOBIAS [ 17 ] are two representative footprinting tools specifically designed for ATAC-seq data, which has become the dominant data type for chromatin accessibility profile because of the simplicity of the assay itself. TOBIAS uses a simple footprint score (FPS) metric to characterize the footprint pattern at single-base resolution while HINT-ATAC uses a semi-supervised hidden Markov model (HMM) to predict footprint sites directly. Both tools significantly increase the accuracy in classifying bound/unbound sites from ATAC-seq data. On the other hand, the limitation is that they both require complex bias correction during signal processing because their models/algorithms are highly dependent on measurable footprint pattern to make predictions. Traditional computational TFBS prediction methods have been using position weight matrices (PWMs) of TF binding motifs against DNA sequence to predict TFBSs [ 4 , 5 ], yet these methods suffer from high false-positive rates (FDR) [ 6 ]. Recent studies have shown that more than 90% of TF binding events take place at open chromatin regions [ 7 ] that could be mapped by enzymatic cleavage assays such as DNase-I sequencing (DNase-seq) [ 8 ] and Assay for Transposase Accessible Chromatin sequencing (ATAC-seq) [ 7 ]. Notably, bound TFs hinder the activity of cutting enzymes and leave footprint sites characterized with lower cleavage signal frequency comparing to surrounding regions [ 9 ]. Therefore, the TF-bound and unbound sites can be theoretically distinguished by their footprint pattern. Transcription factors (TFs) are proteins that bind to conserved genomic sequence motifs and have functions in regulating gene expression [ 1 ]. Determining TF binding sites (TFBSs) is essential for deciphering the molecular mechanisms regulating gene expression across different biological conditions. Biological assays, such as ChIP-seq [ 2 ] and CUT&RUN [ 3 ], have been used as the standard experimental methods for mapping genome-wide interactions between TFs and chromatin. However, these experiments are resource-intensive and can measure only one TF at one time, which largely limits their applications in many situations. To address these limitations, computational methods, discussed below, have been proposed to impute TFBSs. Results TAMC overview TAMC takes a combination of footprint profile and genomic cleavage profile (signals and slopes) around 500bp from the center of potential TF binding sites predicted by their binding sequence motifs, or motif-predicted binding sites (MPBSs), as input signal (Fig 1B). The footprint scores are calculated using the TOBIAS footprint scores metric, and the genomic cleavage signals and slopes are calculated and normalized using HINT-ATAC input signal processing scripts with modifications at the bias correction step. The genomic cleavage signals and slopes are further separated into 8 channels by strand and size of ATAC-seq reads. This results in each input signal has 9 channels in total– 1 channel for footprint profile and 8 channels for cleavage profiles (Fig 1B). The 9-channel input signals are fed into a 1D-CNN module and the convolution is performed with filter kernels moving at 1 base position (bp) per step in the direction from -500bp to +500bp from the motif center. As the length of binding sites varies between TFs, we utilized kernels with different sizes (k = 3, 5 and 7) to extract features for small and large binding sites (Fig 1A). For each kernel size, we repeated convolution for three times and the obtained three feature maps were max-pooled to extract the most salient feature at each bp within the input region (Fig 1A). The max-pooled feature maps originate from different kernel sizes are then concatenated before being fed to the fully connected layers and a final sigmoid activation function to make predictions of TF binding probability (Fig 1A). The TAMC framework was trained and tested using published paired-end ATAC-seq data of three human cell types (GM12878, HepG2 and K562) (S1 Table). ChIP-seq data obtained from the same cell type as ATAC-seq data were used for labeling MPBSs as bound or unbound (Fig 1C). Because of the high false positive rate of TF motifs [6], a large proportion of MPBSs are not really bound by TFs (S1A Fig). To avoid biases due to training and evaluation with a disproportionally large unbound category, we randomly subsampled the bound and unbound MPBSs to have equal proportions for each TF before conforming training, validation, and testing datasets (S1B Fig). The resulting balanced datasets are large enough (more than 5000 labeled MPBSs for most TFs) to enable reliable model training and performance metrics robust to the random subsampling. The trained TAMC models were tested under two types of prediction settings: intra-data prediction (same ATAC-seq data for training and testing) and cross-data prediction (different ATAC-seq data sets for training and testing). In total, we trained and tested TAMC for 47 TFs with published ChIP-seq data for all three cell types in the ENCODE database (S2 Table). Area under the receiver operating curve (AUROC) was used to measure the trained models’ performance in classifying bound/unbound MPBSs. TAMC outperforms existing methods To evaluate TAMC performance, we compared TAMC predictions with predictions generated using two representative footprinting tools, namely, TOBIAS and HINT-ATAC. For HINT-ATAC, we utilized published models that were pre-trained on GM128782 ATAC-seq data for testing [16]. For TOBIAS, it has fixed parameters in its FPS metric and therefore does not require model training before testing [17]. We showed that TAMC gave the best classification of bound and unbound binding sites for most of the 47 TFs under both intra- and cross-data settings (Fig 2A). We further showed that TAMC models trained using multiple (GM12878 and HepG2) ATAC-seq datasets gave better cross-data predictions than models trained using a single (GM12878 or HepG2) ATAC-seq data (Fig 2B). In order to check the influence of sequencing depth on TAMC performance, we downsized the training and testing ATAC-seq datasets to different total numbers of high-quality non-mitochondrial aligned reads. The results showed that higher sequencing depth of the training datasets gave better cross-data classification performance for TAMC (Fig 2C). Compared to TOBIAS and HINT-ATAC, TAMC gave best cross-data classification when the sequencing depth of testing datasets is low; while for testing data with high sequencing depth, TOBIAS gave better classification than TMAC for certain TFs (Figs 2D and S2A). Interestingly, the classification performance of TOBIAS and TAMC were always crossed near the point where the training and testing ATAC-seq datasets have similar sequencing depth (Figs 2D and S2B–S2C). These results together suggested that TAMC outperforms TOBIAS and HINT-ATAC in cross-data TFBS prediction as long as the training ATAC-seq datasets were sequenced at higher depth than the testing ATAC-seq datasets. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 2. TAMC outperforms existed methods. (A) Heat maps compare intra-data (left) and cross-data (right) classification performance of TAMC with TOBIAS and HINT-ATAC. Test data are indicated above each heatmap, and TAMC models were trained using GM12878 ATAC-seq data with 150M sequencing depth in both heatmaps. (B) Line graph compares cross-data classification performance of TAMC models trained on ATAC-seq data of GM12878, HepG2 and multiple (GM12878 and HepG2) cell lines. (C) Line graph compares cross-data classification performance of TAMC models trained on multiple (GM12878 and HepG2) ATAC-seq data with 150M, 100M and 50M sequencing depth. (D) Line graph compares cross-data classification performance of TAMC (multiple, 150M) with TOBIAS and HINT-ATAC. The performance of models within each line graph were ranked from 1 to 3 for each TF. The higher AUROC is, the lower rank number is given. The points in the line graphs show the average AUROC ranks for 47 TFs for each method/model and the error bars represent standard error or mean (SEM). Difference between the performance of TAMC (multiple, 150M) and the other methods/models were examined using Friedman-Nemenyi test and significant differences are labeled (* p < 0.05; ** p < 0.01; *** p < 0.001). Complete statistic test results for B, C and D are provided in S3 Table. The cell type and sequencing depth of ATAC-seq data used for train TAMC models are indicated within the parenthesis. HINT-ATAC models pre-trained on GM128782 ATAC-seq data were used for all testing, while TOBIAS does not require model training before testing. M denotes million high-quality non-mitochondrial aligned reads. https://doi.org/10.1371/journal.pcbi.1009921.g002 TAMC does not require bias correction during input signal processing Most existing footprinting tools, including TOBIAS and HINT-ATAC, require conducting bias correction during cleavage signal processing to uncover measurable footprint patterns for bound/unbound MPBS classification. However, none of the reported bias correction algorithms can uncover footprint for all TFs faithfully [17]. This makes bias correction a critical step that limits the performance of existing footprint-based methods. To check whether TAMC prediction is affected by bias correction during input signal processing, we compared the classification performance of TAMC models using four different combinations of bias-corrected and uncorrected signals as inputs. Our results showed that the performance of the four TAMC models using uncorrected, partially corrected and fully correct input signals were not significantly different with each other (Fig 3). In addition, all four TAMC models gave better intra-data classification performance than TOBIAS and HINT-ATAC that use bias-corrected signals as optimal inputs (Fig 3). This implies that TAMC models were trained to correct Tn5 cutting bias internally and therefore does not require manual bias correction during input signal processing. Based on these results, we set non-biased corrected input signals as the default input format for TAMC. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 3. TAMC performance is independent of bias correction. Bar graphs compare intra-data classification performance of TAMC models using four different combinations of bias-corrected and uncorrected inputs together with TOBIAS and HINT-ATAC in GM12878 and HepG2 cells. The performance of models within each graph were ranked from 1 to 6 for each TF. The higher AUROC is, the lower the rank number. Average AUROC ranks for 47 TFs for each method/model were shown and the error bars represent SEM. Difference between the performance of default TAMC and the other models were examined using Friedman-Nemenyi test and p-values for significant differences are indicated. Complete statistic test results are provided in S3 Table. bc, bias corrected; nbc, non-bias corrected; ns, non-significant. https://doi.org/10.1371/journal.pcbi.1009921.g003 TAMC generates TF-specific models Both TOBIAS and HINT-ATAC assume that the footprints left by all TFs have the same pattern across the genome–TOBIAS applies the same metric to calculate the footprint scores at all genomic sites and HINT-ATAC uses the model trained for EGR1 to make predictions for all TFs. However, it has been recently reported that the footprint pattern for different TFs are highly heterogeneous from each other [10]. To check whether TAMC can tell the heterogeneity in binding features of different TFs, we compared intra-data prediction performance of TAMC using models trained for the same (intra-TF) or different TFs (cross-TF) as the testing TF (Figs 4 and S4). We found that TAMC give better intra-TF predictions than cross-TF predictions in general (Fig 4). Among the 47 analyzed TFs, 6 TFs (ZNF143, NR2F1, CTCF, MXI1, NFIC and MEF2A) always require TAMC models trained using the same TF for best binding site prediction (Fig 4). In particular, the model trained for CTCF showed extremely high specificity for CTCF binding site prediction (Fig 4). These results suggested that TAMC can tell TF-to-TF differences at their binding sites and generate TF-specific models to improve prediction accuracy. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 4. TAMC generates TF-specific models. Heat map compares intra-TF and cross-TF performance of TAMC using GM12878 ATAC-seq data. For each testing TF, the prediction performance of different TAMC models was scaled using the formula: scaled AUROC = log 2 (AUROC/intra-TF AUROC). Majority of the obtained scaled AUROC values falls within the range from -0.1 to 0.1. Positive scaled AUROC values means better cross-TF predictions than intra-TF predictions, while negative scaled AUROC values means better intra-TF predictions than cross-TF predictions. The rank for intra-TF prediction performance for each testing TF was labeled in the plot. Raw AUROC data are shown in S3 Fig. https://doi.org/10.1371/journal.pcbi.1009921.g004 TAMC captures TF-specific binding features by deep learning To further explore how TAMC exceeds TOBIAS and HINT-ATAC in detecting TFBSs, we tested TAMC with four variant input structures that lack footprint scores (Variant 1), cleavage profile (Variant 2), read strand (Variant 3) and fragment size (Variant 4) information respectively (Fig 5A). By comparing TAMC models using variant 1 and 2 input structures with HINT-ATAC and TOBIAS respectively, we showed that the 1D-CNN model in TAMC makes better predictions than the classical model (e.g., HMM in HINT-ATAC) or non-model metrics (e.g., FPS metric in TOBIAS) even using the same input signals (Fig 5B). Therefore, the high efficiency in capturing complex and subtle features by 1D-CNN model plays an important role in improving TFBS detection ability of TAMC. At the same time, by comparing performance of default and variant TAMC models, we can define what kind of information in the input signals are important for TAMC to make predictions. Our results showed that TAMC performance was drastically compromised for all 47 TFs when the footprint score profile is removed from input (Fig 5B). In addition, loss of cleavage profiles completely impaired TAMC prediction accuracy for more than half of the TFs (Fig 5B). Therefore, both the footprint and cleavage profiles in the inputs provide important information for TAMC to make predictions. PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 5. TAMC captures both footprint and non-footprint features of TFBSs by deep learning. (A) Schematic of variant TAMC input structures that lack footprint score (Variant 1), cleavage profile (Variant 2), ATAC-seq read strand (Variant 3) and ATAC-seq read fragment size (Variant 4) information, respectively. (B) Bar plot and heat map compares intra-data classification performance of TAMC using default and variant input structures together with TOBIAS and HINT-ATAC in GM12878 cells. TFs that require footprint score, cleavage profile, strand or size information within the input signal or the 1D-CNN model for better prediction are indicated in the side bar. (C) Bar plot compares intra-data classification performance of TAMC models using inputs with different fragment size separations for the cleavage profile in GM12878 cells. (D) Bar plot compares intra-data prediction performance of TAMC models using inputs generated from different region length surrounding MPBSs in GM12878 cells. The models are ranked by their performance within each plot. The higher AUROC is, the lower rank number is given. Average AUROC ranks for 47 TFs for each model were shown in the bar plots (B, C, and D) and the error bars represent SEM. Difference between the performance of default TAMC and the other models were examined using Friedman-Nemenyi test and p-values for selected comparisons are indicated. Complete statistic test results are provided in S3 Table. https://doi.org/10.1371/journal.pcbi.1009921.g005 Interestingly, we determined several TFs requiring information provided by the cleavage profile, including read strand (e.g., CTCF) and/or read fragment size (e.g., EGR1) information, for better binding site prediction (Fig 5B). In consistent with this finding, we detected strand-specific pattern in metagene cleavage profiles at CTCF binding sites (S4 Fig), which can only be captured by TAMC using inputs generated by strand-separated signals. On the other hand, we detected a small footprint pattern at EGR1 binding sites only in chevage profiles generated by reads fragments smaller than one nucleosome size (S4 Fig). Therefore, merging cleavage profiles of small and large reads fragments might dilute the footprint pattern and reduce required information used for EGR1 binding detection. These results suggest that TAMC generates TF-specific models by recognizing TF-specific binding features from ATAC-seq data. In addition, the strand-specific cleavage pattern around CTCF binding site might reflect its dimerized binding mechanisms [22], while the shallow footprint pattern at EGR1 binding sites could reflect its transient activity at most of its binding sites [23]. Therefore, although the overall performance of TAMC was not significantly affected by strand and/or fragment size separation (Fig 5B and 5C), we expected that the performance of TAMC in predicting binding site of specific TFs could be further optimized by adjusting these two factors based on biological mechanisms behind their binding events. Notably, we found that shortening the input regions significantly reduced TAMC performance (Fig 5D). This indicates that TAMC not only captures cleavage features at TFBSs but also surrounding cleavage features that could be affected by binding of co-factors. [END] --- [1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009921 Published and (C) by PLOS One Content appears here under this condition or license: Creative Commons - Attribution BY 4.0. via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/