Background: With the rapid development of sequencing methods in recent years,
binding sites have been systematically identified with tools such as Nested-MICA and MEME.
Predicting DNA motifs with higher accuracy and precision remains an important task for
bioinformaticians. However, experimental approaches are still time-consuming for large data sets,
making computational identification of binding sites indispensable.
Objective: To facilitate the identification of binding sites, we propose a deep learning
architecture, named Deep-BSC (Deep-Learning Binary Search Classification), that predicts binding
sites in raw DNA sequence with higher precision and accuracy.
Methods: The proposed architecture relies purely on the raw DNA sequence and uses a convolutional
neural network (CNN) to predict protein binding sites. We trained the deep learning
model on binding sites at the nucleotide level. DNA sequences of A. thaliana are used in this study
because it is a model plant.
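As an illustration of how a CNN can operate directly on raw sequence, the following minimal sketch one-hot encodes a DNA string, scans it with a single convolutional filter, and applies global max-pooling. It is not the Deep-BSC implementation: the function names `one_hot` and `conv1d_valid`, the example sequence, and the hand-made "TATA" filter are all assumptions chosen for illustration, whereas a trained network would learn its filters from data.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def conv1d_valid(x, filters):
    """Slide each filter along the sequence without padding.

    x:       (length, 4) one-hot sequence
    filters: (n_filters, width, 4)
    returns: (length - width + 1, n_filters) activations
    """
    n_filters, width, _ = filters.shape
    n_pos = x.shape[0] - width + 1
    out = np.empty((n_pos, n_filters), dtype=np.float32)
    for p in range(n_pos):
        window = x[p:p + width]
        # Dot product of every filter with the current window.
        out[p] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out

# Hypothetical example: one hand-made filter that matches "TATA".
seq = "GGTATAGG"
x = one_hot(seq)
filters = one_hot("TATA")[np.newaxis]

activations = np.maximum(conv1d_valid(x, filters), 0.0)  # ReLU
pooled = activations.max(axis=0)                          # global max-pooling
best_pos = int(activations[:, 0].argmax())
print(pooled, best_pos)
```

In this toy case the filter responds most strongly at position 2, where "TATA" occurs, and max-pooling keeps only that strongest response regardless of where in the sequence the match lies.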
Results: The results demonstrate the effectiveness and efficiency of our deep learning method in
classifying binding sites against random sequences. We constructed CNNs with different
layer and filter configurations to show the usefulness of max-pooling in the proposed method. To make
the approach interpretable, we further visualized binding sites in saliency maps and
successfully identified similar motifs in the raw sequence. The proposed computational framework
is time- and resource-efficient.
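The idea behind a saliency map can be sketched on a toy model. The example below assumes a single-filter scorer with global max-pooling, for which the gradient of the score with respect to the input is simply the filter placed at the winning window. The names `one_hot` and `score_and_saliency` and the "TATA" filter are hypothetical. For a trained multi-layer CNN such as the one described here, the gradient would instead be obtained by automatic differentiation in a deep learning framework.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix over A, C, G, T."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = "ACGT".find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def score_and_saliency(x, filt):
    """Score a sequence with one filter and return a per-base saliency.

    x:    (length, 4) one-hot sequence
    filt: (width, 4) filter weights
    The model is score = max over positions of sum(filter * window).
    """
    width = filt.shape[0]
    n_pos = x.shape[0] - width + 1
    scores = np.array([np.sum(x[p:p + width] * filt) for p in range(n_pos)])
    best = int(np.argmax(scores))

    # Gradient of the max-pooled score with respect to the input:
    # the filter weights at the winning window, zero elsewhere.
    grad = np.zeros_like(x)
    grad[best:best + width] = filt

    # "Gradient times input" saliency, collapsed to one value per base.
    saliency = np.abs(grad * x).sum(axis=1)
    return float(scores[best]), saliency

# Hypothetical example with a hand-made "TATA" filter.
seq = "GGTATAGG"
score, saliency = score_and_saliency(one_hot(seq), one_hot("TATA"))
for base, s in zip(seq, saliency):
    print(base, s)
```

Bases with non-zero saliency are those that contributed to the prediction. Here they coincide with the "TATA" occurrence, which is the sense in which a saliency map can highlight motif-like regions in the raw sequence.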
Conclusion: Deep-BSC enables the identification of binding sites in DNA sequences via a
highly accurate CNN. The proposed computational framework can also be applied to related problems,
such as identifying operators, genomic repeats, DNA markers, and enzyme recognition sites, thereby
promoting the use of the Deep-BSC method in the life sciences.