Background: Chromosomal DNA contains most of the genetic information of
eukaryotes and plays an important role in the growth, development and reproduction of living
organisms. Most chromosomal DNA sequences are known to wrap around histones, and
distinguishing these DNA sequences from ordinary DNA sequences is important for understanding
the genetic code of life. The main difficulty behind this problem is the feature selection process.
DNA sequences have no explicit features, and common representation methods, such as one-hot
coding, suffer from high dimensionality. Recently, deep learning models
have been shown to extract useful features automatically from input patterns.
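To make the dimensionality issue of one-hot coding concrete, the following minimal sketch encodes a sequence nucleotide by nucleotide. The helper name `one_hot` and the four-letter alphabet are illustrative assumptions, not taken from the study.

```python
# Hypothetical helper for illustration only; not the encoding code used in the study.
ALPHABET = "ACGT"

def one_hot(sequence):
    """Return a flat 0/1 list with one block of 4 entries per nucleotide."""
    encoded = []
    for base in sequence.upper():
        # Symbols outside ALPHABET (e.g. 'N') become an all-zero block.
        encoded.extend(1 if base == symbol else 0 for symbol in ALPHABET)
    return encoded

vector = one_hot("ACGT")
print(len(vector))  # 4 positions x 4 symbols = 16
```

Under this assumed encoding, the vector length is four times the sequence length and each block of four entries holds at most a single 1, so the representation grows quickly while staying sparse.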
Objective: We aim to investigate which deep learning networks could achieve notable
improvements in the field of DNA sequence classification using only sequence information.
Methods: We present four deep learning architectures based on convolutional
neural networks (CNNs) and long short-term memory (LSTM) networks for chromosomal DNA
sequence classification. A natural language model (Word2vec) was used to generate word embeddings
of the sequences, from which the deep learning models learn features.
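Treating a DNA sequence as text requires splitting it into "words" first. The sketch below assumes overlapping k-mers as words, a common choice that this abstract does not confirm; the function name `kmer_tokens` and the parameters `k` and `stride` are hypothetical.

```python
# Illustrative sketch: overlapping k-mers as "words" is an assumption,
# not a detail stated in the abstract.
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers, moving `stride` bases at a time."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

def build_vocabulary(sentences):
    """Map each distinct k-mer to an integer index, in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for token in sentence:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

sentence = kmer_tokens("ATGCGTA", k=3)
print(sentence)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']
print(build_vocabulary([sentence]))
```

Token lists of this kind could then be passed as sentences to a Word2vec implementation, which would replace each k-mer with a dense, low-dimensional vector before the CNN or LSTM layers.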
Results: The four architectures were compared on 10 chromosomal DNA
datasets. The results show that the architecture combining convolutional neural networks with
long short-term memory networks is superior to other methods with regard to the accuracy of
chromosomal DNA prediction.
Conclusion: In this study, four deep learning models were compared for automatic classification
of chromosomal DNA sequences without any sequence preprocessing steps. In particular, we have
regarded DNA sequences as natural language and extracted word embeddings with Word2vec to
represent DNA sequences. The results show the superiority of the CNN+LSTM model in the ten
classification tasks. The reason for this success is that the CNN module captures the regulatory
motifs, while the following LSTM layer captures the long-term dependencies between them.