Background: Protein phosphorylation is one of the most important post-translational
modifications (PTMs), occurring at the amino acid residues serine (S), threonine (T), and tyrosine (Y).
It plays a critical role in predicting protein structure and function. With the development of novel
high-throughput sequencing technologies, a huge number of protein sequences are being
generated and stored in databases.
Objective: Quickly and accurately predicting which S, T, or Y residues can be phosphorylated
is of great importance in both basic research and drug development.
Methods: To address this problem, a novel hybrid deep learning model combining a convolutional
neural network with a bidirectional long short-term memory recurrent neural network
(CNN+BLSTM) is proposed for predicting phosphorylation sites in proteins. The model consists of a
sequence of layers that transform the input data into an output class: the convolution layer captures
higher-level abstract features of amino acids, while the recurrent layer captures long-term
dependencies between amino acids to improve predictions. The joint model learns interactions between
higher-level features derived from the protein sequence to predict the phosphorylated sites.
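As an illustration of how a protein sequence might be turned into model input, the following minimal sketch extracts a fixed-length window around each candidate S, T, or Y residue and one-hot encodes it. The abstract does not specify the encoding, so the window size (`flank`), the padding symbol (`PAD`), the one-hot scheme, and the function names are all assumptions made for this example, not the authors' preprocessing.

```python
from typing import Iterator, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"                             # hypothetical padding symbol for sequence ends
TARGETS = {"S", "T", "Y"}             # residues that can be phosphorylated


def candidate_windows(sequence: str, flank: int = 3) -> Iterator[Tuple[int, str]]:
    """Yield (position, window) for every S/T/Y residue.

    Each window has length 2 * flank + 1 and is centered on the candidate
    residue. The flank size is an assumed value, not taken from the paper.
    """
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue in TARGETS:
            yield i, padded[i : i + 2 * flank + 1]


def one_hot(window: str) -> List[List[int]]:
    """Encode a window as a (length x 20) binary matrix.

    Padding and non-standard residues become all-zero rows.
    """
    return [[1 if residue == aa else 0 for aa in AMINO_ACIDS] for residue in window]


if __name__ == "__main__":
    for position, window in candidate_windows("MSATKY", flank=2):
        print(position, window, len(one_hot(window)))
```

Each encoded window would then be passed to the convolution and recurrent layers, which output a class label for the central residue.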
Results: We compared our model with two established methods, iPhos-PseEn and
MusiteDeep. Five-fold cross-validation indicated that CNN+BLSTM outperforms the two
competitors on several evaluation metrics, including the areas under the receiver operating characteristic
and precision-recall curves, the Matthews correlation coefficient, F-measure, and accuracy.
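For readers unfamiliar with the threshold-based metrics named above, the sketch below computes accuracy, F-measure, and the Matthews correlation coefficient from the counts of a binary confusion matrix. It assumes the F-measure is the balanced F1 score (the abstract does not state the weighting), and the function name `binary_metrics` and the example counts are hypothetical, not results from the paper.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, F1, and MCC from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall (assumed beta = 1).
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Unlike accuracy, the Matthews correlation coefficient stays informative when phosphorylated sites are much rarer than non-phosphorylated ones, which is the usual situation for this task.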
Conclusion: CNN+BLSTM is promising in identifying potential protein phosphorylation for further