Background: Protein phosphorylation is one of the most important post-translational modifications (PTMs), occurring at serine (S), threonine (T), and tyrosine (Y) residues. It plays critical roles in protein structure and function. With the development of novel high-throughput sequencing technologies, a huge number of protein sequences are being generated and stored in databases.
Objective: It is of great importance in both basic research and drug development to quickly and accurately predict which S, T, or Y residues can be phosphorylated.
Methods: To address this problem, a novel hybrid deep learning model combining a convolutional neural network with a bi-directional long short-term memory recurrent neural network (CNN+BLSTM) is proposed for predicting phosphorylation sites in proteins. The model consists of a sequence of layers that transform the input data into an output class: the convolution layer captures higher-level abstract features of amino acids, while the recurrent layer captures long-term dependencies between amino acids to improve predictions. The joint model learns interactions between the higher-level features derived from the protein sequence to predict phosphorylated sites.
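Before a convolution layer can operate on a protein, each candidate S, T, or Y residue has to be presented as a fixed-size numeric input. The following is a minimal sketch of one common way to do this, not the preprocessing used by CNN+BLSTM itself: the window size (`flank`), the padding symbol, the one-hot encoding, and the function names `candidate_windows` and `one_hot` are all assumptions made for illustration.

```python
# Hypothetical preprocessing sketch. The window size, padding symbol and
# one-hot encoding are illustrative assumptions, not the paper's settings.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"                             # fills windows that run past the ends
TARGETS = {"S", "T", "Y"}             # residues that can be phosphorylated


def candidate_windows(sequence, flank=3):
    """Yield (position, window) for every S/T/Y residue.

    Each window has length 2 * flank + 1 and is centred on the candidate.
    """
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue in TARGETS:
            yield i, padded[i:i + 2 * flank + 1]


def one_hot(window):
    """Encode a window as a list of 20-dimensional 0/1 rows.

    Padding (or any non-standard symbol) becomes an all-zero row.
    """
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in window:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for pos, window in candidate_windows("MKSAYT", flank=2):
        print(pos, window, len(one_hot(window)))
```

Each encoded window is a (window length) x 20 matrix, which is the kind of input a one-dimensional convolution followed by a bi-directional recurrent layer would typically consume.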
Results: We applied our model alongside two established methods, iPhos-PseEn and MusiteDeep. Five-fold cross-validation indicated that CNN+BLSTM outperforms both competitors on several evaluation metrics, including the areas under the receiver operating characteristic and precision-recall curves, the Matthews correlation coefficient, F-measure, and accuracy.
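For reference, the threshold-based metrics named above can be computed from the confusion counts of a binary site classifier. The sketch below uses the standard definitions; it assumes that "F-measure" means the balanced F1 score and that labels are coded as 1 (phosphorylated) and 0 (not phosphorylated). The function name `binary_metrics` and the example labels are hypothetical, and the curve-based metrics are not covered.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1 and MCC for 0/1 labels (standard definitions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Assumption: F-measure is taken as F1 (harmonic mean of precision/recall).
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from the study.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(truth, preds))
```

In a 5-fold cross-validation setting, such a function would be applied to the held-out predictions of each fold and the results averaged.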
Conclusion: CNN+BLSTM is a promising tool for identifying potential protein phosphorylation sites for further experimental validation.