Predicting Protein Phosphorylation Sites Based on Deep Learning

Author(s): Haixia Long*, Zhao Sun, Manzhi Li, Hai Yan Fu, Ming Cai Lin*

Journal Name: Current Bioinformatics

Volume 15, Issue 4, 2020



Background: Protein phosphorylation is one of the most important post-translational modifications (PTMs), occurring at the amino acid residues serine (S), threonine (T), and tyrosine (Y). It plays critical roles in protein structure and function. With the development of novel high-throughput sequencing technologies, a huge number of protein sequences are being generated and stored in databases.

Objective: Quickly and accurately predicting which S, T, or Y residues can be phosphorylated is of great importance in both basic research and drug development.

Methods: To address this problem, a novel hybrid deep learning model that combines a convolutional neural network with a bi-directional long short-term memory recurrent neural network (CNN+BLSTM) is proposed for predicting phosphorylation sites in proteins. The model consists of a sequence of layers that transform the input data into an output class: the convolution layer captures higher-level abstract features of the amino acids, while the recurrent layer captures long-term dependencies between amino acids to improve predictions. The joint model learns interactions between the higher-level features derived from the protein sequence to predict the phosphorylated sites.
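The abstract does not state how protein sequences are presented to the network. A common choice for site-level predictors, assumed here purely for illustration, is to cut a fixed-length window around each candidate S, T, or Y residue and one-hot encode it before the convolution layer. The window half-width, the padding symbol, and the function names below are hypothetical and are not taken from the paper.

```python
# Hypothetical preprocessing sketch; NOT the authors' published pipeline.
# Assumptions: 20 standard amino acids, '-' as padding, symmetric windows.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed symbol for positions that fall outside the sequence
TARGETS = {"S", "T", "Y"}  # residues that can be phosphorylated


def candidate_windows(sequence, half_width=4):
    """Yield (position, window) for every S, T, or Y residue in the sequence."""
    padded = PAD * half_width + sequence + PAD * half_width
    for i, residue in enumerate(sequence):
        if residue in TARGETS:
            # The candidate residue sits at the center of the window.
            yield i, padded[i : i + 2 * half_width + 1]


def one_hot(window):
    """Encode a window as one 20-dimensional indicator vector per position.

    Padding (or any non-standard symbol) becomes an all-zero vector.
    """
    return [[1 if aa == residue else 0 for aa in AMINO_ACIDS] for residue in window]


# Toy sequence, chosen only to show the shape of the output.
windows = list(candidate_windows("MKSAYLT", half_width=2))
encoded = [one_hot(window) for _, window in windows]
```

Each encoded window is a (window length x 20) matrix, which is the kind of input a one-dimensional convolution followed by a bi-directional recurrent layer would consume.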

Results: We compared our model with two established methods, iPhos-PseEn and MusiteDeep. Five-fold cross-validation indicated that CNN+BLSTM outperforms both competitors on a range of evaluation metrics, including the areas under the receiver operating characteristic (ROC) and precision-recall curves, the Matthews correlation coefficient, F-measure, and accuracy.
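For readers unfamiliar with the threshold-based metrics named above, the following minimal sketch computes accuracy, F-measure, and the Matthews correlation coefficient from binary labels using their standard definitions. The function name and the toy labels are illustrative assumptions; they are not data or code from the paper.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, F-measure (F1), and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    # MCC uses all four confusion-matrix cells, so it stays informative
    # when positives (phosphorylated sites) are much rarer than negatives.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


# Toy labels for illustration only (1 = phosphorylated, 0 = not).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

In a 5-fold cross-validation setting, these scores would typically be computed on each held-out fold and then averaged.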

Conclusion: CNN+BLSTM is a promising tool for identifying potential protein phosphorylation sites for further experimental validation.

Keywords: Phosphorylation sites, deep learning, convolutional neural network, bi-directional long short-term memory recurrent neural network, ROC curve, Precision-recall curve.



Article Details

Year: 2020
Page: [300 - 308]
Pages: 9
DOI: 10.2174/1574893614666190902154332
