Citrullination Site Prediction by Incorporating Sequence Coupled Effects into PseAAC and Resolving Data Imbalance Issue

Author(s): Md. Al Mehedi Hasan, Md Khaled Ben Islam, Julia Rahman*, Shamim Ahmad.

Journal Name: Current Bioinformatics

Volume 15, Issue 3, 2020

Background: Post-translational modification is a biomolecular mechanism in living organisms that introduces functional diversity into proteins and regulates cellular processes. The conversion of an arginine residue to citrulline in a protein is one such modification.

Objective: Our objective is to identify citrullinated arginine residue sites quickly and accurately.

Methods: In this study, a novel computational tool, predCitru-Site, has been developed to predict citrullination sites. The method incorporates the sequence-coupling effect of the amino acids surrounding arginine residues and optimizes the skewed citrullination training dataset to improve prediction quality. For comparability with existing tools, the performance of predCitru-Site was measured as the average over 5 complete runs of the 10-fold cross-validation test.
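As a rough illustration of what a sequence-coupling encoding can look like, the sketch below scores each flanking position of a peptide window by the difference between two conditional probabilities: the probability of seeing a residue given its neighbour toward the central arginine, estimated once from positive windows and once from negative windows. This follows the general form of sequence-coupled models and is not taken from the abstract: the function names (`coupling_table`, `cond_prob`, `encode`), the window length, and the toy peptides are all assumptions made for illustration, and the actual predCitru-Site feature construction may differ.

```python
from collections import Counter

def coupling_table(peptides):
    """Count residue/neighbour co-occurrences per flanking position.

    Assumes all peptides have the same odd length with the target
    arginine at the centre (a hypothetical window layout).
    """
    centre = len(peptides[0]) // 2
    pair_counts = Counter()       # (position, residue, neighbour) -> count
    neighbour_counts = Counter()  # (position, neighbour) -> count
    for pep in peptides:
        for i, aa in enumerate(pep):
            if i == centre:
                continue
            # Condition on the adjacent residue that lies toward the centre.
            j = i + 1 if i < centre else i - 1
            pair_counts[(i, aa, pep[j])] += 1
            neighbour_counts[(i, pep[j])] += 1
    return pair_counts, neighbour_counts

def cond_prob(table, i, aa, neighbour):
    """Estimate P(residue at i | neighbour); 0.0 if the neighbour was never seen."""
    pair_counts, neighbour_counts = table
    total = neighbour_counts[(i, neighbour)]
    return pair_counts[(i, aa, neighbour)] / total if total else 0.0

def encode(peptide, pos_table, neg_table):
    """Return one feature per flanking position: P_positive - P_negative."""
    centre = len(peptide) // 2
    features = []
    for i, aa in enumerate(peptide):
        if i == centre:
            continue
        j = i + 1 if i < centre else i - 1
        features.append(
            cond_prob(pos_table, i, aa, peptide[j])
            - cond_prob(neg_table, i, aa, peptide[j])
        )
    return features

# Toy windows, invented for illustration only (not real citrullination data).
positives = ["AGRSA", "AGRSL"]
negatives = ["LKRDE", "AKRDE"]
pos_table = coupling_table(positives)
neg_table = coupling_table(negatives)
features = encode("AGRSA", pos_table, neg_table)
print(features)
```

Positive feature values indicate residue pairs that are more typical of the positive windows, negative values the reverse. A vector of this kind could then be passed to a classifier such as a support vector machine; handling of the skewed training set is a separate step that this sketch does not cover.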

Results and Conclusion: predCitru-Site achieved 97.6% sensitivity, 98.9% specificity, and an overall accuracy of 98.5%, with a Matthews correlation coefficient of 0.967 and an area under the receiver operating characteristic curve (AUC) of 0.997. On the same benchmark dataset, predCitru-Site significantly outperforms existing tools. It also shows significant improvement on independent tests across all performance metrics (around 50% higher in AUC). These results suggest that the method is promising and can be used as a complementary technique for fast exploration of citrullination at arginine residues. A user-friendly web server has also been deployed for the convenience of experimental scientists.
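The four threshold-based metrics reported above have standard definitions in terms of confusion-matrix counts. The sketch below computes them for a single set of counts; the function name `site_metrics` and the example counts are hypothetical and are not the study's data. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math

def site_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)              # fraction of true sites recovered
    specificity = tn / (tn + fp)              # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts for an imbalanced test set (illustration only).
example = site_metrics(tp=90, fn=10, tn=180, fp=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced data, accuracy alone can look strong even when the minority class is poorly predicted, which is one reason the Matthews correlation coefficient is commonly reported alongside it.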

Keywords: Citrullination sites prediction, sequence-coupling model, general PseAAC, data imbalance issue, support vector machine, computational.

Article Details

Year: 2020
Page: [235 - 245]
Pages: 11
DOI: 10.2174/1574893614666191202152328