A Review of DNA-binding Proteins Prediction Methods

Author(s): Kaiyang Qu, Leyi Wei, Quan Zou*.

Journal Name: Current Bioinformatics

Volume 14, Issue 3, 2019

Background: DNA-binding proteins, which bind to DNA, are widespread in living cells and participate in many DNA-related cellular activities, for instance DNA replication, transcription, recombination, and DNA repair.

Objective: Given the importance of DNA-binding proteins, predicting them has been a popular research topic over the past decades. In this article, we review current machine-learning methods for predicting DNA-binding proteins, covering feature representation methods, classifiers, measurements, datasets, and existing web servers.

Method: Prediction methods for DNA-binding proteins can be divided into two types: those based on amino acid composition and those based on protein structure. In this article, we follow these two types to introduce the application of machine learning to DNA-binding protein prediction.
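
As an illustration of the first type, the simplest composition-based representation maps a protein sequence to a 20-dimensional vector of residue frequencies. The following is a minimal sketch of that idea, not the implementation of any method reviewed in the article; the function name `amino_acid_composition` and the choice to ignore non-standard residues are assumptions made for the example.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Hypothetical helper for illustration: residues outside the 20 standard
    amino acids (for example 'X') are ignored rather than counted.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Frequency of each residue, in the order given by AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

A vector like this can be passed to any standard classifier; richer representations add sequence-order or evolutionary information on top of the plain frequencies.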

Results: Machine learning plays an important role in the classification of DNA-binding proteins and achieves good results: the best reported accuracy (ACC) is above 80%.
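
ACC is read here as standard classification accuracy, the fraction of proteins labelled correctly. The sketch below computes it from confusion-matrix counts under that assumption; the function name `accuracy` and the example counts are hypothetical and are not results from the article.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN).

    tp/fn: DNA-binding proteins predicted correctly / missed.
    tn/fp: non-binding proteins predicted correctly / wrongly flagged.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Illustrative counts only: 85 of 100 proteins classified correctly.
example_acc = accuracy(tp=40, tn=45, fp=5, fn=10)
```

Accuracy alone can be misleading on imbalanced datasets, which is why it is usually reported together with sensitivity, specificity, and similar measurements.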

Conclusion: Machine learning can be widely used in many aspects of bioinformatics, especially in protein classification. Some issues should be considered in future work. First, the relationship between the number of features and performance must be explored. Second, because many features are used to predict DNA-binding proteins, solutions for high-dimensional feature spaces should be proposed.

Keywords: DNA-binding protein, prediction, feature representation methods, measurements, classifiers, web servers.



Article Details

Year: 2019
Page: [246 - 254]
Pages: 9
DOI: 10.2174/1574893614666181212102030