Relevance of Machine Learning Techniques and Various Protein Features in Protein Fold Classification: A Review

Author(s): Komal Patil*, Usha Chouhan.

Journal Name: Current Bioinformatics

Volume 14, Issue 8, 2019

Background: Protein fold prediction is a fundamental step in structural bioinformatics. A protein's tertiary structure determines its function, and fold prediction plays an important role in predicting that structure. A protein fold is the spatial arrangement of the secondary structure elements relative to one another. Research groups worldwide have carried out numerous studies in this field, using different combinations of benchmark datasets, descriptors, features, and classification techniques.

Objective: This review brings these contributions together, analyzes them, and compares the techniques they use.

Methods: The features considered are derived from the protein sequence, its secondary structure, physicochemical properties of amino acids, domain composition, the Position-Specific Scoring Matrix (PSSM), profiles, and threading techniques.
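
As an illustration of the simplest of these feature types, the sketch below computes an amino acid composition vector from a sequence and a three-state composition vector from a secondary structure string, then concatenates them into one feature vector. It is a minimal example under stated assumptions, not the method of any study surveyed here: the function names, the H/E/C secondary structure alphabet, and the toy inputs are all hypothetical choices made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SS_STATES = "HEC"  # assumed three-state alphabet: helix, strand, coil


def composition(text: str, alphabet: str) -> list[float]:
    """Return the fraction of each alphabet symbol in text.

    Symbols outside the alphabet (e.g. 'X' for unknown residues)
    are ignored when counting.
    """
    counts = Counter(ch for ch in text.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("input contains no symbols from the alphabet")
    return [counts[symbol] / total for symbol in alphabet]


def amino_acid_composition(sequence: str) -> list[float]:
    """20-dimensional amino acid composition of a protein sequence."""
    return composition(sequence, AMINO_ACIDS)


def secondary_structure_composition(ss_string: str) -> list[float]:
    """3-dimensional composition of a secondary structure string."""
    return composition(ss_string, SS_STATES)


def combine_features(*feature_groups: list[float]) -> list[float]:
    """Concatenate several feature groups into a single vector."""
    combined: list[float] = []
    for group in feature_groups:
        combined.extend(group)
    return combined


if __name__ == "__main__":
    # Toy inputs, made up for illustration only.
    sequence = "MKVLAAGIVGLLLA"
    predicted_ss = "CCHHHHHHHHEECC"
    features = combine_features(
        amino_acid_composition(sequence),
        secondary_structure_composition(predicted_ss),
    )
    print(len(features))  # 20 + 3 features
```

A vector built this way could be passed to any standard classifier. Richer descriptors such as PSSM-derived features would be appended in the same manner.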

Conclusion: Combining these different features can substantially improve classification accuracy. This survey helps readers identify the most suitable feature set and classification technique for the multi-class protein fold classification problem.

Keywords: Protein fold, protein features, descriptors, data mining, machine learning, classification.



Article Details

Year: 2019
Page: [688 - 697]
Pages: 10
DOI: 10.2174/1574893614666190204154038
