A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods

Jun Zhang; Bin Liu

Abstract

Background: Proteins play a crucial role in life activities such as catalyzing metabolic reactions, DNA replication, and responding to stimuli. Identifying protein structures and functions is critical for both basic research and applications. Because traditional experiments for studying protein structures and functions are expensive and time-consuming, computational approaches are highly desired. The key to computational methods is how to efficiently extract features from protein sequences. During the last decade, many powerful feature extraction algorithms have been proposed, significantly advancing the study of protein structures and functions.

Objective: To help researchers catch up with recent developments in this important field, this study gives an updated review focusing on sequence-based feature extraction from protein sequences.

Method: The sequence-based features of proteins were grouped into three categories: composition-based features, autocorrelation-based features, and profile-based features. The features in each group were described in detail, and their advantages and disadvantages were discussed. Some useful tools for generating these features were also introduced.
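To make the first two categories concrete, the following is a minimal sketch rather than the implementation used by any tool covered in the review. It computes a composition-based feature (normalized k-mer frequencies) and an autocorrelation-based feature (auto covariance of one physicochemical property along the sequence). The function names, the default lag range, and the choice of the Kyte-Doolittle hydropathy scale as the property are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, chosen here only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmer_composition(seq, k=1):
    """Composition-based feature: frequency of each of the 20**k possible k-mers."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # windows with non-standard residues are skipped
            counts[word] += 1
    return [counts[w] / total for w in kmers]


def auto_covariance(seq, max_lag=3, scale=HYDROPATHY):
    """Autocorrelation-based feature: covariance of one property between
    residues separated by lag = 1..max_lag along the sequence."""
    values = [scale[a] for a in seq]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    features = []
    for lag in range(1, max_lag + 1):
        s = sum((values[i] - mean) * (values[i + lag] - mean)
                for i in range(n - lag))
        features.append(s / (n - lag))
    return features
```

With k = 1 the composition vector has 20 entries and discards residue order entirely, whereas the auto covariance vector has one entry per lag and retains some order information. This gives an intuition for why the two categories can behave differently as inputs to a predictor.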

Results: Generally, autocorrelation-based features outperform composition-based features, and profile-based features outperform autocorrelation-based features. The reason is that profile-based features incorporate evolutionary information, which is useful for identifying protein structures and functions. However, profile-based features are more time-consuming to compute, because they require multiple sequence alignment.
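The idea behind a profile-based feature can be sketched as follows, under strong simplifying assumptions. In practice the profile is usually produced by a database search tool such as PSI-BLAST, which is the costly step. Here a tiny hand-written alignment (purely hypothetical data) stands in for that step, and the helper names `frequency_profile` and `profile_composition` are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment):
    """Column-wise amino acid frequencies for aligned, equal-length sequences.
    Gap characters and non-standard residues are ignored."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] in AMINO_ACIDS]
        denom = len(residues) or 1  # avoid division by zero on all-gap columns
        profile.append([residues.count(a) / denom for a in AMINO_ACIDS])
    return profile


def profile_composition(profile):
    """Average each amino acid's frequency over all positions (20 values)."""
    n = len(profile)
    return [sum(row[j] for row in profile) / n
            for j in range(len(AMINO_ACIDS))]


# Hypothetical toy alignment of three homologous fragments.
toy_alignment = ["MKV", "MRV", "MK-"]
toy_profile = frequency_profile(toy_alignment)
toy_features = profile_composition(toy_profile)
```

Unlike plain composition, the second column of this profile records that both K and R occur at the same position among the homologs, which is the kind of evolutionary information a single sequence cannot supply.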

Conclusion: In this study, some recently proposed sequence-based features were introduced and discussed, such as basic k-mers, PseAAC, auto-cross covariance, and top-n-grams. These features have made great contributions to the development of protein sequence analysis. Future studies can focus on exploring combinations of these features. In addition, techniques from other fields, such as signal processing, natural language processing (NLP), and image processing, could also contribute to this important field, because natural languages (such as English) and protein sequences share some similarities. Proteins can therefore be treated as documents, and features such as k-mers, top-n-grams, and motifs can be treated as the words of the language. Techniques from these fields will offer new ideas and strategies for extracting features from proteins.
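The document analogy can be illustrated with a small sketch that borrows a standard NLP weighting, TF-IDF. This is an assumption chosen for illustration, not a method prescribed by the review. Each protein is split into overlapping k-mer "words", and each word is weighted by how specific it is to a sequence within the collection. The function names and the default k = 2 are hypothetical.

```python
import math
from collections import Counter


def tokenize(seq, k=2):
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tf_idf(sequences, k=2):
    """Return (vocabulary, matrix) where matrix[d][w] is the TF-IDF weight
    of word w in sequence d, using idf = log(N / document_frequency)."""
    docs = [Counter(tokenize(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    doc_freq = {w: sum(1 for d in docs if w in d) for w in vocab}
    matrix = []
    for d in docs:
        total = sum(d.values()) or 1  # guard against sequences shorter than k
        matrix.append([(d[w] / total) * math.log(n_docs / doc_freq[w])
                       for w in vocab])
    return vocab, matrix


vocab, weights = tf_idf(["MKVK", "MKAA"], k=2)
```

Words that occur in every sequence (here "MK") receive zero weight, while words found in only one sequence are emphasized, mirroring how common words are down-weighted in text retrieval.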

Keywords: Amino acids, review, protein structure and function prediction, feature extraction, protein representation, covariance.




DOI: https://dx.doi.org/10.2174/1574893614666181212102749
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
