A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization

Wuritu Yang; Xiao-Juan Zhu; Jian Huang; Hui Ding; Hao Lin

Abstract

Background: The location of a protein in a cell can provide important clues to its function in various biological processes. Thus, the application of machine learning methods to the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of the key organelles, the Golgi apparatus is in charge of protein storage, packaging, and distribution.

Objective: Identifying the location of proteins within the Golgi apparatus provides in-depth insight into their functions, and machine learning-based methods for predicting protein location in the Golgi apparatus have therefore been extensively explored. The development of protein sub-Golgi localization prediction should be reviewed to provide comprehensive background for the field.

Method: The benchmark datasets, feature extraction techniques, machine learning methods, and published results were summarized.

Results: We briefly introduced recent progress in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed the advantages and disadvantages of these methods.

Conclusion: We outlined future perspectives for machine learning methods in protein sub-Golgi localization prediction.

Keywords: Golgi apparatus, machine learning method, feature vector, feature selection technique, webserver, benchmark dataset.
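
As an illustration of what a "feature vector" means in this setting, the sketch below computes the amino acid composition of a protein sequence, a simple 20-dimensional sequence descriptor. It is not taken from the surveyed papers: the function name `amino_acid_composition` and the toy sequence are hypothetical, and the methods reviewed here generally use richer descriptors.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Each entry is the fraction of standard residues in the sequence that are
    the corresponding amino acid. Non-standard characters are ignored.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, made up for illustration only.
toy_sequence = "MKTAYIAKQR"
features = amino_acid_composition(toy_sequence)
```

A vector like `features` could then be passed, together with vectors for other proteins and their known labels, to a feature selection step and a classifier.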




Journal: Current Bioinformatics
DOI: https://dx.doi.org/10.2174/1574893613666181113131415
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher
