A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization

Author(s): Wuritu Yang, Xiao-Juan Zhu, Jian Huang, Hui Ding, Hao Lin*.

Journal Name: Current Bioinformatics

Volume 14, Issue 3, 2019

Background: The location of a protein in a cell can provide important clues to its function in various biological processes. Thus, the application of machine learning methods to the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of the key organelles, the Golgi apparatus is in charge of protein storage, packaging, and distribution.

Objective: Identifying the location of proteins within the Golgi apparatus will provide in-depth insights into their functions. Thus, machine learning-based methods for predicting protein location in the Golgi apparatus have been extensively explored. The development of protein sub-Golgi localization prediction should be reviewed to provide a comprehensive background for the field.

Method: The benchmark datasets, feature extraction methods, machine learning methods, and published results were summarized.

Results: We briefly introduced recent progress in protein sub-Golgi localization prediction using machine learning methods and discussed the advantages and disadvantages of these methods.

Conclusion: We outlined the prospects of machine learning methods in protein sub-Golgi localization prediction.

Keywords: Golgi apparatus, machine learning method, feature vector, feature selection technique, webserver, benchmark dataset.

Article Details

Year: 2019
Page: [234 - 240]
Pages: 7
DOI: 10.2174/1574893613666181113131415