Advances in the Prediction of Protein Subcellular Locations with Machine Learning

Author(s): Ting-He Zhang, Shao-Wu Zhang*

Journal Name: Current Bioinformatics

Volume 14, Issue 5, 2019

Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, it is highly desirable to develop computational methods that identify protein subcellular locations efficiently and effectively. In particular, the rapidly increasing number of protein sequences entering the genome databases calls for the development of automated analysis methods.

Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, and v) web servers.

Result & Conclusion: With the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins located at multiple subcellular sites.

Keywords: Protein subcellular location, prediction, dataset construction, feature representation, machine learning, protein sequences.

Article Details

Year: 2019
Published on: 27 June, 2019
Page: [406 - 421]
Pages: 16
DOI: 10.2174/1574893614666181217145156