HumDLoc: Human Protein Subcellular Localization Prediction Using Deep Neural Network

Author(s): Rahul Semwal, Pritish Kumar Varadwaj*

Journal Name: Current Genomics

Volume 21, Issue 7, 2020




Abstract:

Aims: To develop a tool that can annotate subcellular localization of human proteins.

Background: With the progress of high-throughput human proteomics projects, an enormous amount of protein sequence data has been generated in recent years. These raw sequence data require precise mapping and annotation of their biological roles and functional attributes. The functional characteristics of a protein depend strongly on its subcellular localization (compartment). A fully automated and reliable subcellular localization prediction system would therefore be very useful for current proteomic research.

Objective: To develop a machine learning-based predictive model that can annotate the subcellular localization of human proteins with high accuracy and precision.

Methods: In this study, we applied the PSI-CD-HIT homology criterion and used sequence-based features of the protein sequences to develop a subcellular localization prediction model. The dataset used to train the HumDLoc model was extracted from a reliable source, the UniProt Knowledgebase, which helps the model generalize to unseen data.
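
As an illustration of what a sequence-based feature can look like, the sketch below computes amino acid composition, i.e. the fraction of each of the 20 standard residues in a sequence. This is only an assumed example: the abstract does not state which features HumDLoc uses, and the function name `amino_acid_composition` and the toy sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Illustrative only: this is one common sequence-based descriptor,
    not necessarily a feature used by HumDLoc.
    """
    seq = sequence.upper()
    # Ignore non-standard symbols such as X, B, Z or gaps.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence (made up for the example, not taken from the dataset).
features = amino_acid_composition("MKTAYIAKQR")
```

A fixed-length vector like `features` can be fed to a classifier regardless of the protein's length, which is the main appeal of composition-style descriptors.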

Results: The proposed model, HumDLoc, was compared with two of the most widely used techniques, CELLO and DeepLoc, and with other machine learning-based tools. The results demonstrated promising predictive performance for HumDLoc across several evaluation metrics: accuracy (≥97.00%), precision (≥0.86), recall (≥0.89), MCC (≥0.86), area under the ROC curve (0.98), and area under the precision-recall curve (0.93).
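
For readers unfamiliar with these metrics, the following sketch shows how accuracy, precision, recall, and the Matthews correlation coefficient (MCC) are computed from the confusion counts of a single class treated one-vs-rest. The counts are invented for illustration and are not results from the paper; the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from one-vs-rest confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators when a class is never predicted
    # or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


# Made-up counts for a rare class: 110 true members out of 1000 proteins.
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```

With an imbalanced class like this one, accuracy comes out at 0.97 while recall and MCC are noticeably lower, which is why reporting precision, recall, and MCC alongside accuracy gives a fuller picture.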

Conclusion: HumDLoc outperformed several alternative tools in predicting the subcellular localization of human proteins. HumDLoc is hosted as a web-based tool at https://bioserver.iiita.ac.in/HumDLoc/.

Keywords: Bioinformatics, subcellular localization, machine learning, human protein, deep learning, deep neural network.




Article Details

VOLUME: 21
ISSUE: 7
Year: 2020
Page: [546 - 557]
Pages: 12
DOI: 10.2174/1389202921999200528160534
