A Survey on Computational Methods for Essential Proteins and Genes Prediction

Author(s): Ming Fang, Xiujuan Lei*, Ling Guo

Journal Name: Current Bioinformatics

Volume 14, Issue 3, 2019

Background: Essential proteins are indispensable for the survival or reproduction of an organism and help maintain the stability of the cellular system; together they form the minimum set of proteins required to sustain a living cell. Identifying them matters both for understanding the minimal requirements of cellular life and for discovering human disease genes and drug targets more efficiently. Experimental identification of essential proteins is complex and typically costly and time-consuming. With the accumulation of high-throughput experimental data, many computational methods have been proposed to identify essential proteins as a useful complement to experimental approaches. The ability to identify essential proteins rapidly and precisely is also of great significance for drug design and has great potential for applications in basic and synthetic biology research.

Objective: This paper reviews the identification of essential proteins and genes, focusing on current developments in different types of computational methods. It points out the progress and limitations of existing methods and discusses the challenges and directions for further research.

Keywords: Essential proteins, essential genes, machine learning algorithms, computational techniques, ensemble methods, Protein-Protein Interaction Network (PIN).

Article Details

Year: 2019
Published on: 07 March, 2019
Page: [211 - 225]
Pages: 15
DOI: 10.2174/1574893613666181112150422