Recognizing Proteins with Binding Function in Elymus nutans Based on Machine Learning Methods

Author(s): Zhe Yang, Juan Wang*, Jia Yang, Zhi Qi, Jiahao He

Journal Name: Combinatorial Chemistry & High Throughput Screening
Accelerated Technologies for Biotechnology, Bioassays, Medicinal Chemistry and Natural Products Research

Volume 23, Issue 6, 2020


Background: This study investigates proteins with binding functions in Elymus nutans. Recognizing protein function is essential to the study of biology, and machine learning methods have been widely used for protein prediction.

Methods: We used BLAST to annotate the functions of Elymus nutans proteins, and machine learning methods to recognize the proteins that BLAST could not annotate, focusing on proteins with binding functions. Features were extracted by four algorithms and then selected with a mutual information estimator. Three classifiers were constructed based on the K-nearest neighbour and gradient boosting algorithms.
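
The pipeline described above (feature extraction, mutual-information feature selection, classification) can be illustrated with a minimal, standard-library sketch. Several parts are assumptions made only for illustration: the abstract does not name the four feature extraction algorithms, so amino acid composition is used as a stand-in; the mutual information score is a simple histogram estimate, not necessarily the estimator used in the study; only a K-nearest neighbour classifier is shown (gradient boosting is omitted); and the sequences and labels are toy data, not Elymus nutans proteins.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Assumed feature extractor: 20-dimensional amino acid composition."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def mutual_information(values, labels, bins=2):
    """Histogram estimate of I(feature; label) in nats (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature: everything falls in bin 0
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    joint, px, py = Counter(zip(binned, labels)), Counter(binned), Counter(labels)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def select_features(X, y, k):
    """Indices of the k features with the highest estimated mutual information."""
    scores = [mutual_information([row[j] for row in X], y)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(X_train, y_train),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data (hypothetical): label 1 = "binding", label 0 = "non-binding".
train_seqs = ["GKSGKTGKS", "GGKTGKSGG", "KSGKTGGKS",
              "LLVVAALLV", "AVLLVVAAL", "VVLLAAVLA"]
train_labels = [1, 1, 1, 0, 0, 0]

X = [composition(s) for s in train_seqs]
selected = select_features(X, train_labels, k=4)

def reduce(features):
    """Keep only the selected feature columns."""
    return [features[j] for j in selected]

X_reduced = [reduce(row) for row in X]
prediction = knn_predict(X_reduced, train_labels,
                         reduce(composition("GKTGKSGKT")))
print("selected feature indices:", selected)
print("predicted label:", prediction)
```

In practice, a library implementation of the mutual information estimator and of gradient boosting would normally replace these hand-written helpers; the sketch only shows how the three stages connect.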

Results and Conclusion: The experimental results identify 848 proteins with ATP binding function, 113 with heme binding function, 315 with zinc-ion binding function, 135 with GTP binding function, and 21 with ADP binding function. Furthermore, we successfully predicted the functions of 10 protein sequences for which no function annotation could be obtained by sequence alignment against seven well-known protein databases. Of these, seven have ATP binding function, one has heme binding function, one has zinc-ion binding function, and one has GTP binding function.

Keywords: Protein, binding function, machine learning, feature, ATP, GTP.

Article Details

Year: 2020
Published on: 04 October, 2020
Page: 554-562
Pages: 9
DOI: 10.2174/1386207323666200330120154