Predicting Thermophilic Proteins by Machine Learning

Xian-Fang Wang; Peng Gao; Yi-Feng Liu; Hong-Fei Li; Fan Lu

Abstract

Background: Thermophilic proteins remain active at high temperatures, so studying them is important for understanding the thermal stability of proteins.

Objective: To address the low precision and low efficiency of thermophilic protein prediction, this paper proposes a prediction method based on feature fusion and machine learning.

Methods: For the selected thermophilic datasets, each protein sequence was first represented by a fused feature vector combining g-gap dipeptide, entropy density, and autocorrelation coefficient features. Kernel Principal Component Analysis (KPCA) was then used to reduce the dimensionality of these features, which shortens training time and improves efficiency. Finally, a classification model was built using a classification algorithm.
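As an illustration of the first feature type, the sketch below computes a g-gap dipeptide composition vector in plain Python. It follows the commonly used definition (frequency of residue pairs separated by g positions); the function name `g_gap_dipeptide`, the default `g=0`, and the decision to skip pairs containing non-standard residues are assumptions made for this example, not details taken from the paper.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide(seq, g=0):
    """Return a 400-dimensional g-gap dipeptide frequency vector.

    A g-gap dipeptide is the pair (seq[i], seq[i + g + 1]), i.e. two
    residues separated by g intervening positions. g=0 gives the
    ordinary adjacent dipeptide composition.
    """
    seq = seq.upper()
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        # Assumption: pairs with non-standard residues (e.g. 'X') are skipped.
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


# Toy sequence, for illustration only.
features = g_gap_dipeptide("ACAC", g=1)
```

In a fused representation, a vector like this would be concatenated with the entropy density and autocorrelation features before dimensionality reduction.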

Results: Several classification algorithms were trained and tested on the selected thermophilic dataset. In comparison, the Support Vector Machine (SVM) achieved an accuracy of over 92% under the jackknife test, and the other evaluation metrics also indicated that the SVM performed best.
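The jackknife test mentioned above is leave-one-out validation: each sample is held out once, the model is trained on the rest, and the held-out sample is predicted. The sketch below shows only that protocol. To stay self-contained it uses a simple nearest-centroid classifier as a stand-in for the SVM, and the toy data and function names are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (not an SVM): predict the label whose
    class centroid is closest to x in squared Euclidean distance."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        dist = sum((s / counts[label] - xj) ** 2 for s, xj in zip(acc, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up one-dimensional data: label 1 = thermophilic, 0 = mesophilic.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
accuracy = jackknife_accuracy(X, y)
```

Any classifier with the same `predict(train_X, train_y, x)` shape can be passed in, which is how different algorithms could be compared under the same protocol.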

Conclusion: Because it combines an effective feature representation with a robust classifier, the proposed method is suitable for predicting thermophilic proteins and outperforms most reported methods.

Keywords: Thermophilic proteins, feature fusion, g-gap, entropy density, autocorrelation coefficient, KPCA, machine learning.




DOI https://dx.doi.org/10.2174/1574893615666200207094357	Print ISSN 1574-8936
Publisher Name Bentham Science Publisher	Online ISSN 2212-392X

Current Bioinformatics
