Predicting Thermophilic Proteins by Machine Learning

Author(s): Xian-Fang Wang*, Peng Gao, Yi-Feng Liu, Hong-Fei Li, Fan Lu

Journal Name: Current Bioinformatics

Volume 15, Issue 5, 2020




Abstract:

Background: Thermophilic proteins maintain good activity at high temperatures, so studying them is important for understanding the thermal stability of proteins.

Objective: To address the low precision and low efficiency of thermophilic protein prediction, this paper proposes a prediction method based on feature fusion and machine learning.

Methods: For the selected thermophilic data sets, each protein sequence was first characterized by feature fusion, combining g-gap dipeptide, entropy density, and autocorrelation coefficient features. Kernel Principal Component Analysis (KPCA) was then used to reduce the dimensionality of the fused sequence features, which shortens training time and improves efficiency. Finally, a classification model was built using a machine learning classification algorithm.
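As an illustration of the first feature type named above, the following is a minimal sketch of g-gap dipeptide features, assuming the common definition in which residue pairs separated by g intervening positions are counted and normalized by the number of valid pairs. The function name, the handling of non-standard residues, and the example sequence are assumptions for illustration and are not taken from the paper.

```python
from itertools import product

# The 20 standard amino acids and the 400 ordered residue pairs built from them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g=0):
    """Return 400 normalized frequencies of residue pairs separated by g residues.

    g=0 gives the ordinary adjacent dipeptide composition. Pairs containing a
    non-standard residue are skipped (an assumption, not the paper's rule).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical example sequence, used only to show the call.
features = g_gap_dipeptide_composition("MKVLAAGIVK", g=1)
```

Each sequence maps to a fixed-length 400-dimensional vector regardless of its length, which is what allows such features to be concatenated with other descriptors and passed to a dimensionality reduction step.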

Results: A variety of classification algorithms were trained and tested on the selected thermophilic dataset. In this comparison, the Support Vector Machine (SVM) achieved an accuracy of over 92% under the jackknife test. The other evaluation indicators also showed that the SVM performed best.
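The jackknife (leave-one-out) test mentioned above can be sketched as follows. This is an illustrative outline only: a simple nearest-centroid rule stands in for the SVM, the toy data are invented, and the metrics shown (accuracy, sensitivity, specificity, Matthews correlation coefficient) are commonly used indicators that the abstract does not enumerate.

```python
import math


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the paper's SVM): label of the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = math.dist(centroid, x)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, fit on the remainder, and predict the held-out one."""
    preds = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        preds.append(predict(train_X, train_y, X[i]))
    return preds


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, and MCC from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Invented toy data: label 1 plays the role of "thermophilic", 0 of "mesophilic".
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
y = [0, 0, 0, 1, 1, 1]
predictions = jackknife(X, y)
metrics = binary_metrics(y, predictions)
```

Because every sample is held out exactly once, the jackknife result is deterministic for a given dataset and classifier, at the cost of training one model per sample.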

Conclusion: Because it combines an effective feature representation method with a robust classifier, the proposed method is suitable for predicting thermophilic proteins and is superior to most reported methods.

Keywords: Thermophilic proteins, feature fusion, g-gap, entropy density, autocorrelation coefficient, KPCA, machine learning.




Article Details

VOLUME: 15
ISSUE: 5
Year: 2020
Published on: 06 February, 2020
Page: [493 - 502]
Pages: 10
DOI: 10.2174/1574893615666200207094357
