MK-FSVM-SVDD: A Multiple Kernel-based Fuzzy SVM Model for Predicting DNA-binding Proteins via Support Vector Data Description

Yi Zou; Hongjie Wu; Xiaoyi Guo; Li Peng; Yijie Ding; Jijun Tang; Fei Guo

Abstract

Background: Detecting DNA-binding proteins (DBPs) with biological and chemical methods is time-consuming and expensive.

Objective: In recent years, the rise of computational biology methods based on Machine Learning (ML) has greatly improved the detection efficiency of DBPs.

Methods: In this study, the Multiple Kernel-based Fuzzy SVM Model with Support Vector Data Description (MK-FSVM-SVDD) is proposed to predict DBPs. First, six features are extracted from the protein sequence. Second, multiple kernels are constructed from these sequence features. The kernels are then integrated by Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Next, fuzzy membership scores of the training samples are calculated with Support Vector Data Description (SVDD). Finally, a fuzzy SVM (FSVM) is trained and used to detect new DBPs.
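The kernel-combination and membership steps can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: the data are random toy features in three hypothetical views, every kernel is Gaussian, the kernel weights are set proportional to each kernel's centered alignment with the ideal kernel yy^T (a simple heuristic, not necessarily the optimization used in CKA-MKL), and the membership score is a distance-to-class-mean stand-in rather than a true SVDD solution. All function names are illustrative.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center(K):
    """Center a kernel matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment(K1, K2):
    """Centered kernel alignment (normalized Frobenius inner product)."""
    K1c, K2c = center(K1), center(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def combine_kernels(kernels, y):
    """Assumed heuristic: weight each kernel by its alignment with yy^T."""
    target = np.outer(y, y)
    a = np.array([max(alignment(K, target), 0.0) for K in kernels])
    w = a / a.sum()
    return sum(wi * Ki for wi, Ki in zip(w, kernels)), w

def center_distance_membership(K, y, floor=0.1):
    """Stand-in for SVDD: samples far from their class mean in kernel
    space get lower membership, bounded below by `floor`."""
    s = np.empty(len(y))
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Kc = K[np.ix_(idx, idx)]
        d2 = np.diag(Kc) - 2.0 * Kc.mean(axis=1) + Kc.mean()
        d = np.sqrt(np.maximum(d2, 0.0))
        s[idx] = 1.0 - (1.0 - floor) * d / (d.max() + 1e-12)
    return s

# Toy data: 40 samples, three hypothetical feature views, labels in {+1, -1}.
rng = np.random.default_rng(0)
y = np.array([1] * 20 + [-1] * 20)
views = [rng.normal(size=(40, 5)) + 0.8 * y[:, None] for _ in range(3)]
kernels = [rbf_kernel(V, gamma=0.1) for V in views]
K, weights = combine_kernels(kernels, y)
membership = center_distance_membership(K, y)
```

The combined matrix `K` and the scores in `membership` could then be handed to any SVM implementation that accepts a precomputed kernel and per-sample weights, which is one common way to realize a fuzzy SVM.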

Results: Our model is evaluated on several benchmark datasets. Compared with other methods, MK-FSVM-SVDD achieves the best Matthews Correlation Coefficient (MCC) on PDB186 (0.7250) and PDB2272 (0.5476).
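For readers unfamiliar with the metric, MCC is computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the counts in the example call are hypothetical and are not results from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts, chosen only to illustrate the call.
example_score = mcc(tp=80, tn=85, fp=15, fn=20)
```

MCC ranges from -1 to 1, with 1 for perfect prediction and 0 for chance-level prediction, which makes it more informative than accuracy when the two classes are unbalanced.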

Conclusion: MK-FSVM-SVDD is more suitable than a common SVM as the classifier for identifying DNA-binding proteins.

Keywords: DNA-binding proteins, fuzzy support vector machine, multiple kernel learning, support vector data description, membership function.




Journal: Current Bioinformatics
Publisher Name: Bentham Science Publisher
DOI: https://dx.doi.org/10.2174/1574893615999200607173829
Print ISSN: 1574-8936
Online ISSN: 2212-392X
