Predicting enzyme subfamily class is an imbalanced multi-class classification problem, because the number of proteins differs greatly from one subfamily to another. In this paper, we focus on computational methods designed specifically for imbalanced multi-class classification and apply them to enzyme subfamily class prediction. We compare two support vector machine (SVM)-based methods for the imbalance problem: the AdaBoost algorithm with RBFSVM (SVM with an RBF kernel) as the component classifier, and SVM with an arithmetic mean (AM) offset (AM-SVM). As input features for our predictive models, we use the conjoint triad feature (CTF). We validate the two methods on an enzyme benchmark dataset that contains six enzyme main families with a total of thirty-four subfamily classes, in which each protein has less than 40% sequence identity to any other protein in the same functional class. In predicting oxidoreductase subfamilies, AM-SVM obtains a Matthews correlation coefficient (MCC) above 0.92 and an accuracy above 93%; in predicting lyase, isomerase and ligase subfamilies, it obtains an MCC above 0.73 and an accuracy above 82%. The improvement in predictive performance suggests that AM-SVM might play a complementary role to existing function annotation methods.
Keywords: Enzyme subfamily class prediction, conjoint triad feature, imbalance problem, support vector machine, Enzymes, grey theory, GalNAc-transferase, RBFSVM, AdaBoostSVM, RBF, jackknife test, PseAAC
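To make the CTF input representation concrete, the following is a minimal sketch of a conjoint triad encoding in Python. The abstract does not spell out the encoding, so everything here is an assumption based on the commonly used formulation: the seven residue classes, the 343-dimensional triad count vector, and the (count − min) / max normalization. The function name `conjoint_triad_feature` is hypothetical and is not taken from the paper.

```python
from itertools import product

# ASSUMPTION: seven residue classes as in the commonly used conjoint triad
# encoding (grouped by dipole and side-chain volume); not stated in the abstract.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# Every ordered triple of classes gets one slot: 7 * 7 * 7 = 343 slots.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}


def conjoint_triad_feature(sequence):
    """Return a 343-dimensional normalized triad frequency vector (hypothetical helper)."""
    counts = [0] * len(TRIAD_INDEX)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three consecutive residues along the sequence.
    for window in zip(classes, classes[1:], classes[2:]):
        if None in window:
            continue  # skip windows containing a non-standard residue
        counts[TRIAD_INDEX[window]] += 1
    lo, hi = min(counts), max(counts)
    if hi == 0:
        return [0.0] * len(counts)  # sequence too short or no valid triads
    # ASSUMPTION: (count - min) / max normalization to reduce length dependence.
    return [(c - lo) / hi for c in counts]
```

A vector produced this way could then be passed to any SVM implementation; the choice of classifier and of the offset adjustment used by AM-SVM is outside the scope of this sketch.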