Objective: Small GTPases are important molecular switches that play key roles in
numerous signal transduction pathways. The aim of this study was to explore features for the binary
classification of small GTPases with machine learning algorithms.
Methods: Sequences of small GTPases and non-small GTPases were clustered separately to remove
similar entries. They were then divided into 10 datasets, each containing equal numbers of
small GTPase and non-small GTPase entries. Three feature vectors were extracted from each dataset:
188-dimensional (188D), 400D, and motif-based (608D) features. Classification was then performed with
the scikit-learn-based easy-classify.py software, which integrates 12 classifiers. Finally,
conserved motifs were discovered with the MEME suite.
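To make the feature-extraction step concrete, the following is a minimal sketch of one way a 400-dimensional vector can be computed from a protein sequence. It assumes that the 400D features are dipeptide composition frequencies (20 x 20 amino acid pairs); the abstract does not state this definition, so the encoding, the function name `dipeptide_composition`, and the handling of non-standard residues are illustrative assumptions rather than the authors' procedure.

```python
from itertools import product

# The 20 standard amino acids; any other residue code is skipped (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20 x 20 = 400 ordered amino acid pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide frequencies for one sequence.

    Hypothetical helper: assumes the 400D feature is dipeptide composition.
    """
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    # Slide a window of width 2 over the sequence.
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:  # ignore pairs containing non-standard residues
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    # Normalize counts so that the frequencies sum to 1.
    return [count / total for count in counts]
```

Each sequence in a dataset would be mapped to one such vector, and the resulting matrix, together with the class labels, could then be passed to any of the classifiers.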
Results: The three best-performing classifiers were logistic regression (LR), gradient boosting decision
tree (GBDT), and bagging for the 400D features; LibSVM, GBDT, and bagging for the 188D features; and
GBDT, bagging, and AdaBoost for the 608D features. Overall, across the commonly used evaluation
indices, the top four classifiers were GBDT, bagging, LR, and AdaBoost. GBDT obtained the
highest area under the curve (AUC) value, 88.61%. The 400D features performed better than the 188D
and 608D ones. Five conserved G-box motifs were discovered in the sequences of human small
Conclusion: This study provides the first report that the GBDT algorithm performed best for small