Background: Protein-protein interactions (PPI) play a vital role in a wide range of biological
processes, from cell-cell interactions to developmental control, in all organisms.
However, experimental identification of PPI is often laborious, time-consuming and costly compared
to computational prediction. Several computational prediction models in the literature are
based on complete training samples, but none of them deals with partial training samples.
Objective: The objective of this work was to develop an effective PPI prediction model for Arabidopsis
thaliana using partial training samples in a machine learning framework.
Methods: We proposed a computational PPI prediction model that combines a random
forest (RF) classifier with autocorrelation (AC) sequence encoding features, using a 1:2 ratio of
positive-PPI to unknown-PPI samples.
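
To make the encoding step concrete, the following is a minimal sketch of one common form of sequence autocorrelation descriptor: an autocovariance over a single physicochemical property scale. The abstract does not state which AC variant, which properties, or which lag range were used, so the Kyte-Doolittle hydropathy scale, the `max_lag` value, and the function names `autocovariance` and `encode_pair` are all assumptions chosen for illustration.

```python
# Illustrative sketch only: the property scale, lag range, and function
# names are assumptions, not the settings used in this work.

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocovariance(sequence, scale, max_lag):
    """Return one autocovariance value per lag in 1..max_lag.

    AC(lag) = 1/(n - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean),
    where x_i is the property value of residue i and n is the length.
    Non-standard residues raise KeyError in this simplified version.
    """
    values = [scale[residue] for residue in sequence]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def encode_pair(seq_a, seq_b, scale=HYDROPATHY, max_lag=3):
    """Encode a protein pair by concatenating both proteins' AC vectors."""
    return autocovariance(seq_a, scale, max_lag) + autocovariance(seq_b, scale, max_lag)
```

Each protein pair then becomes a fixed-length numeric vector regardless of sequence length, which is the form a random forest implementation (for example, scikit-learn's `RandomForestClassifier`) expects as input. In practice several property scales would typically be normalized and stacked, which this sketch omits.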
Results: The proposed prediction model produced the highest average performance
scores, with sensitivity (94.62%), AUC (0.92) and pAUC (0.189) on the training datasets
and sensitivity (88.14%), AUC (0.89) and pAUC (0.176) on the test datasets of 5-fold cross-validation,
compared to other candidate predictors based on LDA, LOGI, ADA, NB, KNN and SVM
classifiers. It also achieved the highest TPR (91.82%) and pAUC (0.174)
at FPR = 20%, with an AUC of 0.948, compared to other candidate predictors.
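
For readers unfamiliar with pAUC, the sketch below shows one way to compute a partial area under the ROC curve up to a fixed false positive rate. It assumes the unnormalized definition (the raw area between FPR = 0 and FPR = 0.2, whose maximum is 0.2), which is consistent with the magnitudes reported above but is not confirmed by the abstract. The function names `roc_points` and `partial_auc` are hypothetical.

```python
# Illustrative sketch only: assumes an unnormalized partial AUC; the
# exact definition used in this work is not stated in the abstract.

def roc_points(labels, scores):
    """Return ROC points (fpr, tpr) in order of decreasing score threshold.

    labels are 1 (positive) or 0 (negative); tied scores form one step.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample that shares this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def partial_auc(points, max_fpr=0.2):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the cut-off, then clip the segment.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Calling `partial_auc(roc_points(labels, scores), max_fpr=1.0)` gives the ordinary AUC, so the same routine covers both metrics. Restricting the area to a low FPR range focuses the comparison on the region where few false interactions are accepted.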
Conclusion: The overall performance of the developed model suggests that our proposed predictor
might be useful for elucidating the biological function of unseen PPIs among a large number of candidate
proteins in Arabidopsis thaliana.