Aims: Many sequence-based descriptors for proteins have been proposed.
This study evaluates the performance of these descriptors in predicting
protein-protein interactions on a benchmark dataset.
Background: The behavior of a protein inside or outside the cell is defined by its interactions with
the elements present in its surrounding environment, which range from small metabolites to macromolecules
such as RNA, DNA, or proteins. Of these, protein-protein interactions
(PPIs) are among the most important for investigating the biological role of a protein. The interactions
of a protein are determined by how it folds in three-dimensional space, and this
folding largely depends on the linear sequence of amino acids. This dependence
makes it possible to use protein sequences to computationally predict the
possible interactions among them.
Objective: This study assesses the efficacy of various sequence-based descriptors in predicting PPIs.
Methods: We used the benchmark dataset of interacting and non-interacting
protein pairs provided by Pan et al. to build PPI prediction models using artificial neural networks.
We compared the efficacy of different descriptors on two datasets, one with
all the protein pairs and the second restricted to proteins with less than 25% sequence identity.
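As an illustration of what a sequence-based descriptor looks like in practice, the following is a minimal sketch of the conjoint-triad encoding. It assumes the commonly described form of the method: residues are mapped to seven classes, every window of three consecutive residues is counted as one of 7 x 7 x 7 = 343 triads, and the counts are rescaled as (f - min) / max. The class grouping, the normalization, the concatenation of the two protein vectors into one pair vector, and the function names `conjoint_triad` and `pair_features` are assumptions made for this example and are not confirmed by the study.

```python
from itertools import product

# Seven residue classes commonly used for the conjoint triad (assumed grouping).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# Map every ordered triple of classes to a position in the 343-long vector.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}


def conjoint_triad(sequence):
    """Return a 343-dimensional conjoint-triad vector for one protein sequence."""
    counts = [0] * len(TRIAD_INDEX)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three residues; skip windows with unknown residues.
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        if a is None or b is None or c is None:
            continue
        counts[TRIAD_INDEX[(a, b, c)]] += 1
    lo, hi = min(counts), max(counts)
    if hi == 0:
        # Sequence too short or no valid triad: return an all-zero vector.
        return [0.0] * len(counts)
    # Assumed normalization: (f - min) / max, to reduce the effect of length.
    return [(n - lo) / hi for n in counts]


def pair_features(seq_a, seq_b):
    """Concatenate both descriptors into one 686-dimensional pair vector (assumed)."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)
```

A pair vector built this way could serve as the input to a classifier such as a neural network, with interacting pairs labeled 1 and non-interacting pairs labeled 0.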
Result: Conjoint-triad descriptors performed better than the other descriptors in
predicting PPIs. Feature selection was then performed on the conjoint-triad descriptor, and the
prediction model built on the reduced feature set was compared with the model built on all features.
Conclusion: The classification model with conjoint-triad descriptors obtained the highest accuracy.
The features of the conjoint-triad descriptor were ranked, and model performance
was compared between all features and the selected features. The model with reduced features shows less overfitting.