Background: Since their introduction in the virtual screening field, Receiver Operating
Characteristic (ROC) curve-derived metrics have been widely used for benchmarking of computational
methods and algorithms intended for virtual screening applications. Whereas in classification problems
the relationship between sensitivity and specificity for a given score value is very informative, a practical
concern in virtual screening campaigns is to predict the actual probability that a predicted hit will
prove truly active when submitted to experimental testing (in other words, the Positive Predictive Value,
PPV). Estimation of this probability is, however, hampered by its dependence on the yield of
actives in the screened library, which cannot be known a priori.
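The dependence of PPV on the yield of actives follows from Bayes' theorem. As an illustrative sketch, writing Se for sensitivity, Sp for specificity and Ya for the yield of actives in the screened library (symbols chosen here for illustration, not taken from the source), the relation can be written as:

```latex
\[
\mathrm{PPV} = \frac{Se \cdot Ya}{Se \cdot Ya + (1 - Sp)\,(1 - Ya)}
\]
```

For a fixed Se and Sp, PPV therefore falls as Ya decreases, which is why a model with a good ROC curve may still return mostly false positives on a library with few actives.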
Objective: To explore the use of PPV surfaces derived from simulated ranking experiments (retrospective
virtual screening) as a complementary tool to ROC curves, for both benchmarking and optimization
of score cutoff values.
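A minimal sketch of how such a surface might be computed is given below, assuming that sensitivity and specificity are estimated at each score cutoff from a retrospective screen and then combined with a range of hypothetical yields of actives through Bayes' theorem. The function name `ppv_surface`, its parameters and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def ppv_surface(scores, labels, cutoffs, yields):
    """Return a (len(cutoffs), len(yields)) array of PPV values.

    scores  : model scores, higher meaning "more likely active" (assumption)
    labels  : 1 for known actives, 0 for known inactives or decoys
    cutoffs : score thresholds to evaluate
    yields  : hypothetical fractions of actives (Ya) in a screened library
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cutoffs = np.asarray(cutoffs, dtype=float)
    yields = np.asarray(yields, dtype=float)

    # Predicted-hit matrix: one row per cutoff, one column per compound.
    predicted = scores[None, :] >= cutoffs[:, None]

    # Sensitivity and specificity at each cutoff in the retrospective screen.
    se = (predicted & labels).sum(axis=1) / labels.sum()
    sp = (~predicted & ~labels).sum(axis=1) / (~labels).sum()

    # Bayes' theorem, broadcast over cutoffs (rows) and yields (columns).
    tp = se[:, None] * yields[None, :]
    fp = (1.0 - sp)[:, None] * (1.0 - yields)[None, :]
    denom = tp + fp

    # PPV is undefined (NaN) where no compound would be predicted as a hit.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, tp / denom, np.nan)


# Toy retrospective screen (made-up numbers, for illustration only).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
cutoffs = [0.5, 0.75]
yields = [0.01, 0.5]
surface = ppv_surface(scores, labels, cutoffs, yields)
```

Each row of `surface` corresponds to one score cutoff and each column to one assumed yield of actives, so the array can be passed directly to a surface or heat-map plotting routine to inspect how the choice of cutoff affects PPV across plausible library compositions.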
Methods: The utility of the proposed approach is assessed in retrospective virtual screening experiments
with four datasets used to infer QSAR classifiers: inhibitors of Trypanosoma cruzi trypanothione
synthetase; inhibitors of Trypanosoma brucei N-myristoyltransferase; inhibitors of GABA transaminase;
and anticonvulsant activity in the 6 Hz seizure model.
Results: Besides illustrating the utility of PPV surfaces to compare the performance of machine learning
models for virtual screening applications and to select an adequate score threshold, our results also
suggest that ensemble learning provides models with better predictivity and more robust behavior.
Conclusion: PPV surfaces are valuable for assessing virtual screening methods and for choosing score thresholds
to be applied in prospective in silico screens. Ensemble learning approaches seem to consistently
lead to improved predictivity and robustness.