Background: Unbalanced data is a well-known and common problem in many practical
applications of machine learning, with a marked effect on the performance of standard classifiers.
Taking into account the enormous growth of biomedical literature publicly available over the Internet, one
relevant task for the biomedical community is the automatic classification of relevant documents for
Objective: Focusing on this topic, the objective of this work is two-fold: to evaluate alternative
strategies for alleviating the class imbalance problem, including a novel approach denoted as LITl
(Limited Iterative Tomek links), and to analyse the true impact of unbalanced data on the
accurate triage of biomedical documents.
Method: Different strategies are applied and evaluated on a standard corpus of Medline documents
where each entry is represented by a set of MeSH terms.
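
To make the underlying idea concrete, the following is a minimal sketch of plain Tomek link detection, not of the LITl variant itself, whose details are not given in this abstract. It assumes each document is a set of MeSH terms, uses Jaccard distance as an illustrative choice of metric, and treats a Tomek link as a pair of mutual nearest neighbours with different class labels. All names and the toy data are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard distance between two term sets (assumed metric, 0 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbour(i: int, docs: Sequence[FrozenSet[str]]) -> int:
    """Index of the document closest to docs[i]; ties go to the lowest index."""
    best_j, best_d = -1, float("inf")
    for j, other in enumerate(docs):
        if j == i:
            continue
        d = jaccard_distance(docs[i], other)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(docs: Sequence[FrozenSet[str]],
                labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    nn = [nearest_neighbour(i, docs) for i in range(len(docs))]
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy corpus (hypothetical): 1 = relevant (minority), 0 = non-relevant (majority).
example_docs = [
    frozenset({"Humans", "Neoplasms"}),
    frozenset({"Humans", "Neoplasms", "Mice"}),
    frozenset({"Mice", "Rats"}),
    frozenset({"Mice", "Rats", "Animals"}),
]
example_labels = [1, 0, 0, 0]
example_links = tomek_links(example_docs, example_labels)
```

In a typical balancing step, the majority-class member of each detected link would be removed from the training set; how LITl limits or iterates this procedure is not specified here.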
Results: The experiments demonstrate the real effect of class imbalance on
popular classifiers such as kNN, Naive Bayes, SVM and C4.5, and show how their performance can be
improved by using appropriate balancing strategies.
Conclusion: Naive Bayes is the classifier that suffers least from an imbalanced scenario comprising
Medline documents. Moreover, we demonstrate that the performance of a given balancing
strategy depends largely on the selected classifier. In this sense, the classifiers best suited to
work with our LITl approach are kNN and C4.5.