Unbalanced data is a well-known and common problem in many practical applications of machine learning, with a marked effect on the performance of standard classifiers. Given the enormous growth of biomedical literature publicly available on the Internet, one relevant task for the biomedical community is the automatic classification of documents that are relevant for further research. Focusing on this topic, the objective of this work is twofold: to evaluate alternative strategies for alleviating the class imbalance problem, and to analyse the true impact of unbalanced data on the accurate triage of biomedical documents. Different strategies are applied and evaluated on a standard corpus of Medline documents in which each entry is represented by a set of MeSH terms. The experimental results demonstrate the real effect of class imbalance on popular classifiers such as kNN, Naive Bayes, SVM and C4.5, and show how their performance can be improved by using appropriate strategies.
Keywords: Unbalanced data, document triage, Medline manuscripts, MeSH terms, LITl algorithm