Determining the Influence of Class Imbalance for the Triage of Biomedical Documents

Jorge Fdez-Glez; David Ruano-Ordás; José R. Méndez; Florentino Fdez-Riverola; Rosalía Laza; Reyes Pavón

Abstract

Background: Unbalanced data is a well-known and common problem in many practical applications of machine learning, with a marked effect on the performance of standard classifiers. Given the enormous growth of biomedical literature publicly available on the Internet, one important task for the biomedical community is the automatic classification of documents that are relevant for further research.

Objective: The objective of this work is two-fold: first, to evaluate alternative strategies for alleviating the class imbalance problem, including a novel approach denoted LITl (Limited Iterative Tomek links); and second, to analyse the true impact of unbalanced data on the accurate triage of biomedical documents.

Method: Different balancing strategies are applied and evaluated on a standard corpus of Medline documents, where each entry is represented by a set of MeSH terms.

Results: The experiments demonstrate the real effect of class imbalance on popular classifiers such as kNN, Naive Bayes, SVM and C4.5, and show how their performance can be improved by using appropriate balancing strategies.

Conclusion: The classifier that suffers least in an imbalanced scenario comprising Medline documents is Naive Bayes. Moreover, we demonstrated that the performance of a given balancing strategy largely depends on the selected classifier. In this sense, the classifiers best suited to work with our LITl approach are kNN and C4.5.

Keywords: Unbalanced data, document triage, Medline manuscripts, MeSH terms, LITl algorithm, Naive Bayes.
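The abstract does not describe how LITl works, so the following is only a minimal sketch of the classical Tomek link idea that its name refers to, not the authors' algorithm. A Tomek link is a pair of examples from different classes that are each other's nearest neighbour; a common under-sampling step removes the majority-class member of each such pair. The Jaccard distance over MeSH term sets, the toy documents, and all function names below are assumptions made for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Doc = FrozenSet[str]  # assumed representation: a document is a set of MeSH terms


def jaccard_distance(a: Doc, b: Doc) -> float:
    """Return 1 - |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbour(i: int, docs: Sequence[Doc]) -> int:
    """Index of the document closest to docs[i]; ties go to the lowest index."""
    best_j, best_d = -1, float("inf")
    for j, other in enumerate(docs):
        if j == i:
            continue
        d = jaccard_distance(docs[i], other)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(docs: Sequence[Doc], labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbour(i, docs) for i in range(len(docs))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(docs, labels, majority_label):
    """One under-sampling pass: drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(docs, labels):
        for k in (i, j):
            if labels[k] == majority_label:
                drop.add(k)
    keep = [k for k in range(len(docs)) if k not in drop]
    return [docs[k] for k in keep], [labels[k] for k in keep]


# Toy corpus (hypothetical): label 1 = relevant (minority), 0 = irrelevant (majority).
docs = [
    frozenset({"Humans", "Neoplasms", "Mutation"}),
    frozenset({"Humans", "Neoplasms", "Mutation", "Mice"}),
    frozenset({"Animals", "Rats"}),
    frozenset({"Animals", "Rats", "Diet"}),
]
labels = [0, 1, 0, 0]

links = tomek_links(docs, labels)
kept_docs, kept_labels = remove_majority_in_links(docs, labels, majority_label=0)
```

In this toy corpus the first two documents form the only Tomek link, so a single pass removes the irrelevant document that sits closest to the relevant one. An iterative variant would repeat the pass until some stopping condition is met; the specific limit that LITl applies is not stated in the abstract.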

DOI: https://dx.doi.org/10.2174/1574893612666170718151238
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
