Background: The performance of text classification techniques is
commonly affected by the characteristics and representation of the document corpus
itself. Of all the problems arising from the corpus, there are three major difficulties
that classifiers must deal with: feature selection, class imbalance,
and the size of the training set.
Objective: This paper presents a novel content-based text classifier,
called T-LHMM, that is less sensitive to the text representation and the size of the
corpus, and more efficient in terms of running time, than other classification techniques.
Method: To demonstrate this, we present a set of experiments performed on well-known biomedical text
corpora, in which we compare our classifier with k-Nearest Neighbours (k-NN) and Support Vector Machine (SVM) models.
Results and Conclusion: The experimental and statistical results show that the proposed HMM-based text
classifier is indeed less sensitive to class imbalance, corpus size and vocabulary than the
other classifiers. In addition, it is more efficient in terms of running time than the k-NN and SVM techniques.