Background: Metabolomics is a relatively new and increasingly prominent branch of bioinformatics.
Metabolite expression levels shape the phenotypic characteristics of an organism. Breast cancer is
the most common cancer in women worldwide, accounting for 25% of all cases. In 2012, it was
responsible for 1.68 million cases and 522,000 deaths. Therefore, both for drug discovery and for
early prediction of disease status, identifying metabolites significantly associated with breast
cancer and correctly classifying breast cancer status are important tasks in metabolomics data
analysis.
Objective: The main objective of this paper is to identify significant metabolites (p-value < 0.05) and
a state-of-the-art classification technique for breast cancer prediction from a metabolomics dataset.
Methods: Although several techniques exist for identifying significant metabolites, we used Student's
t-test and the Kruskal-Wallis test. To predict breast cancer status, we considered five modern
classification techniques: (i) Naive Bayes (NB), (ii) Support Vector Machine (SVM), (iii) Linear
Discriminant Analysis (LDA), (iv) k-nearest neighbors (kNN), and (v) Random Forest (RF). We measured
the performance of each classification technique using measures including accuracy, sensitivity,
specificity, the Receiver Operating Characteristic (ROC) curve, and the area under the ROC curve.
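As a rough illustration of the performance measures named above, the following is a minimal Python sketch of how accuracy, sensitivity, specificity, and the area under the ROC curve can be computed for a binary classifier. The function names, the toy labels and scores, and the 0.5 decision threshold are all hypothetical and are not taken from the study; the sketch also assumes that both classes are present in the data.

```python
# Illustrative sketch only: toy data, not results from the study.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def summary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = case, 0 = control) and classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = summary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise form of the AUC is used here because it needs no plotting or numerical integration; it is equivalent to the area under the empirical ROC curve.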
Results: The performance measures showed that the random forest classifier produced higher accuracy,
sensitivity, specificity, and area under the ROC curve than the other classification techniques for
breast cancer prediction using the metabolomics dataset. The analytical results also showed that 24
metabolites are significantly (adjusted p-value < 0.05) associated with breast cancer.
Conclusion: On the basis of the experimental results, we conclude that 24 metabolites influence breast
cancer and that, for breast cancer prediction as well as metabolomics data analysis, random forest is
the state-of-the-art, outlier-robust classifier among the five classification techniques considered.