Background: Malignant mesothelioma (MM) is a rare but aggressive tumor that most often arises in the pleura, the lining of the lungs. Its diagnosis commonly relies on costly imaging and laboratory resources, e.g., X-ray imaging, Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET) scans, biopsies, and blood tests. These diagnostic measures are expensive and often unavailable in remote areas, and some of them, e.g., biopsy and cytology of pleural fluid, are also very painful for the patient.
Objective: In this study, we propose a diagnostic model for the early identification of MM using machine learning techniques. We analyzed the health records of 324 Turkish patients who showed symptoms related to MM. The patient data include socio-economic, geographical, and clinical features.
Methods: Several feature selection methods were used to identify significant features. To address the class imbalance problem, various data-level resampling techniques were applied. The diagnostic model was built with the Gradient Boosted Decision Tree (GBDT) method, and its performance was compared with that of traditional machine learning algorithms.
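As an illustration of data-level oversampling, the following is a minimal sketch of the idea behind SMOTE (the technique named in the Results): each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. This is not the implementation used in the study; the function name `smote_sketch`, its parameters, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority points.

    Hypothetical helper for illustration; not the study's implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (base excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point lies on a line segment between two existing minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating records exactly, which is the main difference from plain random oversampling.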
Results: Our model outperformed the other models on both balanced and imbalanced data. Based on accuracy and Receiver Operating Characteristic (ROC) value, undersampling techniques were outperformed by the imbalanced data without any resampling. Conversely, oversampling techniques outperformed both undersampling and the imbalanced data on the same measures. All classifiers employed in this study achieved good results with three feature selection methods (OneR, information gain, and Relief-F), whereas the results with the other two methods (gain ratio and correlation) were less promising. Finally, the combination of the Synthetic Minority Oversampling Technique (SMOTE) and OneR with GBDT gave the most favorable results in terms of accuracy, F-measure, and ROC.
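For readers unfamiliar with OneR-based feature ranking, a minimal sketch follows. It assumes categorical features: for each feature, a one-level rule predicts the majority class for every feature value, and the feature is scored by that rule's training accuracy. The function names, feature names, and toy records are hypothetical and are not taken from the study's data.

```python
from collections import Counter, defaultdict


def oner_accuracy(values, labels):
    """Training accuracy of the one-rule classifier built on a single feature."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    # For each feature value, the rule predicts the majority class.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)


def rank_features(rows, labels, names):
    """Rank features from best to worst by their OneR training accuracy."""
    scores = {
        name: oner_accuracy([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy records with two hypothetical categorical features.
rows = [
    ("yes", "low"), ("yes", "high"), ("no", "low"),
    ("no", "high"), ("yes", "low"), ("no", "high"),
]
labels = [1, 1, 0, 0, 1, 0]
ranking = rank_features(rows, labels, ["feature_a", "feature_b"])
```

In this toy example `feature_a` separates the two classes perfectly and is ranked first, while `feature_b` is right for only four of the six records. Numeric features would need to be discretized before a rule of this kind could be applied.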
Conclusion: The diagnostic model has been deployed to assist doctors, patients, medical practitioners, and other healthcare professionals in the early diagnosis and better treatment of MM.