Background and objective: Colorectal cancer (CRC) is a common malignant tumor of the digestive system; it is associated with high morbidity and mortality. However, an early prediction of colorectal adenoma (CRA) that is a precancerous disease of most CRC patients provides an opportunity to make an appropriate strategy for prevention, early diagnosis and treatment. We aimed to build a machine learning model to predict CRA that could assist physicians in classifying high-risk patients and make informed choices, prevent CRC.
Methods: We instructed patients who had undergone a colonoscopy to fill out a questionnaire at the Sixth People Hospital of Shanghai in China from July 2018 to November 2018. A classification model with the gradient boosting decision tree (GBDT) was developed to predict CRA. This model was compared with three other models, namely, random forest (RF), support vector machine (SVM), and logistic regression (LR). The area under the receiver operating characteristic curve (AUC) was used to evaluate performance of the models.
Results: Among the 245 included patients, 65 patients had CRA. The area under the receiver operating characteristic (AUCs) of GBDT, RF, SVM ,and LR with 10 fold-cross validation were 0.8131, 0.74, 0.769 and 0.763. We also built an online prediction service, CRA Inference System, to substantialize the proposed solution for patients with CRA.
Conclusion: We developed and compared four classification models for CRA prediction, and the GBDT model showed the highest performance. Implementing a GBDT model for screening can reduce the cost of time and money and help physicians identify high-risk groups for primary prevention.