Background: Colorectal cancer (CRC) is the third most common cancer
worldwide. Cancer discrimination is a typical application of microarray-based gene
expression analysis. However, microarray data suffer from the curse of
dimensionality and from a typically imbalanced class distribution between the majority
(tumor samples) and minority (normal samples) classes. Feature gene selection is
therefore necessary for cancer discrimination.
Objectives: To select feature genes for the discrimination of CRC.
Methods: We improve the differential evolution-based feature selection algorithm,
DEFSw, by using the RUSBoost classifier and weighted accuracy in place of the usual
classifier and evaluation measure, so that feature genes can be selected from imbalanced
data. We first extract differentially expressed genes (DEGs) from the CRC dataset of The
Cancer Genome Atlas (TCGA) and then select feature genes from the DEGs using the
improved DEFSw algorithm. Finally, we validate the selected feature gene sets on
independent datasets and retrieve cancer-related information for these genes by text
mining through the Coremine Medical online database.
Results: We select 16 single-gene feature sets for colorectal cancer discrimination
and 19 single-gene feature sets for colon cancer discrimination only.
Conclusions: In summary, we identify a series of high-potential candidate biomarkers or
signatures that can discriminate colon cancer, rectal cancer, or both with
high sensitivity and specificity.
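
Illustrative note: the Methods replace the usual evaluation measure with a weighted form of accuracy because plain accuracy is misleading when tumor samples far outnumber normal samples. The abstract does not define the measure, so the minimal sketch below assumes one common form, a weighted mean of sensitivity and specificity. The function names, the weight parameter `w`, and the toy labels are hypothetical and are not taken from the paper.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_accuracy(y_true, y_pred, w=0.5, positive=1):
    """Assumed form: w * sensitivity + (1 - w) * specificity.

    `w` is a hypothetical weight on the positive (tumor) class;
    w = 0.5 gives the familiar balanced accuracy.
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    # Guard against an empty class to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return w * sensitivity + (1 - w) * specificity


# Toy imbalanced data: 9 tumor samples (1) and 1 normal sample (0).
y_true = [1] * 9 + [0]
# A degenerate classifier that labels every sample as tumor.
y_pred = [1] * 10

acc = plain_accuracy(y_true, y_pred)
wacc = weighted_accuracy(y_true, y_pred)
print(acc, wacc)
```

On this toy data the always-tumor classifier scores well on plain accuracy yet misses every normal sample, which the weighted measure penalizes. A feature selection search that maximizes the weighted measure is therefore pushed toward genes that separate both classes.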