Characterization of cancer-related genes is important and challenging in both biomedicine and computational biology. As one of the leading causes of cancer mortality worldwide, lung cancer accounts for over one million deaths each year. Lung cancer is generally classified as either small-cell lung cancer (SCLC) or non-small-cell lung cancer (NSCLC). Although great advances have been made in lung cancer detection and treatment, the 5-year survival rate of patients remains below 15%. Hence, it is very important to identify all potential lung cancer-related genes as well as their interaction networks. In this study, we present a novel computational framework based on the support vector machine (SVM) to predict lung cancer-related genes. We retrieved 59 NSCLC-related genes and 89 SCLC-related genes from KEGG pathways, and randomly selected 2950 non-NSCLC and 4450 non-SCLC genes from the Ensembl database. Ten datasets were constructed by dividing the genes into 10 groups. Each gene was encoded as a 13,126-dimensional vector comprising 12,887 Gene Ontology (GO) enrichment scores and 239 KEGG enrichment scores. A feature extraction strategy was applied to obtain the optimal feature sets: 400 GO terms and 47 KEGG pathways for NSCLC, and 458 GO terms and 27 KEGG pathways for SCLC. Further feature analysis showed that these optimal features are actively involved in lung tumorigenesis. This analysis also confirms that our method is an effective tool for predicting cancer-related genes and has the potential to be applied more widely to predicting genes related to other types of cancer.
Keywords: Non-small-cell lung cancer (NSCLC), small-cell lung cancer (SCLC), Gene Ontology (GO), KEGG pathways, support vector machine (SVM), maximum relevance minimum redundancy (mRMR), incremental feature selection (IFS).
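As a reading aid, incremental feature selection (IFS) as it is commonly described can be sketched in a few lines: given a list of features already ranked by relevance (for example by mRMR), every top-k prefix of the list is scored and the best-scoring prefix is kept. The sketch below is an illustration under stated assumptions, not the implementation used in this study. The function name, the `evaluate` callable (standing in for a cross-validated SVM score), the toy scoring rule, and the placeholder feature names are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> Tuple[List[str], float]:
    """Return the top-k prefix of ranked_features with the highest score.

    ranked_features: features ordered from most to least relevant.
    evaluate: hypothetical scoring callable; in practice this could be the
        cross-validated performance of a classifier trained on the subset.
    step: how many features to add between successive evaluations.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        # Strict '>' keeps the smaller subset when two scores tie.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy stand-in for a classifier score (assumption, for illustration only):
# each informative feature adds 1, and every feature costs 0.1.
INFORMATIVE = {"GO:A", "GO:B", "KEGG:X"}  # placeholder names, not real IDs


def toy_score(subset: List[str]) -> float:
    hits = sum(1 for feature in subset if feature in INFORMATIVE)
    return hits - 0.1 * len(subset)


ranked = ["GO:A", "KEGG:X", "GO:C", "GO:B", "GO:D"]
selected, score = incremental_feature_selection(ranked, toy_score)
print(selected, round(score, 2))
```

Passing the scorer as a callable keeps the selection loop independent of any particular classifier, so the same loop could wrap an SVM evaluated on each of the 10 datasets.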