
Current Bioinformatics


ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Heterogeneous Gene Expression Cross-Evaluation of Robust Biomarkers Using Machine Learning Techniques Applied to Lung Cancer

Author(s): Javier Bajo-Morales*, Juan Manuel Galvez, Juan Carlos Prieto-Prieto, Luis Javier Herrera, Ignacio Rojas and Daniel Castillo-Secilla

Volume 17, Issue 2, 2022

Published on: 12 January, 2022

Pages: 150-163 (14 pages)

DOI: 10.2174/1574893616666211005114934



Background: Gene expression analysis is one of the most promising approaches for uncovering the mechanisms that underlie the development and spread of cancer. Next Generation Sequencing technologies, such as RNA-Seq, currently lead the market because of their precision and cost. Nevertheless, a large amount of unanalyzed data obtained with older technologies, such as microarrays, could still yield relevant knowledge.

Methods: This work describes and implements a complete machine learning methodology to cross-evaluate the compatibility of two transcriptomic technologies, RNA-Seq and microarray. To show a real application of the pipeline, a lung cancer case study is addressed that considers two subtypes: adenocarcinoma and squamous cell carcinoma. The transcriptomic datasets were obtained from the public repositories NCBI/GEO, ArrayExpress and GDC-Portal. Several gene expression experiments were carried out on them to find gene signatures for these lung cancer subtypes that are linked to both transcriptomic technologies. Using the selected differentially expressed genes (DEGs), predictive models capable of classifying new samples of these cancer subtypes were developed.
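The cross-technology idea in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy expression values, the gene-ranking rule (absolute difference of class means as a stand-in for DEG selection), the per-platform z-scoring, and the k-nearest-neighbours classifier are all assumptions chosen for illustration, as are the function names and the labels `ACC` (adenocarcinoma) and `SCC` (squamous cell carcinoma).

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each gene (column) within a single platform."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def top_genes_by_mean_difference(rows, labels, k):
    """Rank genes by |mean(class A) - mean(class B)|; a simple stand-in for DEG selection."""
    class_a, class_b = sorted(set(labels))
    scores = []
    for g in range(len(rows[0])):
        a = mean(r[g] for r, y in zip(rows, labels) if y == class_a)
        b = mean(r[g] for r, y in zip(rows, labels) if y == class_b)
        scores.append((abs(a - b), g))
    return sorted(g for _, g in sorted(scores, reverse=True)[:k])


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training samples closest to x (Euclidean distance)."""
    nearest = sorted(zip((math.dist(t, x) for t in train_x), train_y))[:k]
    votes = [y for _, y in nearest]
    return max(sorted(set(votes)), key=votes.count)


# Hypothetical toy data: rows are samples, columns are 4 genes.
microarray = [
    [6.0, 9.1, 7.0, 5.2], [6.3, 8.8, 7.4, 5.0], [5.8, 9.4, 6.9, 5.1],  # ACC
    [9.2, 6.1, 7.1, 5.3], [8.9, 6.4, 7.3, 4.9], [9.5, 5.9, 7.0, 5.2],  # SCC
]
microarray_labels = ["ACC", "ACC", "ACC", "SCC", "SCC", "SCC"]

rnaseq = [
    [2.1, 11.0, 8.0, 3.0], [2.5, 10.4, 8.3, 3.2],  # ACC
    [10.8, 3.0, 8.1, 2.9], [11.2, 2.6, 7.9, 3.1],  # SCC
]
rnaseq_labels = ["ACC", "ACC", "SCC", "SCC"]

# 1) Select genes on the training technology only.
selected = top_genes_by_mean_difference(microarray, microarray_labels, k=2)

# 2) Put each technology on a comparable scale (z-score within platform).
train = [[row[g] for g in selected] for row in zscore_columns(microarray)]
test = [[row[g] for g in selected] for row in zscore_columns(rnaseq)]

# 3) Train on microarray samples, predict the RNA-Seq samples.
predictions = [knn_predict(train, microarray_labels, x) for x in test]
print(selected, predictions)
```

Standardizing the test platform with its own statistics assumes that both subtypes are reasonably represented in it; with real data, a dedicated normalization or batch-effect correction step would normally take its place.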

Results: Predictive models built with data from one technology can discriminate samples obtained with the other. Classification results are evaluated in terms of accuracy, F1-score, and ROC curves with their AUC. Finally, the biological information on the resulting gene sets and their relationship with lung cancer is reviewed, finding strong biological evidence linking them to the disease.
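For readers unfamiliar with the three metrics named above, the following sketch computes them for a two-class problem. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, not results from the article; AUC is computed with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores, positive):
    """Probability that a random positive sample scores higher than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier output: score = estimated probability of "SCC".
y_true = ["SCC", "SCC", "SCC", "ACC", "ACC", "ACC"]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = ["SCC" if s >= 0.5 else "ACC" for s in scores]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred, positive="SCC")
auc = roc_auc(y_true, scores, positive="SCC")
print(f"accuracy={acc:.3f} F1={f1:.3f} AUC={auc:.3f}")
```

Accuracy and F1 depend on the chosen threshold, whereas AUC summarizes the ranking quality of the scores across all thresholds, which is why the three are usually reported together.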

Conclusion: Our method can find strong gene signatures that are independent of the transcriptomic technology used for the analysis. In addition, our article highlights the potential of heterogeneous transcriptomic data to increase the number of samples available for a study and thereby the statistical significance of the results.

Keywords: Lung cancer, microarray, RNA-Seq, gene expression, machine learning, feature selection, CDSS.

