Current Bioinformatics


ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Quality Control of Gene Expression Data Allows Accurate Quantification of Differentially Expressed Biological Pathways

Author(s): Ellen Reed, Enrico Ferrari and Mikhail Soloviev*

Volume 18, Issue 5, 2023

Published on: 17 April, 2023

Page: [409 - 427] Pages: 19

DOI: 10.2174/1574893618666230221141815


Background: Gene expression signatures provide a promising diagnostic tool for many diseases, including cancer. However, there remain multiple issues related to the quality of gene expression data, which may impede the analysis and interpretation of differential gene expression in cancer.

Objective: We aimed to address existing issues related to the quality of gene expression data and to devise improved quality control (QC) and expression data processing procedures.

Methods: Linear regression analysis was applied to gene expression datasets generated from diluted and pre-mixed matched breast cancer and normal breast tissue samples. Outlier data points were identified and removed, and accurate expression values for the cancer and normal tissues were recalculated.
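The abstract does not give the exact regression or outlier criteria, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that, for a single gene, the linear-scale signal of a pre-mixed sample is a straight-line function of the cancer fraction in the mix, so the fitted line evaluated at fractions 0 and 1 estimates the pure normal and pure cancer expression. The function name `fit_mixture_series`, the median/MAD outlier rule, the threshold `k`, and the example numbers are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_mixture_series(cancer_fraction, signal, k=3.0):
    """Fit signal = intercept + slope * cancer_fraction for one gene.

    Points whose residual lies more than k robust standard deviations
    (scaled MAD) from the median residual are flagged as outliers, and the
    line is refitted once without them. This rule is an illustrative
    assumption, not the published procedure.

    Returns (normal_estimate, cancer_estimate, outlier_mask).
    """
    x = np.asarray(cancer_fraction, dtype=float)
    y = np.asarray(signal, dtype=float)

    # First pass: ordinary least squares on all data points.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    # Robust scale of the residuals: median absolute deviation (MAD),
    # multiplied by 1.4826 so it is comparable to a standard deviation.
    deviation = np.abs(residuals - np.median(residuals))
    scale = 1.4826 * np.median(deviation)
    # Floor the scale so that floating-point noise is never flagged.
    scale = max(scale, 1e-9 * max(1.0, float(np.max(np.abs(y)))))
    outlier = deviation > k * scale

    # Second pass: refit without the flagged points, if enough remain.
    keep = ~outlier
    if keep.sum() >= 2 and np.ptp(x[keep]) > 0:
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
    else:
        outlier = np.zeros_like(outlier)

    normal_estimate = intercept          # fitted signal at 0% cancer
    cancer_estimate = intercept + slope  # fitted signal at 100% cancer
    return normal_estimate, cancer_estimate, outlier


if __name__ == "__main__":
    # Hypothetical mixing series for one gene; the 50% mix is a deliberate
    # technical outlier.
    fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
    signals = [102.0, 148.0, 400.0, 251.0, 299.0]
    normal, cancer, mask = fit_mixture_series(fractions, signals)
    print("outliers:", mask)
    print("normal:", normal, "cancer:", cancer)
    print("log2 fold change:", np.log2(cancer / normal))
```

Because the estimates come from a line fitted across the whole series rather than from two single measurements, one bad data point can be dropped without losing the gene. A real analysis would also need to handle probes outside the linear range of the assay, which this sketch ignores.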

Results: We achieved a 27% increase in the number of identifiable differentially regulated genes and a similar reduction in the number of false positives identified from microarray differential gene expression (DEG) data. Our approach reduced technical errors and improved the accuracy and precision with which the degree of differential expression was determined, but it did not remove biological outliers, such as genes whose expression is naturally variable. We also determined the linear dynamic range of the microarray assay directly from expression data, which allowed accurate quantification of entire differentially expressed pathways.

Conclusion: The improved QC allowed accurate discrimination of genes by their degree of upregulation, which helped to reveal an intricate and highly tuned network of biological pathways and their regulation in cancer. We were able, for the first time, to quantify the degree of transcriptional upregulation of entire individual biological pathways in breast cancer. We conclude that the vast majority of DEG data publicly available today may have been generated using sub-optimal experimental designs that lack the preparations required for genuinely accurate and quantitative analysis.

Keywords: Gene expression, microarrays, differential gene expression, biological pathways, breast cancer, quality control.

