Title: Mining Gene Expression Profile with Missing Values: An Integration of Kernel PCA and Robust Singular Values Decomposition
VOLUME: 14 ISSUE: 1
Author(s): Md. Saimul Islam, Md. Aminul Hoque*, Md. Sahidul Islam, Mohammad Ali, Md. Bipul Hossen, Md. Binyamin, Amir Feisal Merican, Kohei Akazawa, Nishith Kumar and Masahiro Sugimoto
Affiliation: Department of Statistics, University of Rajshahi, Rajshahi-6205; Department of Statistics, University of Rajshahi, Rajshahi-6205; Department of Statistics, University of Rajshahi, Rajshahi-6205; Statistics Discipline, Khulna University, Khulna-9208; Department of Statistics, Begum Rokeya University, Rangpur-5400; Department of Statistics, Mawlana Bhashani Science and Technology University, Santosh, Tangail-1902; Institute of Biological Sciences, Faculty of Science and Centre of Research for Computational Sciences & Informatics for Biology, Bioindustry, Environment, Agriculture, and Healthcare (CRYSTAL), University of Malaya, Kuala Lumpur-50603; Department of Medical Informatics, Niigata University Medical and Dental Hospital, Asahimachidori 1-754, Niigata 951-8520; Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj; Department of Statistics, University of Rajshahi, Rajshahi-6205
Keywords: Gene expression profile, simulation, GE biplot, kernel principal component analysis, singular value decomposition.
Abstract: Background: Gene expression profiling and transcriptomics provide valuable information
about the role of genes that are differentially expressed between two or more samples. Analysing
high-throughput DNA microarray data that contain missing values under various experimental
conditions is both important and challenging.
Objectives: Graphical data visualizations of the expression of all genes in a particular cell provide
holistic views of gene expression patterns, which improve our understanding of cellular systems under
normal and pathological conditions. However, current visualization methods are sensitive to missing
values, which are frequently observed in microarray-based gene expression profiling, potentially
affecting the subsequent statistical analyses.
Methods: This study addressed the problem of missing values by comparing different imputation
methods through the gene expression biplot (GE biplot), one of the most popular gene visualization
techniques. The effect of missing values on mining differentially expressed genes in gene expression
data was evaluated using four well-known imputation methods: Robust Singular Value Decomposition
(Robust SVD), Column Average (CA), Column Median (CM), and K-nearest Neighbors (KNN).
The Frobenius norm and absolute distances were used to measure the accuracy of the methods.
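As an illustration only, the two simplest imputation methods named above (CA and CM) and the two accuracy measures can be sketched as follows. The abstract does not give formulas, so this sketch assumes that the Frobenius norm is taken over the difference between the complete and the imputed matrix, and that the "absolute distance" is the sum of absolute entrywise differences. The function names and the toy matrix are hypothetical and are not taken from the study.

```python
import math
import statistics


def impute_columns(matrix, how="mean"):
    """Fill None entries with the mean (CA) or median (CM) of the observed
    values in the same column. Assumes every column has at least one
    observed value; otherwise statistics raises StatisticsError."""
    summarize = statistics.mean if how == "mean" else statistics.median
    n_cols = len(matrix[0])
    fills = [
        summarize([row[j] for row in matrix if row[j] is not None])
        for j in range(n_cols)
    ]
    return [
        [fills[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]


def frobenius_distance(a, b):
    """Frobenius norm of (a - b): square root of the sum of squared differences."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def absolute_distance(a, b):
    """Assumed definition: sum of absolute entrywise differences."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical toy data: a complete matrix and a copy with one entry removed.
complete = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 40.0]]
with_missing = [[None, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 40.0]]

ca = impute_columns(with_missing, how="mean")    # column average
cm = impute_columns(with_missing, how="median")  # column median

print(frobenius_distance(complete, ca), frobenius_distance(complete, cm))
print(absolute_distance(complete, ca), absolute_distance(complete, cm))
```

In this toy case the large value 10.0 pulls the column average away from the removed entry, so the median fill lies closer to the true value. Robust SVD and KNN imputation are not shown here.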
Results: Three numerical experiments were performed, using (i) simulated data, (ii) publicly available
colon cancer data, and (iii) leukemia data, to analyze the performance of each method. The results showed
that CM and KNN performed better than Robust SVD and CA for identifying the index gene profile in the
biplot visualization, both in the simulation study and in the colon cancer and leukemia microarray datasets.
Conclusion: The impact of missing values on the GE biplot was smaller when the data matrix was
imputed by KNN than by CM. This study concluded that KNN performed satisfactorily in generating a
GE biplot in the presence of missing values in microarray data.