A New Approach of Outlier-robust Missing Value Imputation for Metabolomics Data Analysis

Author(s): Nishith Kumar*, Md. Aminul Hoque, Md. Shahjaman, S.M. Shahinul Islam, Md. Nurul Haque Mollah

Journal Name: Current Bioinformatics

Volume 14, Issue 1, 2019


Background: Metabolomics data generation and quantification differ from those of other types of molecular “omics” data in bioinformatics. Mass spectrometry (MS) based metabolomics data, such as those from gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS), frequently contain missing values that complicate quantitative analysis. Metabolomics datasets typically contain 10% to 20% missing values, which arise from analytical, computational, and biological causes. Imputing these missing values is therefore an important step before further metabolomics data analysis.

Objective: This paper introduces a new algorithm for missing value imputation in the presence of outliers for metabolomics data analysis.

Method: Currently, the best-known missing value imputation techniques for metabolomics data are k-nearest neighbours (kNN), random forest (RF), and zero imputation. However, these techniques are sensitive to outliers. In this paper, we propose an outlier-robust missing value imputation technique that imputes missing values in metabolomics data by minimizing a two-way empirical mean absolute error (MAE) loss function.
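
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way a two-way MAE criterion could be minimized, not the authors' implementation. It assumes a rank-1 model x_ij ≈ a_i * b_j fitted to the observed entries by alternating weighted medians, which minimizes the sum of absolute residuals one factor at a time. The function names `weighted_median` and `l1_rank1_impute`, the rank-1 restriction, the column-median initialization, and the fixed iteration count are all illustrative assumptions. The sketch also assumes that every column has at least one observed value.

```python
import numpy as np


def weighted_median(values, weights):
    """Return a value v minimizing sum_k weights[k] * |values[k] - v|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # First sorted position where the cumulative weight reaches half the total.
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return v[idx]


def l1_rank1_impute(X, n_iter=50):
    """Fill NaN entries of X with a rank-1 fit a_i * b_j (hypothetical helper).

    The fit reduces sum |x_ij - a_i * b_j| over the observed entries by
    updating the row factors and the column factors in turn.
    """
    X = np.asarray(X, dtype=float)
    obs = ~np.isnan(X)  # True where a value was measured
    n, p = X.shape

    # Assumed initialization: column medians of the observed values.
    b = np.array([np.median(X[obs[:, j], j]) for j in range(p)])
    a = np.ones(n)

    for _ in range(n_iter):
        # Row factors: for fixed b, the best a_i in the L1 sense is the
        # weighted median of x_ij / b_j with weights |b_j|.
        for i in range(n):
            m = obs[i] & (b != 0)
            if m.any():
                a[i] = weighted_median(X[i, m] / b[m], np.abs(b[m]))
        # Column factors: the same update with the roles of a and b swapped.
        for j in range(p):
            m = obs[:, j] & (a != 0)
            if m.any():
                b[j] = weighted_median(X[m, j] / a[m], np.abs(a[m]))

    fitted = np.outer(a, b)
    # Observed values are kept unchanged; only missing cells are filled.
    return np.where(obs, X, fitted)
```

Because each update is a weighted median rather than a weighted mean, a single extreme intensity in a row or column has limited influence on the fitted factors, which is the intuition behind preferring an absolute error loss to a squared error loss when outliers are present. A full method would likely need more than one component and a convergence check, both omitted here.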

Results: We investigated the performance of the proposed missing value imputation technique in comparison with the traditional imputation techniques, using both simulated and real data in the absence and presence of outliers.

Conclusion: The results of both the simulated and real data analyses show that the proposed outlier-robust missing value imputation technique performs better than the traditional imputation methods in both the absence and the presence of outliers.

Keywords: Metabolomics, missing data, missing value imputation, singular value decomposition (SVD), receiver operating characteristic (ROC) curve, support vector machine.



Article Details

Year: 2019
Page: [43 - 52]
Pages: 10
DOI: 10.2174/1574893612666171121154655
