Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Identifying Extreme Observations, Outliers and Noise in Clinical and Genetic Data

Author(s): Concepcion Arenas, Claudio Toma, Bru Cormand and Itziar Irigoien

Volume 12, Issue 2, 2017

Pages: 101-117 (17 pages)

DOI: 10.2174/1574893611666160606161031

Abstract

Background: A major current challenge is the treatment and interpretation of real data. Data sets are often high-dimensional, contain a small number of observations and are noisy. Furthermore, in recent years many approaches have been suggested for integrating continuous data with categorical or ordinal data, in order to capture the information that is lost in independent studies.

Objective: The aim of this paper is to develop a statistical tool for the detection of outliers that is applicable to features of any kind and to high-dimensional data.

Method: The data form an n × p matrix (n << p) in which the rows correspond to observations and the columns to features of any kind. The new procedure is based on the distances between all pairs of observations and produces a ranking by assigning each observation a value that reflects its degree of outlyingness. It was evaluated by simulation and on real data from clinical and genetic studies.
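
The idea of ranking observations by a distance-based outlyingness value can be illustrated with a minimal sketch. The abstract does not specify the distance or the outlyingness function used in the paper, so everything below is an assumption made for illustration only: a Gower-type distance is used to handle mixed continuous and categorical features, and the score is simply the median distance from each observation to all the others. The function names `gower_distances` and `outlyingness_ranking` are hypothetical, and the data are simulated.

```python
import numpy as np


def gower_distances(numeric, categorical):
    """Pairwise Gower-type distances for mixed data.

    Assumption: this distance is an illustrative choice, not necessarily
    the one used in the paper.
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    # Range-scale each numeric feature so it contributes a value in [0, 1].
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant features contribute zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges
    # Categorical features contribute 0 on a match and 1 on a mismatch.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)
    total = num_part.sum(axis=2) + cat_part.sum(axis=2)
    return total / (numeric.shape[1] + categorical.shape[1])


def outlyingness_ranking(distances):
    """Score each observation by its median distance to all the others.

    Assumption: the median distance is a simple stand-in for the paper's
    outlyingness value. Returns the scores and the indices sorted from
    most to least outlying.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Drop the zero diagonal so each row holds distances to the other n - 1.
    off_diag = d[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    scores = np.median(off_diag, axis=1)
    order = np.argsort(-scores)
    return scores, order


# Simulated example with n << p: 10 observations, 50 numeric and 5 categorical features.
rng = np.random.default_rng(0)
numeric = rng.normal(size=(10, 50))
numeric[0] += 8.0  # shift observation 0 so that it is an artificial outlier
categorical = rng.integers(0, 3, size=(10, 5))

dist = gower_distances(numeric, categorical)
scores, order = outlyingness_ranking(dist)
print("Most outlying observation:", order[0])
```

Because only the n × n distance matrix enters the ranking step, the number of features p affects the cost of computing the distances but not the ranking itself, which is one reason distance-based approaches are convenient when n << p.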

Results: The simulation studies showed that the procedure correctly identified the outliers, was robust to the masking effect and was useful in the detection of noise. With simulated two-sample microarray data sets, it correctly detected outliers, especially when many genes showed increased expression in only a small number of samples. The method was applied to data sets on adult lymphoid malignancies, human liver cancer and autism multiplex families, yielding good and valuable results.

Conclusion: The real-data and simulation studies show the efficiency of the procedure, which offers a useful tool for applications in which the detection of outliers or noise is relevant.

Keywords: Biomedical data, data depth, gene expression, microarray, noise, outlier, robust estimation.

© 2024 Bentham Science Publishers