Background: The treatment and interpretation of real data remains a major challenge. Data
sets are often high-dimensional, contain few observations and are noisy. Furthermore, in recent
years, many approaches have been suggested for integrating continuous with categorical/ordinal data in
order to capture information that is lost in independent studies.
Objective: The aim of this paper is to develop a statistical tool for outlier detection that is suited to any
kind of feature and to high-dimensional data.
Method: The data form an n × p matrix (n << p) in which the rows correspond to observations and the
columns to features of any kind. The new procedure is based on the distances between all pairs of observations
and ranks the observations by assigning each one a value that reflects its degree of outlyingness. It was
evaluated by simulation and on real data from clinical and genetic studies.
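
As an illustration only, a distance-based ranking of this kind could be sketched as follows. The sketch assumes a Gower-type distance for mixed continuous and categorical features and takes each observation's median distance to the other observations as its outlyingness value. Both choices, the function names gower_distances and outlyingness_ranking, and the toy data are assumptions made for this example; they are not the procedure evaluated in the paper.

```python
import numpy as np


def gower_distances(num, cat):
    """Pairwise Gower-type distances for mixed data (illustrative assumption).

    num: (n, p_num) float array of continuous features.
    cat: (n, p_cat) integer-coded array of categorical features.
    """
    # Range-scale each continuous feature so every column contributes in [0, 1].
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Simple 0/1 mismatch for categorical features.
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    total = d_num.sum(axis=2) + d_cat.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])


def outlyingness_ranking(dist):
    """Score each observation by its median distance to all the others.

    Returns the scores and the observation indices sorted from most to
    least outlying. The median is a hypothetical choice of summary.
    """
    n = dist.shape[0]
    # Drop the zero self-distances on the diagonal before summarizing.
    off_diag = dist[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    scores = np.median(off_diag, axis=1)
    order = np.argsort(-scores)
    return scores, order


# Toy data with n << p: 20 observations, 200 continuous and 50 categorical features.
rng = np.random.default_rng(0)
num = rng.normal(size=(20, 200))
num[0] += 5.0  # observation 0 is shifted in every continuous feature
cat = rng.integers(0, 3, size=(20, 50))

dist = gower_distances(num, cat)
scores, order = outlyingness_ranking(dist)
print("most outlying observations:", order[:3])
```

In this toy setting the shifted observation receives the largest score. A median is used here rather than a mean so that a single distant observation does not inflate the scores of all the others.
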
Results: The simulation studies showed that the procedure correctly identified the outliers, was robust
to the masking effect and was useful for detecting noise. With simulated two-sample microarray
data sets, it correctly detected outliers, especially when many genes showed increased expression in only a
small number of samples. The method was also applied to data sets from adult lymphoid malignancies, human
liver cancer and autism multiplex families, where it gave useful results.
Conclusion: The studies on real and simulated data show the efficiency of the procedure, which offers a
useful tool for applications in which the detection of outliers or noise is relevant.