Identification of Multivariate Outliers: A Performance Study

Three methods for the identification of multivariate outliers (Rousseeuw and Van Zomeren, 1990; Becker and Gather, 1999; Filzmoser et al., 2005) are compared. They are based on the Mahalanobis distance, which is made resistant to outliers and model deviations by robust estimation of location and covariance. The comparison is made by means of a simulation study. Not only multivariate normally distributed data but also heavy-tailed and asymmetric distributions are considered. The simulations focus on low-dimensional (p = 5) and high-dimensional (p = 30) data.
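
As a rough illustration of the general idea (not a reproduction of any of the three cited methods), the sketch below computes squared Mahalanobis distances from a crudely robust location and covariance fit. The function name `robust_mahalanobis_sq`, the 75% subset size, and the trimming scheme are assumptions made for this example; the cutoff 12.83 is the 0.975 quantile of the chi-square distribution with 5 degrees of freedom.

```python
import numpy as np

def robust_mahalanobis_sq(X, keep_frac=0.75, n_steps=10):
    """Squared Mahalanobis distances from a simple trimmed location/covariance fit.

    Hypothetical helper: a simplified concentration scheme, not the estimator
    used in the cited papers.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = int(np.ceil(keep_frac * n))

    # Robust start: coordinate-wise median and MAD-based scales (diagonal covariance).
    center = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - center), axis=0)
    cov = np.diag(mad ** 2)

    for _ in range(n_steps):
        diff = X - center
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        # Refit on the h observations that currently look least outlying.
        subset = np.argsort(d2)[:h]
        center = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)

    diff = X - center
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Toy data with p = 5: 200 clean points plus 10 planted outliers shifted by 10.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((200, 5)),
               rng.standard_normal((10, 5)) + 10.0])

CHI2_975_DF5 = 12.83  # 0.975 chi-square quantile, 5 degrees of freedom
flagged = robust_mahalanobis_sq(X) > CHI2_975_DF5
print("planted outliers flagged:", int(flagged[200:].sum()), "of 10")
```

Because the covariance is refit on a trimmed subset without a consistency correction, distances of clean points are inflated and some of them will also exceed the cutoff; a production analysis would use an established robust estimator instead.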

Combining clinical and molecular data in regression prediction models: insights from a simulation study

Briefings in Bioinformatics ◽ 10.1093/bib/bbz136 ◽ 2019 ◽ Vol 21 (6) ◽ pp. 1904-1919 ◽ Cited By ~ 2

Author(s): Riccardo De Bin ◽ Anne-Laure Boulesteix ◽ Axel Benner ◽ Natalia Becker ◽ Willi Sauerbrei

Keyword(s): Prediction Model ◽ Simulation Study ◽ Prediction Models ◽ Molecular Data ◽ Data Sources ◽ Correlation Structure ◽ High Dimensional ◽ Sources Of Information ◽ Gene Expressions ◽ Low Dimensional

Data integration, i.e. the use of different sources of information for data analysis, is becoming one of the most important topics in modern statistics. Especially in, but not limited to, biomedical applications, a relevant issue is the combination of low-dimensional (e.g. clinical data) and high-dimensional (e.g. molecular data such as gene expressions) data sources in a prediction model. Both the different characteristics of the data and the complex correlation structure within and between the two data sources pose challenges. In this paper, we investigate these issues via simulations, providing some useful insight into strategies for combining low- and high-dimensional data in a regression prediction model. In particular, we focus on the effect of the correlation structure on the results, while accounting for the influence of our specific choices in the design of the simulation study.

High-dimensional data visualisation with the grand tour

EPJ Web of Conferences ◽ 10.1051/epjconf/202024506018 ◽ 2020 ◽ Vol 245 ◽ pp. 06018

Author(s): Ursula Laa

Keyword(s): Dimensional Space ◽ High Dimensional Data ◽ Projection Pursuit ◽ High Dimensional ◽ Grand Tour ◽ Multiple Parameters ◽ Visual Identification ◽ Multivariate Outliers ◽ Free Parameters ◽ Low Dimensional

In physics we often encounter high-dimensional data, in the form of multivariate measurements or of models with multiple free parameters. The information encoded is increasingly explored using machine learning, but is not typically explored visually. The barrier tends to be visualising beyond 3D, but systematic approaches for this exist in the statistics literature. I use examples from particle and astrophysics to show how we can use the “grand tour” for such multidimensional visualisations, for example to explore grouping in high dimension and for visual identification of multivariate outliers. I then discuss the idea of projection pursuit, i.e. searching the high-dimensional space for “interesting” low dimensional projections, and illustrate how we can detect complex associations between multiple parameters.
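
The core ingredient of a tour, viewing the data through a sequence of low-dimensional projection planes, can be sketched as follows. This is a simplified illustration: a real grand tour interpolates smoothly between successive planes, whereas this sketch only draws random target planes. The names `random_frame` and `tour_views` are hypothetical.

```python
import numpy as np

def random_frame(p, d=2, rng=None):
    """Return a p x d matrix with orthonormal columns (a random projection plane)."""
    rng = np.random.default_rng() if rng is None else rng
    # QR decomposition of a Gaussian matrix gives an orthonormal basis.
    q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return q

def tour_views(X, n_views=5, seed=0):
    """Yield (frame, projected data) pairs for a sequence of random 2D views."""
    rng = np.random.default_rng(seed)
    for _ in range(n_views):
        frame = random_frame(X.shape[1], rng=rng)
        yield frame, X @ frame

# Toy usage: six-dimensional data viewed through five random planes.
X = np.random.default_rng(1).standard_normal((100, 6))
for frame, view in tour_views(X):
    print(view.shape)  # each view is a 2D scatter that could be plotted
```

Projection pursuit would replace the random draw with a search over frames that maximizes some index of "interestingness" computed on the projected points.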

An Alternative to Cohen's κ

European Psychologist ◽ 10.1027/1016-9040.11.1.12 ◽ 2006 ◽ Vol 11 (1) ◽ pp. 12-24 ◽ Cited By ~ 19

Author(s): Alexander von Eye

Keyword(s): Simulation Study ◽ Null Hypothesis ◽ Categorical Variables ◽ Alternative Measure ◽ Rater Agreement ◽ Verbal Processing ◽ Heavy Tailed ◽ Applicant Selection

At the level of manifest categorical variables, a large number of coefficients and models for the examination of rater agreement have been proposed and used. The most popular of these is Cohen's κ. In this article, a new coefficient, κ_s, is proposed as an alternative measure of rater agreement. Both κ and κ_s allow researchers to determine whether agreement in groups of two or more raters is significantly beyond chance. Stouffer's z is used to test the null hypothesis that κ_s = 0. In addition to evaluating rater agreement in a fashion parallel to κ, the coefficient κ_s allows one to (1) examine subsets of cells in agreement tables, (2) examine cells that indicate disagreement, (3) consider alternative chance models, (4) take covariates into account, and (5) compare independent samples. Results from a simulation study are reported, which suggest that (a) the four measures of rater agreement, Cohen's κ, Brennan and Prediger's κ_n, raw agreement, and κ_s, are sensitive to the same data characteristics when evaluating rater agreement and (b) both the z-statistic for Cohen's κ and Stouffer's z for κ_s are unimodally and symmetrically distributed, but slightly heavy-tailed. Examples use data from verbal processing and applicant selection.
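
For orientation, three of the measures named above have simple standard definitions for two raters, sketched below on a made-up agreement table. The new coefficient κ_s is not reproduced here, since its definition is not given in the abstract. The function names and the example table are assumptions for illustration.

```python
import numpy as np

def agreement_measures(table):
    """Raw agreement, Cohen's kappa and Brennan-Prediger kappa_n for a k x k table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    k = table.shape[0]
    p_obs = np.trace(table) / n                      # observed agreement
    # Chance agreement under independence of the two raters' marginals.
    p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2
    kappa = (p_obs - p_chance) / (1 - p_chance)
    kappa_n = (p_obs - 1 / k) / (1 - 1 / k)          # chance fixed at 1/k
    return {"raw": p_obs, "kappa": kappa, "kappa_n": kappa_n}

def stouffer_z(z_values):
    """Stouffer's combined z: the sum of z-scores divided by the square root of their count."""
    z = np.asarray(z_values, dtype=float)
    return z.sum() / np.sqrt(z.size)

# Hypothetical 2 x 2 table: rows are rater A's categories, columns rater B's.
print(agreement_measures([[45, 15], [25, 15]]))
```

On this table raw agreement is 0.6, while κ (about 0.13) and κ_n (0.2) differ because they assume different chance models.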

Robust Estimation of a High-Dimensional Integrated Covariance Matrix

SSRN Electronic Journal ◽ 10.2139/ssrn.1996261 ◽ 2012

Author(s): Takayuki Morimoto ◽ Shuichi Nagata

Keyword(s): Covariance Matrix ◽ Robust Estimation ◽ High Dimensional ◽ Integrated Covariance

Classification of Brainwaves for Sleep Stages by High-Dimensional FFT Features from EEG Signals

Applied Sciences ◽ 10.3390/app10051797 ◽ 2020 ◽ Vol 10 (5) ◽ pp. 1797 ◽ Cited By ~ 2

Author(s): Mera Kartika Delimayanti ◽ Bedy Purnama ◽ Ngoc Giang Nguyen ◽ Mohammad Reza Faisal ◽ Kunti Robiatul Mahmudah ◽ ...

Keyword(s): Machine Learning ◽ Sleep Stage ◽ Machine Learning Algorithms ◽ High Dimensional ◽ Sleep Stages ◽ Eeg Signals ◽ Stage Classification ◽ Sleep Stage Classification ◽ Low Dimensional

Manual classification of sleep stages is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. Previous works have applied low-dimensional fast Fourier transform (FFT) features together with many machine learning algorithms. In this paper, we demonstrate the use of features extracted from EEG signals via FFT to improve the performance of automated sleep stage classification through machine learning methods. Unlike previous works using FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2–6 classes. Using the expanded version of the Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the-art methods. This result indicates that high-dimensional FFT features in combination with simple feature selection are effective for improving automated sleep stage classification.
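
A minimal sketch of how thousands of FFT features per epoch can arise is given below. The 100 Hz sampling rate, the 30-second epoch length, the Hann window, and the function name `fft_features` are assumptions made for this example, not details taken from the paper; feature selection and classification are omitted.

```python
import numpy as np

def fft_features(signal, fs=100.0, epoch_sec=30.0):
    """Split a 1D signal into epochs and return FFT magnitude features per epoch."""
    signal = np.asarray(signal, dtype=float)
    samples = int(fs * epoch_sec)
    n_epochs = signal.size // samples
    # Drop any trailing partial epoch, then shape as (epochs, samples).
    epochs = signal[: n_epochs * samples].reshape(n_epochs, samples)
    window = np.hanning(samples)                 # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(epochs * window, axis=1))
    freqs = np.fft.rfftfreq(samples, d=1.0 / fs)
    return freqs, spectrum

# Toy usage: 60 s of a synthetic 10 Hz oscillation gives two epochs.
t = np.arange(6000) / 100.0
freqs, feats = fft_features(np.sin(2 * np.pi * 10.0 * t))
print(feats.shape)                       # one row of 1501 features per epoch
print(freqs[np.argmax(feats[0])])        # dominant frequency of the first epoch
```

With these assumed settings each epoch yields 1501 magnitude values, which illustrates why a feature selection step becomes useful before classification.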

A Nonlinear Maximum Correntropy Information Filter for High-Dimensional Neural Decoding

Entropy ◽ 10.3390/e23060743 ◽ 2021 ◽ Vol 23 (6) ◽ pp. 743

Author(s): Xi Liu ◽ Shuhang Chen ◽ Xiang Shen ◽ Xiang Zhang ◽ Yiwen Wang

Keyword(s): State Estimation ◽ Measurement Model ◽ High Dimensional ◽ Neural Firing ◽ The Neural Network ◽ Information Filter ◽ Critical Technology ◽ Dimensional Measurements ◽ Non Gaussian ◽ Low Dimensional

Neural signal decoding is a critical technology in brain-machine interfaces (BMIs) for interpreting movement intention from multi-neural activity collected from paralyzed patients. As a commonly used decoding algorithm, the Kalman filter is often applied to derive the movement states from high-dimensional neural firing observations. However, its performance is limited for noisy nonlinear neural systems with high-dimensional measurements. In this paper, we propose a nonlinear maximum correntropy information filter, aiming at better state estimation in the filtering process for a noisy high-dimensional measurement system. We reconstruct the measurement model between the high-dimensional measurements and low-dimensional states using a neural network, and derive the state estimate using the correntropy criterion to cope with non-Gaussian noise and eliminate large initial uncertainty. Moreover, analyses of convergence and robustness are given. The effectiveness of the proposed algorithm is evaluated by applying it to multiple segments of neural spiking data from two rats to interpret the movement states when the subjects perform a two-lever discrimination task. Our results demonstrate better and more robust state estimation performance compared with other filters.
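
To give a feel for why a correntropy criterion copes with non-Gaussian noise, the sketch below estimates a scalar location by maximizing correntropy and compares it with the ordinary mean. It is not the proposed filter: the scalar setting, the kernel width, the unnormalized Gaussian kernel, and the function names are all assumptions for illustration.

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample correntropy of x and y with an (unnormalized) Gaussian kernel."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2 * sigma ** 2)))

def mcc_location(samples, sigma=1.0, n_iter=50):
    """Location estimate that maximizes correntropy, via fixed-point iteration."""
    samples = np.asarray(samples, dtype=float)
    m = np.median(samples)                      # robust starting point
    for _ in range(n_iter):
        # Kernel weights shrink toward zero for samples far from the estimate.
        w = np.exp(-(samples - m) ** 2 / (2 * sigma ** 2))
        m = np.sum(w * samples) / np.sum(w)
    return m

# Toy data: measurements of a true value 2.0 with a few gross outliers at 50.
rng = np.random.default_rng(1)
data = np.concatenate([2.0 + 0.5 * rng.standard_normal(100), np.full(10, 50.0)])
print("mean:", data.mean())
print("maximum-correntropy estimate:", mcc_location(data))
```

The mean is dragged toward the outliers, while the kernel weights effectively ignore them; a correntropy-based filter applies the same kind of weighting to measurement residuals.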

Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors

Journal of Econometrics ◽ 10.1016/j.jeconom.2021.05.006 ◽ 2021

Author(s): Dongxiao Han ◽ Jian Huang ◽ Yuanyuan Lin ◽ Guohao Shen

Keyword(s): High Dimensional ◽ Heavy Tailed

PSS Business Case Map: Supporting Idea Generation in PSS Design

Volume 3: 38th Design Automation Conference, Parts A and B ◽ 10.1115/detc2012-70692 ◽ 2012 ◽ Cited By ~ 2

Author(s): Fumiya Akasaka ◽ Kazuki Fujita ◽ Yoshiki Shimomura

Keyword(s): Idea Generation ◽ Business Case ◽ Literature Survey ◽ The Self ◽ High Dimensional ◽ Self Organizing Map ◽ Two Dimensional ◽ Service Type ◽ Business Cases ◽ Low Dimensional

This paper proposes the PSS Business Case Map as a tool to support designers’ idea generation in PSS design. The map visualizes the similarities among PSS business cases in a two-dimensional diagram. To make the map, PSS business cases are first collected by conducting, for example, a literature survey. The collected business cases are then classified according to multiple aspects that characterize each case, such as its product type, service type, and target customer. Based on the results of this classification, the similarities among the cases are calculated and visualized using the Self-Organizing Map (SOM) technique. A SOM is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional) view of high-dimensional data. The visualization result is offered to designers in the form of a two-dimensional map, which is called the PSS Business Case Map. By using the map, designers can locate the position of their current business and acquire ideas for the servitization of their business.
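
The SOM technique mentioned above can be sketched in a few lines. This is a generic minimal SOM on made-up numeric data, not the authors' implementation; the grid size, learning-rate and neighbourhood schedules, and the function names are assumptions. In the paper's setting the input vectors would encode the classification of each business case.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid coordinates (row, col) of the node whose weight vector is closest to x."""
    d2 = np.sum((weights - x) ** 2, axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d2), d2.shape))

def train_som(data, rows=4, cols=4, n_iter=2000, lr0=0.5,
              sigma0=2.0, sigma_end=0.5, seed=0):
    """Train a rows x cols self-organizing map and return its weight grid."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    # Initialise node weights from randomly chosen data points.
    weights = data[rng.integers(0, n, size=rows * cols)].reshape(rows, cols, d).copy()
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                          # learning rate decays linearly
        sigma = sigma0 * (sigma_end / sigma0) ** frac    # neighbourhood shrinks
        x = data[rng.integers(0, n)]
        bmu = np.array(best_matching_unit(weights, x))
        grid_d2 = np.sum((grid - bmu) ** 2, axis=-1)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))          # neighbourhood function
        weights += lr * h[..., None] * (x - weights)     # pull nodes toward x
    return weights

# Toy usage: two well-separated groups of 3D points land on different map cells.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, (50, 3)), rng.normal(10.0, 0.3, (50, 3))])
som = train_som(data)
print(best_matching_unit(som, np.zeros(3)), best_matching_unit(som, np.full(3, 10.0)))
```

Each input is placed at the grid cell of its best matching unit, so similar cases end up in nearby cells of the two-dimensional map.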

Perspectives of the high-dimensional dynamics of neural microcircuits from the point of view of low-dimensional readouts

Complexity ◽ 10.1002/cplx.10089 ◽ 2003 ◽ Vol 8 (4) ◽ pp. 39-50 ◽ Cited By ~ 11

Author(s): Stefan Häusler ◽ Henry Markram ◽ Wolfgang Maass

Keyword(s): Point Of View ◽ High Dimensional ◽ Low Dimensional

Improved interactive color visualization approach for hyperspectral images

Information Visualization ◽ 10.1177/14738716211048142 ◽ 2021 ◽ pp. 147387162110481

Author(s): Haijun Yu ◽ Shengyang Li

Keyword(s): Real Time ◽ Hyperspectral Images ◽ High Dimensional ◽ Interactive Control ◽ Output Image ◽ Dr Method ◽ The Rich ◽ Low Dimensional ◽ Color Visualization ◽ Fusion Coefficient

Hyperspectral images (HSIs) have become increasingly prominent as they can maintain the subtle spectral differences of the imaged objects. Designing approaches and tools for analyzing HSIs presents a unique set of challenges due to their high-dimensional characteristics. An improved color visualization approach is proposed in this article to achieve communication between users and HSIs in the field of remote sensing. Under the real-time interactive control and color visualization, this approach can help users intuitively obtain the rich information hidden in original HSIs. Using the dimensionality reduction (DR) method based on band selection, high-dimensional HSIs are reduced to low-dimensional images. Through drop-down boxes, users can freely specify images that participate in the combination of RGB channels of the output image. Users can then interactively and independently set the fusion coefficient of each image within an interface based on concentric circles. At the same time, the output image will be calculated and visualized in real time, and the information it reflects will also be different. In this approach, channel combination and fusion coefficient setting are two independent processes, which allows users to interact more flexibly according to their needs. Furthermore, this approach is also applicable for interactive visualization of other types of multi-layer data.
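
One plausible reading of the channel combination step is a weighted blend of user-chosen bands per output channel, sketched below. The abstract does not specify the fusion formula, so the normalized weighted sum, the min-max scaling for display, the `selection` dictionary layout, and the function name `fuse_rgb` are all assumptions; the band selection step is taken as already done.

```python
import numpy as np

def fuse_rgb(cube, selection):
    """Blend chosen bands into an RGB image.

    cube: array of shape (height, width, bands).
    selection: {"R": [(band_index, coefficient), ...], "G": [...], "B": [...]}
    (hypothetical layout standing in for the user's interactive choices).
    """
    cube = np.asarray(cube, dtype=float)
    out = np.zeros(cube.shape[:2] + (3,))
    for c, name in enumerate("RGB"):
        bands, coefs = zip(*selection[name])
        coefs = np.asarray(coefs, dtype=float)
        # Weighted sum of the chosen bands, with coefficients normalized to sum to 1.
        mixed = cube[:, :, list(bands)] @ (coefs / coefs.sum())
        lo, hi = mixed.min(), mixed.max()
        # Min-max scale each channel to [0, 1] for display.
        out[:, :, c] = (mixed - lo) / (hi - lo) if hi > lo else 0.0
    return out

# Toy usage: a random 6-band cube and an arbitrary choice of bands and coefficients.
cube = np.random.default_rng(0).random((4, 5, 6))
rgb = fuse_rgb(cube, {"R": [(0, 1.0)], "G": [(1, 0.7), (2, 0.3)], "B": [(4, 0.5), (5, 0.5)]})
print(rgb.shape)
```

Keeping the band choice and the coefficients as separate inputs mirrors the abstract's point that channel combination and fusion coefficient setting are independent steps.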
