Mahalanobis distances for ecological niche modelling and outlier detection: implications of sample size, error, and bias for selecting and parameterising a multivariate location and scatter method

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11436
Author(s):  
Thomas R. Etherington

The Mahalanobis distance is a statistical technique that has been used in statistics and data science for data classification and outlier detection, and in ecology to quantify species-environment relationships in habitat and ecological niche models. Mahalanobis distances are based on the location and scatter of a multivariate normal distribution, and can measure how distant any point in space is from the centre of this kind of distribution. Three different methods for calculating the multivariate location and scatter are commonly used: the sample mean and variance-covariance, the minimum covariance determinant, and the minimum volume ellipsoid. The minimum covariance determinant and minimum volume ellipsoid were developed to be robust to outliers by minimising the multivariate location and scatter for a subset of the full sample, with the proportion of the full sample forming the subset being controlled by a user-defined parameter. This outlier robustness means the minimum covariance determinant and the minimum volume ellipsoid are highly relevant for ecological niche analyses, which are usually based on natural history observations that are likely to contain errors. However, natural history observations will also contain extreme bias, to which the minimum covariance determinant and the minimum volume ellipsoid will also be sensitive. To provide guidance for selecting and parameterising a multivariate location and scatter method, a series of virtual ecological niche modelling experiments was conducted to demonstrate the performance of each multivariate location and scatter method under different levels of sample size, errors, and bias. The results show that there is no optimal modelling approach, and that choices need to be made based on the individual data and question. The sample mean and variance-covariance method will perform best on very small sample sizes if the data are free of error and bias. At larger sample sizes the minimum covariance determinant and minimum volume ellipsoid methods perform as well or better, but only if they are appropriately parameterised. Modellers who are more concerned about the prevalence of errors should retain a smaller proportion of the full data set, while modellers more concerned about the prevalence of bias should retain a larger proportion of the full data set. I conclude that Mahalanobis distances are a useful niche modelling technique, but only for questions relating to the fundamental niche of a species where the assumption of multivariate normality is reasonable. Users of the minimum covariance determinant and minimum volume ellipsoid methods must also clearly report their parameterisations so that the results can be interpreted correctly.
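As a rough illustration of the ideas in this abstract, the sketch below computes squared Mahalanobis distances from a location and scatter, first with the sample mean and variance-covariance and then with a simplified subset-based estimate in the spirit of the minimum covariance determinant. It is not the author's code. The function names, the `proportion` parameter name, the random-restart scheme, and the default values are all assumptions made for illustration, and the subset estimate omits the consistency corrections that production implementations apply.

```python
import numpy as np


def mahalanobis_sq(points, location, scatter):
    """Squared Mahalanobis distance of each row of `points` from `location`."""
    diff = np.atleast_2d(points) - location
    # Solve scatter @ x = diff.T rather than forming the explicit inverse.
    solved = np.linalg.solve(scatter, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)


def classical_estimate(sample):
    """Sample mean and sample variance-covariance matrix."""
    return sample.mean(axis=0), np.cov(sample, rowvar=False)


def subset_estimate(sample, proportion=0.75, n_starts=10, max_iter=50, seed=0):
    """Simplified MCD-style estimate (hypothetical helper, illustration only).

    Keeps the `proportion` of rows whose scatter matrix has the smallest
    determinant found over a few random restarts. `proportion` stands in
    for the user-defined parameter mentioned in the abstract. The subset
    must contain more rows than there are variables, otherwise the scatter
    matrix is singular.
    """
    rng = np.random.default_rng(seed)
    n = sample.shape[0]
    h = int(np.ceil(proportion * n))  # number of rows retained
    best_det, best = np.inf, None
    for _ in range(n_starts):
        subset = np.sort(rng.choice(n, size=h, replace=False))
        for _ in range(max_iter):
            location, scatter = classical_estimate(sample[subset])
            dist = mahalanobis_sq(sample, location, scatter)
            # Concentration step: keep the h rows closest to the current fit.
            new_subset = np.sort(np.argsort(dist)[:h])
            if np.array_equal(new_subset, subset):
                break
            subset = new_subset
        location, scatter = classical_estimate(sample[subset])
        det = np.linalg.det(scatter)
        if det < best_det:
            best_det, best = det, (location, scatter)
    return best
```

Lowering `proportion` discards more of the sample, which guards against errors at the cost of more sensitivity to bias; raising it does the opposite, which matches the trade-off the abstract describes.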

2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Ashkan Shabbak ◽  
Habshah Midi

The Hotelling T² statistic is the most popular statistic used in multivariate control charts to monitor multiple qualities. However, this statistic is easily affected by the existence of more than one outlier in the data set. To rectify this problem, robust control charts, which are based on the minimum volume ellipsoid and the minimum covariance determinant, have been proposed. Most researchers assess the performance of multivariate control charts based on the number of signals without paying much attention to whether those signals are really outliers. With due respect, we propose to evaluate control charts not only based on the number of detected outliers but also with respect to their correct positions. In this paper, an Upper Control Limit based on the median and the median absolute deviation is also proposed. The results of this study signify that the proposed Upper Control Limit improves the detection of correct outliers but that it suffers from a swamping effect when the positions of outliers are not taken into consideration. Finally, a robust control chart based on the diagnostic robust generalised potential procedure is introduced to remedy this drawback.
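The abstract does not give the formula for the proposed limit, so the following is only a sketch of one common way to build a cut-off from the median and the median absolute deviation (MAD). The form "median plus k times MAD", the multiplier `k = 3`, the unscaled MAD, and the function names are assumptions, not the authors' definition.

```python
from statistics import median


def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    values = list(values)
    centre = median(values)
    return median([abs(v - centre) for v in values])


def median_mad_limit(values, k=3.0):
    """Hypothetical upper control limit: median + k * MAD.

    `k` is an assumed tuning constant, not a value taken from the paper.
    """
    values = list(values)
    return median(values) + k * mad(values)


def flag_signals(values, k=3.0):
    """Return the positions of the values that exceed the limit."""
    limit = median_mad_limit(values, k)
    return [i for i, v in enumerate(values) if v > limit]
```

Returning positions rather than a count reflects the abstract's point that a chart should be judged on which observations it flags, not only on how many.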


2018 ◽  
Vol 14 (1) ◽  
pp. 46
Author(s):  
Erna Tri Herdiani

An outlier is an observation whose pattern does not follow the majority of the data. Outliers in the multivariate case are very difficult to detect, particularly when the dimension exceeds 2. The difficulty increases when the data set is large, that is, when the number of variables becomes large. Outlier detection methods have long been under development, and several are used for outlier labelling so that the data can be separated into observations suspected of being outliers and the bulk of the data set. These methods are the minimum volume ellipsoid (MVE), the minimum covariance determinant (MCD), and the minimum vector variance (MVV). Of the three, MVV has the fastest computation time. Because the MVV algorithm orders the data using the Mahalanobis distance, this paper modifies the ordering criterion so that it is no longer written in terms of the inverse of the variance-covariance matrix. The result is that MVV becomes faster under the new criterion, with the same accuracy as the original MVV, and the method is applied to real and simulated data.


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Asokan Mulayath Variyath ◽  
Jayasankar Vattathoor

Hotelling's T² control charts are widely used in industry to monitor multivariate processes. The classical estimators used in T² control charts, the sample mean and the sample covariance, are highly sensitive to outliers in the data. In Phase-I monitoring, control limits are arrived at using historical data after identifying and removing the multivariate outliers. We propose Hotelling's T² control charts with high-breakdown robust estimators based on the reweighted minimum covariance determinant (RMCD) and the reweighted minimum volume ellipsoid (RMVE) to monitor multivariate observations in Phase-I data. We assessed the performance of these robust control charts based on a large number of Monte Carlo simulations by considering different data scenarios and found that the proposed control charts have better performance compared to existing methods.


2009 ◽  
Author(s):  
Renata L Stange ◽  
Fabiana S Santana ◽  
Bruna Buani ◽  
Pedro L. P Correa ◽  
Antonio M Saraiva
