A Review on Outliers-Detection Methods for Multivariate Data

Journal of Statistical Modelling and Analytics ◽

10.22452/josma.vol3no1.1 ◽

2021 ◽

Vol 3 (1) ◽

pp. 1-15

Author(s):

Sharifah Sakinah Syed Abd Mutalib ◽

Siti Zanariah Satari ◽

Wan Nur Syahidah Wan Yusoff

Keyword(s):

Cluster Analysis ◽

Multivariate Analysis ◽

Outlier Detection ◽

High Dimension ◽

Detection Method ◽

Multivariate Data ◽

Projection Pursuit ◽

Detection Methods ◽

Detection Of Outliers

Data in practice are often high-dimensional and multivariate in nature. Detecting outliers has long been one of the problems in multivariate analysis. It is difficult in multivariate data, and graphical inspection alone is not sufficient. This paper gives a brief, nontechnical review of outlier detection methods for multivariate data: the projection pursuit method, methods based on robust distance, and cluster analysis. The strengths and weaknesses of each method are briefly discussed.

Use of Highly Discriminating Morphological Characters and Isozymes in the Study of Spanish Chestnut Cultivars

Journal of the American Society for Horticultural Science ◽

10.21273/jashs.131.6.770 ◽

2006 ◽

Vol 131 (6) ◽

pp. 770-779 ◽

Cited By ~ 21

Author(s):

Santiago Pereira-Lorenzo ◽

María Belén Díaz-Hernández ◽

Ana María Ramos-Cabrer

Keyword(s):

Principal Component Analysis ◽

Cluster Analysis ◽

Multivariate Analysis ◽

Castanea Sativa ◽

Principal Component ◽

Market Value ◽

Morphological Characters ◽

Minor Importance ◽

Breeding Programs

Morphological characters (six traits) and isozymes (four systems, five loci) were used to discriminate between Spanish chestnut cultivars (Castanea sativa Mill.) from the Iberian Peninsula. A total of 701 accessions (representing 168 local cultivars) were analyzed from collections made between 1989 and 2003 in the main chestnut growing areas: 31 were from Andalucía (12 cultivars), 293 from Asturias (65 cultivars), 25 from Castilla-León (nine cultivars), four from Extremadura (two cultivars) and 348 from Galicia (80 cultivars). Data were synthesized using multivariate analysis, principal component analysis, and cluster analysis. A total of 152 Spanish cultivars were verified: 58 cultivars of major importance and 94 of minor importance, of which 18 had high intracultivar variation. Thirty-seven cultivars were clustered into 14 synonymous groups. Six of these were from Galicia, one from Castilla-León (El Bierzo), four from Asturias, one from Asturias and Castilla-León (El Bierzo), and two from Asturias, Castilla-León (El Bierzo), and Galicia. The chestnut cultivars from Galicia and Asturias were undifferentiated in genetic terms, indicating that they are not genetically isolated. Overall, chestnut cultivars from southern Spain showed the least variation. Many (58%) of Spanish cultivars produced more than 100 nuts/kg; removing this low market-value character will be a high priority. The data obtained will be of use in chestnut breeding programs in Spain and elsewhere.

Outlier Detection in Growth Data: Beyond Biologically Implausible Values

Current Developments in Nutrition ◽

10.1093/cdn/nzaa056_021 ◽

2020 ◽

Vol 4 (Supplement_2) ◽

pp. 1174-1174

Author(s):

Paraskevi Massara ◽

Robert Bandsma ◽

Celine Bourdon ◽

Jonathon Maguire ◽

Elena Comelli ◽

...

Keyword(s):

Outlier Detection ◽

Sensitivity And Specificity ◽

Detection Method ◽

Nutritional Assessment ◽

Empirical Method ◽

Child Growth ◽

Detection Methods ◽

Healthy Children ◽

Growth Data ◽

Growth Standards

Objectives: Eliminating anthropometric measurement error and employing detection methods for outliers and biologically implausible values (BIV) that are adapted to longitudinal measurements are important for the study of growth. This work aimed to review and assess the accuracy of the available BIV and outlier detection methods and to propose a growth trajectory outlier detection method.

Methods: We included 2354 infants from the Applied Research Group for Kids (TARGet Kids!) cohort, based in Toronto (ON, Canada), which recruits healthy children from birth to 5 years of age. We considered infants with at least 8 length and weight measurements available between the 1st and the 24th month of age. Weight-for-length z-scores (wflz) were calculated using the WHO growth standards. Outlier measurements were randomly introduced in 5% of the wflz measurements using a normal distribution (μ = 0, σ = 1). We employed 4 outlier detection methods: an empirical detection method for BIV using the cut-offs derived from the WHO Child Growth Standards, a clustering method, a method based on cluster prototypes for individual outlier measurements, and a method based on cluster prototypes for entire growth trajectories. Each method was applied individually and evaluated using the sensitivity and specificity indexes based on the manually introduced outliers. We also calculated the Kappa statistic to evaluate the agreement of each method against the manual outliers.

Results: After excluding premature (<37 weeks) and low birth weight (<1500 g) neonates and children with missing length and weight measurements, we analyzed 393 children with a total of 3144 measurements. Sensitivity and specificity for the four methods ranged between 4.4%–55.0% and 83.7%–99.7%, respectively, with kappa being non-significant (P > 0.05) only for the empirical method. The clustering detection method reported a higher detection rate, while the empirical method found most of the BIV but few of the remaining outliers.

Conclusions: BIV account for a small portion of the possible outliers in growth datasets. We show that additional statistical or model-based methods are required for a more comprehensive outlier detection process, which has implications for growth analysis and nutritional assessment.

Funding Sources: Joannah and Brian Lawson Center for Child Nutrition, Connaught Fund, Onassis Foundation.
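
The evaluation described in this abstract (sensitivity, specificity and the Kappa statistic against manually introduced outliers) can be illustrated with a minimal sketch. The function name `detection_metrics` and the toy labels are assumptions made for illustration; they are not taken from the study, and Cohen's kappa is assumed to be the Kappa statistic meant.

```python
def detection_metrics(truth, flagged):
    """Sensitivity, specificity and Cohen's kappa for a binary outlier detector.

    truth[i] is 1 if measurement i is a known (introduced) outlier, else 0.
    flagged[i] is 1 if the detector marked measurement i as an outlier, else 0.
    Assumes both classes occur in `truth` and agreement by chance is below 1.
    """
    pairs = list(zip(truth, flagged))
    tp = sum(1 for t, f in pairs if t and f)          # outliers correctly flagged
    fn = sum(1 for t, f in pairs if t and not f)      # outliers missed
    fp = sum(1 for t, f in pairs if not t and f)      # normal points wrongly flagged
    tn = sum(1 for t, f in pairs if not t and not f)  # normal points left alone
    n = tp + fn + fp + tn

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return sensitivity, specificity, kappa


# Hypothetical toy data: 2 true outliers among 10 measurements.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
flagged = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
sens, spec, kappa = detection_metrics(truth, flagged)
print(sens, spec, round(kappa, 3))
```

A detector can score high specificity while missing most outliers, which is why the abstract reports sensitivity alongside it.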

Outlier Detection Method for Flash Flood Disaster Monitoring Data based on Information Entropy

Journal of Physics Conference Series ◽

10.1088/1742-6596/2138/1/012013 ◽

2021 ◽

Vol 2138 (1) ◽

pp. 012013

Author(s):

Yongzhi Chen ◽

Ziao Xu ◽

Chaoqun Niu

Keyword(s):

Outlier Detection ◽

Information Entropy ◽

Detection Method ◽

Flash Flood ◽

False Positive Rate ◽

Flood Disaster ◽

Detection Methods ◽

Positive Rate ◽

Disaster Monitoring ◽

Local Outlier

In research on flash flood disaster monitoring and early warning, the Internet of Things is widely used for real-time information collection. The large volumes of data collected by sensors contain anomalies such as noise, repetition and errors, which lead to false alarms, lower prediction accuracy and other problems. Exploiting the observation that outliers in a sensor data stream cause marked fluctuations in information entropy, this paper proposes a local outlier detection method based on information entropy and optimized with a sliding window and LOF (Local Outlier Factor). The method can be used to improve data quality and thus the accuracy of disaster prediction. It is applied to the data stream of a water sensor, and the experimental results show that it can accurately detect outliers. Compared with existing detection methods that rely only on data distance, the test positive rate is improved and the false positive rate is reduced.
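
The idea that an outlier entering a sliding window changes the window's information entropy can be sketched as follows. This is only an illustration of that one idea, not the paper's method: the binning scheme, the window size, the threshold and the function names `shannon_entropy` and `entropy_jumps` are all assumptions, and the LOF refinement step is omitted.

```python
import math
from collections import Counter


def shannon_entropy(values, bin_width=1.0):
    """Shannon entropy (in bits) of values after grouping them into fixed-width bins."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())


def entropy_jumps(stream, window=4, bin_width=1.0, threshold=0.5):
    """Return start indices of windows whose entropy differs from the
    previous window's entropy by more than `threshold` bits."""
    entropies = [
        shannon_entropy(stream[i:i + window], bin_width)
        for i in range(len(stream) - window + 1)
    ]
    return [
        i for i in range(1, len(entropies))
        if abs(entropies[i] - entropies[i - 1]) > threshold
    ]


# Hypothetical water-level readings with one spike at index 4.
readings = [5.1, 5.2, 5.3, 5.4, 9.9, 5.2, 5.3, 5.1]
print(entropy_jumps(readings))  # window starting at index 1 is the first to contain the spike
```

A flagged window only localizes the change to a handful of readings; a second step (LOF in the paper) would be needed to pick out the individual outlier inside it.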

An Improved Outlier Detection Method in High-dimension Based on Weighted Hypergraph

2009 Second International Symposium on Electronic Commerce and Security ◽

10.1109/isecs.2009.54 ◽

2009 ◽

Author(s):

YinZhao Li ◽

Di Wu ◽

JiaDong Ren ◽

ChangZhen Hu

Keyword(s):

Outlier Detection ◽

High Dimension ◽

Detection Method ◽

Weighted Hypergraph

A Novel Outlier Detection Method for Multivariate Data

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2020.3036524 ◽

2020 ◽

pp. 1-1

Author(s):

Yahya Almardeny ◽

Noureddine Boujnah ◽

Frances Cleary

Keyword(s):

Outlier Detection ◽

Detection Method ◽

Multivariate Data

Anomaly Pattern Detection in Streaming Data Based on the Transformation to Multiple Binary-Valued Data Streams

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2022-0002 ◽

2021 ◽

Vol 12 (1) ◽

pp. 19-27

Author(s):

Taegong Kim ◽

Cheong Hee Park

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

Detection Method ◽

Binary Classification ◽

Streaming Data ◽

Pattern Detection ◽

Detection Methods ◽

Anomaly Pattern ◽

Isolation Forest

Anomaly pattern detection in a data stream aims to detect the time point at which outliers begin to occur abnormally. Recently, a method for anomaly pattern detection was proposed that is based on binary classification of outliers and on statistical tests applied to the resulting stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, because that method relies on binary classification of outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method for data streams that transforms real-valued outlier scores into multiple binary-valued data streams. The proposed method is tested on artificial and real data sets using three outlier detection methods: Isolation Forest (IF), Autoencoder-based outlier detection, and Local Outlier Factor (LOF). The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
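
One simple way to picture the transformation from real-valued outlier scores to multiple binary-valued streams is to threshold the same score stream at several cut-offs. The abstract does not say how the paper builds its binary streams, so the thresholding rule, the function name `to_binary_streams` and the example values below are purely hypothetical.

```python
def to_binary_streams(scores, thresholds):
    """Map one stream of real-valued outlier scores to one 0/1 stream per threshold.

    A label of 1 means "score above this threshold", i.e. treated as an outlier
    at that strictness level. This rule is an assumption for illustration only.
    """
    return {t: [1 if s > t else 0 for s in scores] for t in thresholds}


# Hypothetical scores, e.g. as produced by Isolation Forest or LOF.
scores = [0.1, 0.9, 0.4, 0.7]
streams = to_binary_streams(scores, thresholds=[0.3, 0.6])
for threshold, labels in streams.items():
    print(threshold, labels)
```

Each resulting binary stream could then be passed to a test that works on normal/outlier labels, which is the setting the earlier binary-classification method requires.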

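Urbanização extensiva e o processo de interiorização do estado de São Paulo: um enfoque contemporâneo [Extensive urbanization and the process of interiorization in the state of São Paulo: a contemporary approach]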

Revista Brasileira de Estudos Urbanos e Regionais ◽

10.22296/2317-1529.2013v15n2p179 ◽

2013 ◽

Vol 15 (2) ◽

pp. 179

Author(s):

Admir Antonio Betarelli Junior ◽

Roberto Luís De Melo Monte-Mór ◽

Rodrigo Ferreira Simões

Keyword(s):

Principal Component Analysis ◽

Cluster Analysis ◽

Multivariate Analysis ◽

Urban Areas ◽

Principal Component ◽

Competitive Advantages ◽

São Paulo ◽

Contemporary Approach

The aim of this paper is to discuss the formation, production and organization of urban space in the state of São Paulo (Brazil) in light of the movement of São Paulo's industry into the interior of the state in the late 1970s. The locus of analysis is industry, since in the contemporary approach the process of industrialization has always been linked to the production of urban spatiality. Combining the shift-share method (Esteban-Marquillas), Principal Component Analysis (PCA) and cluster analysis, we found evidence that this process resulted in the phenomenon of extensive urbanization. The "photographic" results indicate that there was a virtual extension of the general conditions of the urban-industrial fabric, such that polarizing centralities and their surrounding regions present locational and competitive advantages, thus forming urban agglomerations in the territory of São Paulo, mainly in the regions that benefited from the interiorization of industry. Keywords: extensive urbanization; internalization of the industry; shift-share; multivariate analysis; cluster analysis; São Paulo (Brazil).

A Novel Approach for Outlier Detection in Multivariate Data

Mathematical Problems in Engineering ◽

10.1155/2021/1899225 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Saima Afzal ◽

Ayesha Afzal ◽

Muhammad Amin ◽

Sehar Saleem ◽

Nouman Ali ◽

...

Keyword(s):

Outlier Detection ◽

Multivariate Data ◽

Evaluation Criteria ◽

Real Life ◽

Simulation Experiments ◽

Novel Approach ◽

Multiple Variables ◽

Components Analysis ◽

Detection Of Outliers ◽

F Measure

Outlier detection is a challenging task, especially when outliers are defined by rare combinations of multiple variables. In this paper, we develop and evaluate a new method for detecting outliers in multivariate data that relies on Principal Components Analysis (PCA) and three-sigma limits. The proposed approach employs PCA to perform dimension reduction by regenerating variables, i.e., fitted points from the original observations. Observations lying outside the three-sigma limits are identified as outliers. The method has been successfully applied to two real-life datasets and several artificially generated datasets. Its performance is compared with that of some existing methods using several evaluation criteria, including the percentage of correct classification, precision, recall, and F-measure. The superiority of the proposed method is confirmed by these criteria on these datasets. For the first real-life dataset, the F-measure is highest for the proposed method (0.6667), compared with 0.3333 and 0.4000 for the two existing approaches. Similarly, for the second real dataset, the F-measure is 0.8000 for the proposed approach and 0.5263 and 0.6315 for the two existing approaches. The simulation experiments also show that the performance of the proposed approach improves with increasing sample size.
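
A rough sketch of combining PCA with three-sigma limits is given below. It is one simple reading of the idea and should not be taken as the paper's procedure: it is restricted to two variables so that the eigen-decomposition has a closed form, it applies the limits to the principal component scores rather than to the paper's fitted points, and the names `pca_scores_2d` and `three_sigma_outliers` are assumptions.

```python
import math
from statistics import mean, stdev


def pca_scores_2d(points):
    """Principal component scores (PC1, PC2) for 2-D points.

    Uses the closed-form rotation angle of the 2x2 sample covariance matrix.
    """
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    centered = [(x - mx, y - my) for x, y in points]
    n = len(points)
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [(cx * c + cy * s, -cx * s + cy * c) for cx, cy in centered]


def three_sigma_outliers(points):
    """Indices of points whose score on either component lies outside +/- 3 SD.

    Scores are centered, so the limits are symmetric around zero.
    """
    scores = pca_scores_2d(points)
    flagged = set()
    for k in range(2):
        column = [score[k] for score in scores]
        sd = stdev(column)
        flagged.update(i for i, v in enumerate(column) if sd > 0 and abs(v) > 3 * sd)
    return sorted(flagged)


# Hypothetical data: 21 points on a line plus one point far off the line.
points = [(float(i), 0.0) for i in range(-10, 11)] + [(0.0, 8.0)]
print(three_sigma_outliers(points))
```

Because the mean and standard deviation are themselves inflated by outliers, three-sigma limits can miss outliers in small samples, which is consistent with the abstract's remark that performance improves with sample size.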

Outlier Detection Using Association Rule Mining and Cluster Analysis

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i6.529533 ◽

2018 ◽

Vol 6 (6) ◽

pp. 529-533

Author(s):

C. Leela Krishna ◽

C. Kala Krishna

Keyword(s):

Cluster Analysis ◽

Outlier Detection ◽

Association Rule ◽

Association Rule Mining ◽

Rule Mining

Data Streams Oriented Outlier Detection Method: A Fast Minimal Infrequent Pattern Mining

The International Arab Journal of Information Technology ◽

10.34028/iajit/18/6/14 ◽

2021 ◽

Author(s):

ZhongYu Zhou ◽

DeChang Pi

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Pattern Mining ◽

Detection Method ◽

Detection Algorithm ◽

Detection Methods ◽

Mining Method ◽

Telemetry Data ◽

Process Data ◽

Mining Data Streams

Outlier detection is a common method for analyzing data streams. Most existing outlier detection methods compute distances between points to solve specific outlier detection problems. However, these methods are computationally expensive and cannot process data streams quickly. Outlier detection based on pattern mining addresses these issues, but the existing pattern-mining methods are inefficient and cannot meet the requirement of mining data streams quickly. To improve efficiency, this paper proposes a new outlier detection method. First, a fast minimal infrequent pattern mining method is proposed to mine minimal infrequent patterns from data streams. Second, an efficient outlier detection algorithm based on minimal infrequent patterns is proposed for detecting outliers in data streams. The proposed algorithm is demonstrated on real telemetry data from a satellite in orbit. The experimental results show that the proposed method not only can be applied to satellite outlier detection but also outperforms the existing methods.
