A Review on Outliers-Detection Methods for Multivariate Data

2021 ◽  
Vol 3 (1) ◽  
pp. 1-15
Author(s):  
Sharifah Sakinah Syed Abd Mutalib ◽  
Siti Zanariah Satari ◽  
Wan Nur Syahidah Wan Yusoff

Data in practice are often high-dimensional and multivariate in nature. Detecting outliers has long been one of the problems of multivariate analysis: it is difficult, and graphical inspection alone is not sufficient. This paper gives a brief, nontechnical review of outlier detection methods for multivariate data, namely the projection pursuit method, methods based on robust distance, and cluster analysis. The strengths and weaknesses of each method are briefly discussed.
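
As an illustration of the projection pursuit idea named in this abstract, the following is a minimal sketch, not the reviewed paper's own procedure. It assumes 2-D data and a Stahel-Donoho-style outlyingness score: each point is projected onto many directions, and its score is the largest distance from the projected median, measured in units of the median absolute deviation (MAD). The function names, the number of directions, and the sample data are all assumptions made for the example.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation (unscaled)."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def projection_outlyingness(points, n_directions=180):
    """Stahel-Donoho-style outlyingness for 2-D points.

    For each unit direction, project all points onto it and measure how far
    each projection lies from the projected median, in units of the MAD.
    A point's score is its largest such value over all directions.
    """
    scores = [0.0] * len(points)
    for k in range(n_directions):
        # A half circle is enough: a direction and its opposite give the same score.
        angle = math.pi * k / n_directions
        ax, ay = math.cos(angle), math.sin(angle)
        proj = [ax * x + ay * y for x, y in points]
        med = statistics.median(proj)
        spread = mad(proj)
        if spread == 0:  # degenerate direction, skip it
            continue
        for i, p in enumerate(proj):
            scores[i] = max(scores[i], abs(p - med) / spread)
    return scores


# Hypothetical data: points close to the line y = x, plus one point (2, -2)
# whose x and y values are each unremarkable on their own.
offsets = [0.2, -0.1, 0.3, -0.3, 0.1, 0.0, -0.2, 0.25, -0.15, 0.05, -0.25]
points = [(float(i), i + e) for i, e in zip(range(-5, 6), offsets)]
points.append((2.0, -2.0))

scores = projection_outlyingness(points)
print(max(range(len(points)), key=lambda i: scores[i]))  # index of the most outlying point
```

The appended point is unremarkable in either coordinate taken alone and only stands out in a projection that combines both, which is why inspecting one variable at a time can miss such points.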

2006 ◽  
Vol 131 (6) ◽  
pp. 770-779 ◽  
Author(s):  
Santiago Pereira-Lorenzo ◽  
María Belén Díaz-Hernández ◽  
Ana María Ramos-Cabrer

Morphological characters (six traits) and isozymes (four systems, five loci) were used to discriminate between Spanish chestnut cultivars (Castanea sativa Mill.) from the Iberian Peninsula. A total of 701 accessions (representing 168 local cultivars) were analyzed from collections made between 1989 and 2003 in the main chestnut growing areas: 31 were from Andalucía (12 cultivars), 293 from Asturias (65 cultivars), 25 from Castilla-León (nine cultivars), four from Extremadura (two cultivars) and 348 from Galicia (80 cultivars). Data were synthesized using multivariate analysis, principal component analysis, and cluster analysis. A total of 152 Spanish cultivars were verified: 58 cultivars of major importance and 94 of minor importance, of which 18 had high intracultivar variation. Thirty-seven cultivars were clustered into 14 synonymous groups. Six of these were from Galicia, one from Castilla-León (El Bierzo), four from Asturias, one from Asturias and Castilla-León (El Bierzo), and two from Asturias, Castilla-León (El Bierzo), and Galicia. The chestnut cultivars from Galicia and Asturias were undifferentiated in genetic terms, indicating that they are not genetically isolated. Overall, chestnut cultivars from southern Spain showed the least variation. Many Spanish cultivars (58%) produced more than 100 nuts/kg; removing this low market-value character will be a high priority. The data obtained will be of use in chestnut breeding programs in Spain and elsewhere.


2020 ◽  
Vol 4 (Supplement_2) ◽  
pp. 1174-1174
Author(s):  
Paraskevi Massara ◽  
Robert Bandsma ◽  
Celine Bourdon ◽  
Jonathon Maguire ◽  
Elena Comelli ◽  
...  

Objectives: Eliminating anthropometric measurement error and employing detection methods for outliers and biologically implausible values (BIV) that are adapted to longitudinal measurements is important for the study of growth. This work aimed to review and assess the accuracy of the available BIV and outlier detection methods and to propose a growth trajectory outlier detection method. Methods: We included 2354 infants from the Applied Research Group for Kids (TARGet Kids!) cohort, based in Toronto (ON, Canada), which recruits healthy children from birth to 5 years of age. We considered infants with at least 8 length and weight measurements available between the 1st and the 24th month of age. Weight-for-length z-scores (wflz) were calculated using the WHO growth standards. Outlier measurements were randomly introduced in 5% of the wflz measurements using a normal distribution (μ = 0, σ = 1). We employed 4 outlier detection methods: an empirical detection method for BIV using the cut-offs derived from the WHO Child Growth Standards, a clustering method, a method based on cluster prototypes for individual outlier measurements, and a method based on cluster prototypes for entire growth trajectories. Each method was applied individually and evaluated using sensitivity and specificity with respect to the manually introduced outliers. We also calculated the Kappa statistic to evaluate the agreement of each method with the manually introduced outliers. Results: After excluding premature (<37 weeks) and low birth weight (<1500 g) neonates and children with missing length and weight measurements, we analyzed 393 children with a total of 3144 measurements. Sensitivity and specificity for the four methods ranged between 4.4%–55.0% and 83.7%–99.7%, respectively, with kappa being non-significant (P > 0.05) only for the empirical method. The clustering detection method reported a higher finding rate, while the empirical method found most of the BIV but few of the remaining outliers.
Conclusions: BIV account for a small portion of the possible outliers in growth datasets. We show that additional statistical or model-based methods are required for a more comprehensive outlier detection process, which has implications for growth analysis and nutritional assessment. Funding Sources: Joannah and Brian Lawson Center for Child Nutrition, Connaught Fund, Onassis Foundation.
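
The evaluation described in this abstract (sensitivity, specificity, and the Kappa statistic against the injected outliers) can be sketched as follows. This is a minimal illustration using the standard definitions, not the study's code. The function name and the example labels are assumptions, and the sketch computes Cohen's kappa only, without the significance test that the abstract reports.

```python
def evaluate_detector(truth, flagged):
    """Compare detector flags with known (injected) outlier labels.

    Both arguments are equal-length sequences of booleans.
    Returns (sensitivity, specificity, kappa).
    Assumes at least one true outlier and one true non-outlier, and that
    chance agreement is below 1, so that no denominator is zero.
    """
    pairs = list(zip(truth, flagged))
    tp = sum(1 for t, f in pairs if t and f)
    tn = sum(1 for t, f in pairs if not t and not f)
    fp = sum(1 for t, f in pairs if not t and f)
    fn = sum(1 for t, f in pairs if t and not f)
    n = tp + tn + fp + fn

    sensitivity = tp / (tp + fn)  # share of true outliers that were flagged
    specificity = tn / (tn + fp)  # share of normal values left unflagged

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return sensitivity, specificity, kappa


# Hypothetical labels: 2 injected outliers among 10 measurements.
truth = [True, True] + [False] * 8
flagged = [True, False] + [False] * 7 + [True]
print(evaluate_detector(truth, flagged))
```

A detector can score high on specificity simply by flagging very little, which is why the abstract reports sensitivity and kappa alongside it.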


2021 ◽  
Vol 2138 (1) ◽  
pp. 012013
Author(s):  
Yongzhi Chen ◽  
Ziao Xu ◽  
Chaoqun Niu

In research on flash flood disaster monitoring and early warning, the Internet of Things is widely used for real-time information collection. The large volumes of data collected by sensors contain anomalies such as noise, duplicates, and errors, which lead to false alarms, lower prediction accuracy, and other problems. Exploiting the fact that outliers in a sensor data stream cause obvious fluctuations in information entropy, this paper proposes a local outlier detection method based on information entropy and optimized with a sliding window and LOF (Local Outlier Factor). The method can be used to improve data quality and thereby the accuracy of disaster prediction. It is applied to the data stream of a water sensor, and the experimental results show that it can detect outliers accurately. Compared with existing detection methods that rely only on distances between data points, the detection rate is improved and the false positive rate is reduced.
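
The entropy signal this abstract relies on can be illustrated with a short sketch. It is not the paper's algorithm: it only computes Shannon entropy over a sliding window, and it assumes fixed-width binning of the readings. The function names, the bin width, the window length, and the sample stream are all hypothetical, and the LOF step is left out.

```python
import math
from collections import Counter


def shannon_entropy(values, bin_width):
    """Shannon entropy (in bits) of values after binning to a fixed width."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_entropies(stream, window, bin_width):
    """Entropy of each length-`window` sliding window over the stream."""
    return [
        shannon_entropy(stream[i:i + window], bin_width)
        for i in range(len(stream) - window + 1)
    ]


# Hypothetical water-level readings with one spurious spike at index 5.
readings = [10.1, 10.3, 10.2, 10.4, 10.2, 25.7, 10.3, 10.1, 10.2, 10.4]
entropies = window_entropies(readings, window=4, bin_width=1.0)
suspect_windows = [i for i, h in enumerate(entropies) if h > 0]
print(suspect_windows)
```

Windows whose entropy jumps would then be candidates for a finer, per-point check such as LOF. How the two stages are actually combined is specific to the paper and is not reproduced here.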


Author(s):  
Taegong Kim ◽  
Cheong Hee Park

Anomaly pattern detection in a data stream aims to detect the time point at which outliers begin to occur abnormally. Recently, a method for anomaly pattern detection was proposed that combines binary classification of outliers with statistical tests on the resulting stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, because that method is based on binary classification of outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used with it directly. In this paper, we propose an anomaly pattern detection method for data streams that transforms real-valued outlier scores into multiple binary-valued data streams. The proposed method is tested on artificial and real data sets using three outlier detection methods: Isolation Forest (IF), autoencoder-based outlier detection, and Local Outlier Factor (LOF). The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
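
The transformation described in this abstract, from one stream of real-valued scores to several binary streams, can be sketched as below. This is one simple reading and not the authors' construction: it assumes one fixed threshold per binary stream, and it uses a windowed outlier rate as a stand-in for the statistical test, which the abstract does not specify. All names, thresholds, and values are hypothetical.

```python
def to_binary_streams(scores, thresholds):
    """Turn one stream of real-valued outlier scores into one 0/1 stream per threshold.

    A value is labeled 1 (outlier) in a stream when its score exceeds
    that stream's threshold.
    """
    return {t: [1 if s > t else 0 for s in scores] for t in thresholds}


def windowed_outlier_rate(labels, window):
    """Fraction of outlier labels in each sliding window of a binary stream."""
    return [
        sum(labels[i:i + window]) / window
        for i in range(len(labels) - window + 1)
    ]


# Hypothetical outlier scores, e.g. from Isolation Forest or LOF.
scores = [0.1, 0.2, 0.9, 0.4, 0.8, 0.95]
streams = to_binary_streams(scores, thresholds=(0.5, 0.85))
for threshold, labels in streams.items():
    print(threshold, labels, windowed_outlier_rate(labels, window=3))
```

Using several thresholds avoids committing to a single cut-off on the scores: a rise in the outlier rate can be looked for in each binary stream separately.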


2013 ◽  
Vol 15 (2) ◽  
pp. 179
Author(s):  
Admir Antonio Betarelli Junior ◽  
Roberto Luís De Melo Monte-Mór ◽  
Rodrigo Ferreira Simões

The purpose of this paper is to discuss the formation, production, and organization of urban space in the state of São Paulo (Brazil), starting from the interiorization of São Paulo's industry (its shift toward the interior of the state) in the late 1970s. The locus of analysis is industry, since in the contemporary approach the process of industrialization has always been linked to the production of urban spatiality. Combining the shift-share (differential-structural) method (Esteban-Marquillas), Principal Component Analysis (PCA), and cluster analysis, we found evidence that this process resulted in the phenomenon of extensive urbanization. The "photographic" results indicate that there was a virtual extension of the general conditions of the urban-industrial fabric, such that polarizing centralities and their surrounding regions present locational and competitive advantages, thus forming urban agglomerations in the territory of São Paulo, mainly in the regions that benefited from the interiorization of industry. Keywords: extensive urbanization; interiorization of industry; shift-share; multivariate analysis; cluster analysis; industry; São Paulo (Brazil).


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Saima Afzal ◽  
Ayesha Afzal ◽  
Muhammad Amin ◽  
Sehar Saleem ◽  
Nouman Ali ◽  
...  

Outlier detection is a challenging task, especially when outliers are defined by rare combinations of multiple variables. In this paper, we develop and evaluate a new method for detecting outliers in multivariate data that relies on Principal Components Analysis (PCA) and three-sigma limits. The proposed approach employs PCA to perform dimension reduction by regenerating variables, i.e., fitted points from the original observations. Observations lying outside the three-sigma limits are identified as outliers. The proposed method has been applied successfully to two real-life datasets and several artificially generated datasets. Its performance is compared with that of some existing methods using several evaluation criteria, including the percentage of correct classification, precision, recall, and F-measure. The superiority of the proposed method is confirmed by these criteria and datasets. For the first real-life dataset, the F-measure is highest for the proposed method, at 0.6667, compared with 0.3333 and 0.4000 for the two existing approaches. Similarly, for the second real-life dataset, the F-measure is 0.8000 for the proposed approach and 0.5263 and 0.6315 for the two existing approaches. The simulation experiments also show that the performance of the proposed approach improves with increasing sample size.
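
One plausible reading of the PCA-plus-three-sigma idea in this abstract is sketched below. The abstract does not say which quantity the three-sigma limits are applied to, so the choice here is an assumption: each 2-D observation is projected onto the first principal component to get a fitted point, and the distance between observation and fitted point is compared with the mean distance plus three standard deviations. Because distances are non-negative, only the upper limit is checked. The function name and the data are hypothetical.

```python
import math
import statistics


def pca_three_sigma_outliers(points):
    """Return indices of 2-D points whose distance from the first principal
    axis is more than three standard deviations above the mean distance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    # Closed-form direction of the first principal component
    # of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    # Fitted point = projection onto the first component;
    # residual = distance between the observation and its fitted point.
    residuals = []
    for x, y in points:
        t = (x - mx) * ux + (y - my) * uy
        fx, fy = mx + t * ux, my + t * uy
        residuals.append(math.hypot(x - fx, y - fy))

    upper = statistics.mean(residuals) + 3 * statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if r > upper]


# Hypothetical data: 30 points near the line y = x and one point far from it.
points = [(float(i), i + (0.1 if i % 2 == 0 else -0.1)) for i in range(30)]
points.append((15.0, 23.0))
print(pca_three_sigma_outliers(points))
```

With a sample standard deviation, no observation can lie more than (n - 1) / sqrt(n) standard deviations from the mean, so a three-sigma rule cannot flag anything in a sample of fewer than 11 points. This is consistent with the abstract's remark that performance improves as the sample size grows.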


Author(s):  
ZhongYu Zhou ◽  
DeChang Pi

Outlier detection is a common method for analyzing data streams. Most existing outlier detection methods compute distances between points to solve specific outlier detection problems. However, these methods are computationally expensive and cannot process data streams quickly. Outlier detection based on pattern mining addresses these issues, but the existing pattern-based methods are inefficient and cannot meet the requirement of mining data streams quickly. To improve efficiency, this paper proposes a new outlier detection method. First, a fast mining method is proposed to extract minimal infrequent patterns from data streams. Second, an efficient outlier detection algorithm is proposed that uses the mined minimal infrequent patterns to detect outliers in the data streams. The proposed algorithm is demonstrated on real telemetry data from a satellite in orbit. The experimental results show that the proposed method can be applied to satellite outlier detection and is superior to the existing methods.
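
To make the notion of a minimal infrequent pattern concrete, here is a brute-force sketch. It is not the fast mining method proposed in the paper and it does not handle streaming. It assumes each record has already been discretized into a set of items, and it scores a record by how many minimal infrequent patterns it contains. The function names, the `max_size` cap, and the scoring rule are assumptions made for the example.

```python
from itertools import combinations


def minimal_infrequent_patterns(transactions, min_support, max_size=3):
    """Brute-force search for minimal infrequent itemsets.

    An itemset is infrequent when fewer than `min_support` transactions
    contain it, and minimal when every proper subset is frequent.
    """
    items = sorted(set().union(*transactions))

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    minimal = []
    # Visiting candidates by increasing size guarantees that any infrequent
    # proper subset has already been recorded before its supersets are seen.
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if any(m <= candidate for m in minimal):
                continue  # contains a smaller infrequent pattern, so not minimal
            if support(candidate) < min_support:
                minimal.append(candidate)
    return minimal


def outlier_scores(transactions, patterns):
    """Score each transaction by how many minimal infrequent patterns it contains."""
    return [sum(1 for p in patterns if p <= t) for t in transactions]


# Hypothetical discretized records.
records = [
    {"a", "b"}, {"a", "b"}, {"a", "b", "c"},
    {"a", "c"}, {"b", "c"}, {"a", "b", "d"},
]
patterns = minimal_infrequent_patterns(records, min_support=2)
print(outlier_scores(records, patterns))
```

The brute-force enumeration grows combinatorially with the number of items. That cost is the inefficiency the abstract refers to, and the paper's contribution lies in avoiding it.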

