ESTIMATION OF EXTREME QUANTILES: EMPIRICAL TOOLS FOR METHODS ASSESSMENT AND COMPARISON

Author(s):  
J. DIEBOLT ◽  
M.-A. EL-AROUI ◽  
V. DURBEC ◽  
B. VILLAIN

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimates. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQs) in a nonparametric spirit, e.g., Pickands' excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique is available for assessing and comparing these MEEQs when they are to be used on a given data set. This paper is a first attempt to provide such techniques. We first compare the estimates given by the main MEEQs on several simulated data sets. Then we suggest goodness-of-fit (GoF) tests to assess the MEEQs by measuring the quality of their underlying approximations. It is shown that GoF techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQs are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are tested on real data sets arising from an industrial context where extreme quantiles are needed to define maintenance policies.
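To make the flavour of these extrapolation methods concrete, the sketch below implements a simple exponential-tail (ET) style estimator in Python: excesses over a high threshold are modelled as exponential and the target quantile is extrapolated from the fitted scale. It is a minimal illustration under that assumption, not the authors' implementation; the threshold quantile and the simulated example are arbitrary choices.

```python
import numpy as np

def et_quantile(data, p, threshold_quantile=0.90):
    """Exponential-tail (ET) style extreme-quantile estimate (illustrative sketch).

    Excesses over a high threshold u are treated as exponential with scale
    sigma (estimated by their mean), and the p-quantile is extrapolated as
    u + sigma * log(N_u / (n * (1 - p))).
    """
    x = np.asarray(data, dtype=float)
    n = x.size
    u = np.quantile(x, threshold_quantile)   # high threshold
    excesses = x[x > u] - u                  # exceedances over u
    sigma = excesses.mean()                  # MLE of the exponential scale
    n_u = excesses.size
    return u + sigma * np.log(n_u / (n * (1.0 - p)))

# Example on a simulated data set: the true 0.999 quantile of Exp(scale=2) is about 13.8
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=5000)
print(et_quantile(sample, p=0.999))
```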

Author(s):  
R.V. Dutaut ◽  
D. Marcotte

SYNOPSIS In most exploration or mining grade data-sets, the presence of outliers or extreme values represents a significant challenge to mineral resource estimators. The most common practice is to cap the extreme values at a predefined level. A new capping approach is presented that uses QA/QC coarse duplicate data correlation to predict the real data coefficient of variation (i.e., error-free CV). The cap grade is determined such that the capped data has a CV equal to the predicted CV. The robustness of the approach with regard to original core assay length decisions, departure from lognormality, and capping before or after compositing is assessed using simulated data-sets. Real case studies of gold and nickel deposits are used to compare the proposed approach to the methods most widely used in industry. The new approach is simple and objective. It provides a cap grade that is determined automatically, based on predicted CV, and takes into account the quality of the assay procedure as determined by coarse duplicates correlation. Keywords: geostatistics, outliers, capping, duplicates, QA/QC, lognormal distribution.
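As a rough illustration of the core idea, the sketch below (a simplified, assumed version of the procedure, not the authors' implementation) scans candidate cap grades from the top down and keeps the first one that brings the coefficient of variation of the capped data down to a target, error-free CV; in the paper that target would be predicted from the correlation of coarse QA/QC duplicates.

```python
import numpy as np

def cap_grade_for_target_cv(grades, target_cv):
    """Return the highest cap grade whose capped data reach the target CV.

    Illustrative sketch only: candidate caps are scanned from the highest
    grade downward, and the first cap for which the coefficient of variation
    (std/mean) of the capped data falls to or below target_cv is returned.
    """
    grades = np.asarray(grades, dtype=float)
    for cap in np.sort(grades)[::-1]:
        capped = np.minimum(grades, cap)
        cv = capped.std(ddof=1) / capped.mean()
        if cv <= target_cv:
            return cap
    return grades.min()   # degenerate case: target CV unreachable

# Example with a skewed (lognormal) grade distribution
rng = np.random.default_rng(1)
grades = rng.lognormal(mean=0.0, sigma=1.2, size=2000)
print(cap_grade_for_target_cv(grades, target_cv=1.5))
```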


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze the growth of a latent trait variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: an autoregressive error structure of degree one for the trait residuals, and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.
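As a rough, assumed illustration of the data structure such a model addresses (not the paper's WinBUGS code), the sketch below simulates ordinal item responses nested within time points nested within subjects, with a factor-analytic measurement part and first-order autoregressive trait residuals; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

n_subj, n_time, n_items = 100, 4, 3
time_scores = np.array([0.0, 1.0, 2.0, 3.0])   # fixed here; estimated in the paper
loadings = np.array([1.0, 0.8, 1.2])            # factor-analytic item loadings
thresholds = np.array([-1.0, 0.0, 1.0])         # cut-points for 4-category ordinal items
rho, sigma_e = 0.5, 0.7                          # AR(1) structure for trait residuals

intercept = rng.normal(0.0, 1.0, n_subj)         # subject-level growth parameters
slope = rng.normal(0.5, 0.3, n_subj)

eta = np.empty((n_subj, n_time))                 # latent trait, subject x time
resid = rng.normal(0.0, sigma_e, n_subj)
for t in range(n_time):
    if t > 0:
        resid = rho * resid + rng.normal(0.0, sigma_e * np.sqrt(1 - rho**2), n_subj)
    eta[:, t] = intercept + slope * time_scores[t] + resid

# Ordinal responses: threshold a logistic variable around loading * trait
latent = eta[:, :, None] * loadings + rng.logistic(size=(n_subj, n_time, n_items))
y = (latent[..., None] > thresholds).sum(axis=-1)   # categories 0..3, items nested in time
print(y.shape)   # (100, 4, 3)
```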


2011 ◽  
Vol 76 (3) ◽  
pp. 547-572 ◽  
Author(s):  
Charles Perreault

I examine how our capacity to produce accurate culture-historical reconstructions changes as more archaeological sites are discovered, dated, and added to a data set. More precisely, I describe, using simulated data sets, how increases in the number of known sites affect the accuracy and precision of our estimates of (1) the earliest and (2) latest date of a cultural tradition, (3) the date and (4) magnitude of its peak popularity, as well as (5) its rate of spread and (6) disappearance in a population. I show that the accuracy and precision of inferences about these six historical processes are not affected in the same fashion by changes in the number of known sites. I also consider the impact of two simple taphonomic site-destruction scenarios on the results. Overall, the results presented in this paper indicate that unless we are in possession of near-total samples of sites, and can be certain that there are no taphonomic biases in the universe of sites to be sampled, we will make inferences of varying precision and accuracy depending on the aspect of a cultural trait's history in question.
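The core sampling issue can be illustrated with a toy simulation (my own assumed set-up, not the author's design): the earliest and latest dates of a tradition are estimated by the minimum and maximum dates among known sites, and both estimates are biased inward when few sites are known.

```python
import numpy as np

rng = np.random.default_rng(3)

true_start, true_end = -3000, -1500                    # calendar years of the tradition
universe = rng.uniform(true_start, true_end, 10_000)   # dates of all sites ever produced

for n_known in (10, 100, 1000):
    known = rng.choice(universe, size=n_known, replace=False)
    start_error = known.min() - true_start             # shrinks as more sites become known
    end_error = true_end - known.max()
    print(f"{n_known:5d} sites: start error {start_error:6.1f} yr, end error {end_error:6.1f} yr")
```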


2018 ◽  
Vol 9 (2) ◽  
pp. 170-188 ◽  
Author(s):  
Miguel Torres-Ruiz ◽  
Marco Moreno-Ibarra ◽  
Wadee Alhalabi ◽  
Rolando Quintero ◽  
Giovanni Guzmán

Purpose: To date, the simulation of pedestrian behavior is used to support the design and analysis of urban infrastructure and public facilities. The purpose of this paper is to present a microscopic model that describes pedestrian behavior in a two-dimensional space. It is based on multi-agent systems and cellular automata theory. The concept of layered-intelligent terrain from the video game industry is reused, and concepts such as tracing, evasion and rejection effects related to pedestrian interactive behavior are involved. In a simulation scenario, an agent represents a pedestrian with homogeneous physical characteristics such as walking speed and height. The agents move through a discrete space formed by a lattice of hexagonal cells, where each cell can contain at most one agent at a time. The model was validated using a test composed of 17 real data sets of unidirectional pedestrian flow. Each data set was extracted from laboratory-controlled scenarios carried out with up to 400 people walking through a corridor whose configuration changed from one experiment to another in the widths of its entrance and exit doors. Moreover, each data set contained different groups of coordinates that compose pedestrian trajectories. The scenarios were replicated and simulated using the proposed model, obtaining 17 simulated data sets. In addition, a measurement methodology based on Voronoi diagrams was used to compute the velocity, density and specific flow of pedestrians to build a time-series graphic and a set of heat maps for each of the real and simulated data sets.

Design/methodology/approach: The approach combines a multi-agent system with cellular automata theory. The obtained results were compared with other studies, and a statistical analysis based on similarity measurement is presented.

Findings: A microscopic mobility model that describes pedestrian behavior in a two-dimensional space is presented. It is based on multi-agent systems and cellular automata theory. The concept of layered-intelligent terrain from the video game industry is reused, and concepts such as tracing, evasion and rejection effects related to pedestrian interactive behavior are involved. On average, the simulated data sets are 82 per cent similar to the real data sets in density and 62 per cent similar in velocity. It was observed that the relation between velocity and density from the real scenarios could not be replicated.

Research limitations/implications: The main limitations concern the simulated speeds. Although the obtained results behave similarly to reality, more variables need to be introduced into the model to improve its precision and calibration. Another limitation is dimensionality: at present the simulation is two-dimensional. Increasing the cell resolution, so that a pedestrian occupies several cells at the same time, and extending the terrain to three dimensions remain open challenges.

Practical implications: In total, 17 data sets were generated as a case study. They contain information related to speed, trajectories, and initial and ending points. The data sets were used to calibrate the model and analyze the behavior of pedestrians. Geospatial data were used to simulate the public infrastructure in which pedestrians navigate, taking into account the initial and ending points.

Social implications: The social impact is directly related to the analysis of pedestrian behavior to learn tendencies, trajectories and other features that help improve public facilities. The results could be used to generate policies oriented toward developing more awareness in public infrastructure development.

Originality/value: The general methodology is the main value of this work. Several approaches were used, designed and implemented for analyzing pedestrian behavior. In addition, all the methods were implemented as a plug-in for Quantum GIS. The analysis was described with heat maps and statistical approaches. Moreover, the obtained results focus on density, speed and the relationship between these features.
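A minimal, assumed sketch of the kind of cellular-automata update such a model uses (not the authors' implementation, which adds tracing, evasion and rejection effects): agents on a hexagonal lattice, one per cell, each move one cell per tick toward a goal while avoiding occupied cells.

```python
# Axial coordinates for a hexagonal lattice; each cell holds at most one agent.
HEX_NEIGHBOURS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def hex_distance(a, b):
    """Distance in cells between two axial coordinates."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2

def step(agents, goal):
    """Move each agent to a free neighbouring cell that gets it closer to the goal."""
    occupied = set(agents)
    moved = []
    for pos in agents:
        free = [(pos[0] + dq, pos[1] + dr) for dq, dr in HEX_NEIGHBOURS
                if (pos[0] + dq, pos[1] + dr) not in occupied]
        if free:
            best = min(free, key=lambda c: hex_distance(c, goal))
            if hex_distance(best, goal) < hex_distance(pos, goal):
                occupied.discard(pos)
                occupied.add(best)
                pos = best
        moved.append(pos)
    return moved

agents = [(0, 0), (1, 0), (0, 1)]
for _ in range(5):
    agents = step(agents, goal=(6, 0))
print(agents)
```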


IUCrJ ◽  
2015 ◽  
Vol 2 (3) ◽  
pp. 352-360 ◽  
Author(s):  
Petr V. Konarev ◽  
Dmitri I. Svergun

Small-angle X-ray and neutron scattering (SAXS and SANS) experiments on solutions provide rapidly decaying scattering curves, often with a poor signal-to-noise ratio, especially at higher angles. On modern instruments, the noise is partially compensated for by oversampling, thanks to the fact that the angular increment in the data is small compared with that needed to describe adequately the local behaviour and features of the scattering curve. Given a (noisy) experimental data set, an important question arises as to which part of the data still contains useful information and should be taken into account for the interpretation and model building. Here, it is demonstrated that, for monodisperse systems, the useful experimental data range is defined by the number of meaningful Shannon channels that can be determined from the data set. An algorithm to determine this number and thus the data range is developed, and it is tested on a number of simulated data sets with various noise levels and with different degrees of oversampling, corresponding to typical SAXS/SANS experiments. The method is implemented in a computer program and examples of its application to analyse the experimental data recorded under various conditions are presented. The program can be employed to discard experimental data containing no useful information in automated pipelines, in modelling procedures, and for data deposition or publication. The software is freely accessible to academic users.
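As a back-of-the-envelope illustration (not the authors' algorithm, which determines the number of *meaningful* channels supported by the noise level), the sketch below counts how many Shannon channels of width π/D_max a measured angular range covers for a particle of maximum dimension D_max.

```python
import numpy as np

def shannon_channels(q_min, q_max, d_max):
    """Number of Shannon channels spanned by the range [q_min, q_max].

    The Shannon sampling interval in momentum transfer is pi / d_max for a
    particle of maximum dimension d_max, so the covered range corresponds
    to roughly (q_max - q_min) * d_max / pi channels.
    """
    return (q_max - q_min) * d_max / np.pi

# Example: a typical SAXS range of 0.01-0.5 A^-1 for a 100 A particle
print(round(shannon_channels(0.01, 0.5, d_max=100.0), 1))   # about 15.6 channels
```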


2021 ◽  
Vol 9 (1) ◽  
pp. 62-81
Author(s):  
Kjersti Aas ◽  
Thomas Nagler ◽  
Martin Jullum ◽  
Anders Løland

Abstract In this paper the goal is to explain predictions from complex machine learning models. One method that has become very popular during the last few years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features are in reality dependent, this may lead to incorrect explanations. Hence, there have recently been attempts at appropriately modelling/estimating the dependence between the features. Although the previously proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions that are able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
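The sketch below illustrates, in Python, the general recipe the paper builds on: Shapley values where each coalition's contribution is a conditional expectation of the model output, estimated by Monte Carlo draws of the out-of-coalition features given the in-coalition ones. For brevity the conditional sampler is an exact bivariate-Gaussian conditional rather than a vine copula; it is an assumed stand-in, not the authors' implementation.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(4)
RHO = 0.8   # correlation of the two toy features

def gaussian_conditional_sampler(S, x_star, n_mc):
    """Draw the feature outside S given the observed value of the feature in S
    (bivariate standard Gaussian); a stand-in for a vine-copula conditional sampler."""
    cov = np.array([[1.0, RHO], [RHO, 1.0]])
    if len(S) == 0:
        return rng.multivariate_normal(np.zeros(2), cov, size=n_mc)
    j = next(iter(S))
    k = 1 - j
    out = np.empty((n_mc, 2))
    out[:, j] = x_star[j]
    out[:, k] = rng.normal(RHO * x_star[j], np.sqrt(1 - RHO**2), size=n_mc)
    return out

def shapley_values(f, x_star, sampler, n_features, n_mc=20000):
    """Exact Shapley weights over coalitions; v(S) = E[f(X) | X_S = x*_S] by Monte Carlo."""
    def v(S):
        if len(S) == n_features:
            return float(f(x_star[None, :]).mean())
        draws = sampler(S, x_star, n_mc)
        draws[:, list(S)] = x_star[list(S)]
        return float(f(draws).mean())

    phi = np.zeros(n_features)
    for j in range(n_features):
        for k in range(n_features):
            for S in itertools.combinations(set(range(n_features)) - {j}, k):
                w = math.factorial(k) * math.factorial(n_features - k - 1) / math.factorial(n_features)
                phi[j] += w * (v(set(S) | {j}) - v(set(S)))
    return phi

# Linear model with dependent features: the exact values are phi = (1.6, 1.4) at x* = (1, 1)
f = lambda X: 2.0 * X[:, 0] + 1.0 * X[:, 1]
print(shapley_values(f, np.array([1.0, 1.0]), gaussian_conditional_sampler, n_features=2))
```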


2020 ◽  
Vol 21 (2) ◽  
Author(s):  
Bogumiła Hnatkowska ◽  
Zbigniew Huzar ◽  
Lech Tuzinkiewicz

A conceptual model is a high-level, graphical representation of a specific domain, presenting its key concepts and the relationships between them. In particular, these dependencies can be inferred from concepts' instances contained in big raw data files. The paper aims to propose a method for constructing a conceptual model from data frames encompassed in data files. The result is presented in the form of a class diagram. The method is explained with several examples and verified by a case study in which real data sets are processed. It can also be applied to check the quality of a data set.
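A minimal, assumed sketch of the idea (not the authors' method): treat each data frame as a candidate class, its columns as attributes, and a column whose values are contained in another frame's unique-key column as an association, then emit the result as PlantUML class-diagram text.

```python
import pandas as pd

def infer_class_diagram(frames):
    """Infer a rough class diagram from a dict of named pandas DataFrames."""
    lines = ["@startuml"]
    keys = {}
    for name, df in frames.items():
        attrs = "\n".join(f"  {col} : {df[col].dtype}" for col in df.columns)
        lines.append(f"class {name} {{\n{attrs}\n}}")
        keys[name] = [c for c in df.columns if df[c].is_unique]   # candidate identifiers
    for name, df in frames.items():
        for other, key_cols in keys.items():
            if other == name:
                continue
            for key in key_cols:
                # shared column name whose values all occur in the other frame's key
                if key in df.columns and set(df[key].dropna()).issubset(set(frames[other][key])):
                    lines.append(f"{name} --> {other} : {key}")
    lines.append("@enduml")
    return "\n".join(lines)

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 20]})
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["Ann", "Bob"]})
print(infer_class_diagram({"Order": orders, "Customer": customers}))
```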


2021 ◽  
Vol 12 ◽  
Author(s):  
Li Xu ◽  
Yin Xu ◽  
Tong Xue ◽  
Xinyu Zhang ◽  
Jin Li

Motivation: The emergence of single-cell RNA sequencing (scRNA-seq) technology has paved the way for measuring RNA levels at single-cell resolution to study precise biological functions. However, the presence of a large number of missing values in these data affects downstream analysis. This paper presents AdImpute, an imputation method based on semi-supervised autoencoders. The method uses the results of another imputation method (DrImpute is used as an example) as imputation weights for the autoencoder, and applies a cost function with these imputation weights to learn the latent information in the data and achieve more accurate imputation. Results: As shown in clustering experiments with the simulated data sets and the real data sets, AdImpute is more accurate than four other publicly available scRNA-seq imputation methods, and minimally modifies the biologically silent genes. Overall, AdImpute is an accurate and robust imputation method.
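The flavour of such a weighted objective can be sketched as follows (an assumed formulation for illustration, not AdImpute's exact cost function): reconstruction error on observed entries is combined with a down-weighted penalty that anchors originally missing entries to a preliminary imputation such as DrImpute's output.

```python
import numpy as np

def weighted_imputation_loss(x_observed, x_prefilled, x_reconstructed, mask, w=0.5):
    """Toy weighted cost for an imputation autoencoder (assumed formulation).

    Observed entries (mask == True) are fit to the measured values, while
    entries that were missing are softly anchored, with weight w, to a
    preliminary imputation, letting the network refine rather than ignore them.
    """
    obs_term = ((x_reconstructed - x_observed) ** 2)[mask].mean()
    imp_term = ((x_reconstructed - x_prefilled) ** 2)[~mask].mean()
    return obs_term + w * imp_term

rng = np.random.default_rng(5)
true = rng.poisson(5.0, size=(100, 20)).astype(float)   # toy expression matrix
mask = rng.random(true.shape) > 0.3                      # True where observed
observed = np.where(mask, true, 0.0)                     # dropouts recorded as zeros
prefilled = np.where(mask, true, true.mean())            # stand-in for DrImpute output
reconstruction = prefilled + rng.normal(0, 0.1, true.shape)
print(weighted_imputation_loss(observed, prefilled, reconstruction, mask))
```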


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Qiang Yu ◽  
Hongwei Huo ◽  
Dazheng Feng

Identifying conserved patterns in DNA sequences, namely motif discovery, is an important and challenging computational task. Containing hundreds or more sequences, high-throughput sequencing data sets help improve the identification accuracy of motif discovery but demand even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP but also for other DNA data mining tasks with the same demand. Experimental results on simulated data show that the proposed algorithm can find motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
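For illustration, the brute-force version of the pair-extraction step can be written as below (my own sketch; the point of PairMotifChIP is precisely to avoid this quadratic enumeration with a faster extraction method): collect pairs of l-mers from different sequences whose Hamming distance is at most d.

```python
from itertools import combinations

def lmer_pairs(sequences, l, d):
    """Enumerate pairs of l-mers from different sequences with Hamming distance <= d."""
    def lmers(s):
        return [s[i:i + l] for i in range(len(s) - l + 1)]

    def hamming(a, b):
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    pairs = []
    for (i, s1), (j, s2) in combinations(enumerate(sequences), 2):
        for a in lmers(s1):
            for b in lmers(s2):
                if hamming(a, b) <= d:
                    pairs.append((a, b, i, j))
    return pairs

seqs = ["ACGTACGTGG", "TTACGAACGT", "GGGACGTACG"]
print(lmer_pairs(seqs, l=5, d=1)[:5])
```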


2017 ◽  
Vol 607 ◽  
pp. A95 ◽  
Author(s):  
N. Aghanim ◽  
Y. Akrami ◽  
M. Ashdown ◽  
J. Aumont ◽  
...  

The six parameters of the standard ΛCDM model have best-fit values derived from the Planck temperature power spectrum that are shifted somewhat from the best-fit values derived from WMAP data. These shifts are driven by features in the Planck temperature power spectrum at angular scales that had never before been measured to cosmic-variance level precision. We have investigated these shifts to determine whether they are within the range of expectation and to understand their origin in the data. Taking our parameter set to be the optical depth of the reionized intergalactic medium τ, the baryon density ω_b, the matter density ω_m, the angular size of the sound horizon θ_*, the spectral index of the primordial power spectrum n_s, and A_s e^(−2τ) (where A_s is the amplitude of the primordial power spectrum), we have examined the change in best-fit values between a WMAP-like large angular-scale data set (with multipole moment ℓ < 800 in the Planck temperature power spectrum) and an all angular-scale data set (ℓ < 2500 in the Planck temperature power spectrum), each with a prior on τ of 0.07 ± 0.02. We find that the shifts, in units of the 1σ expected dispersion for each parameter, are {Δτ, Δ(A_s e^(−2τ)), Δn_s, Δω_m, Δω_b, Δθ_*} = {−1.7, −2.2, 1.2, −2.0, 1.1, 0.9}, with a χ² value of 8.0. We find that this χ² value is exceeded in 15% of our simulated data sets, and that a parameter deviates by more than 2.2σ in 9% of simulated data sets, meaning that the shifts are not unusually large. Comparing ℓ < 800 instead to ℓ > 800, or splitting at a different multipole, yields similar results. We examined the ℓ < 800 model residuals in the ℓ > 800 power spectrum data and find that the features there that drive these shifts are a set of oscillations across a broad range of angular scales. Although they partly appear similar to the effects of enhanced gravitational lensing, the shifts in ΛCDM parameters that arise in response to these features correspond to model spectrum changes that are predominantly due to non-lensing effects; the only exception is τ, which, at fixed A_s e^(−2τ), affects the ℓ > 800 temperature power spectrum solely through the associated change in A_s and the impact of that on the lensing potential power spectrum. We also ask, “what is it about the power spectrum at ℓ < 800 that leads to somewhat different best-fit parameters than come from the full ℓ range?” We find that if we discard the data at ℓ < 30, where there is a roughly 2σ downward fluctuation in power relative to the model that best fits the full ℓ range, the ℓ < 800 best-fit parameters shift significantly towards the ℓ < 2500 best-fit parameters. In contrast, including ℓ < 30, this previously noted “low-ℓ deficit” drives n_s up and impacts parameters correlated with n_s, such as ω_m and H_0. As expected, the ℓ < 30 data have a much greater impact on the ℓ < 800 best fit than on the ℓ < 2500 best fit. So although the shifts are not very significant, we find that they can be understood through the combined effects of an oscillatory-like set of high-ℓ residuals and the deficit in low-ℓ power, excursions consistent with sample variance that happen to map onto changes in cosmological parameters. Finally, we examine agreement between Planck TT data and two other CMB data sets, namely the Planck lensing reconstruction and the TT power spectrum measured by the South Pole Telescope, again finding a lack of convincing evidence of any significant deviations in parameters, suggesting that current CMB data sets give an internally consistent picture of the ΛCDM model.
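The consistency test described above can be mimicked with a toy calculation (an assumed, simplified sketch, not the Planck pipeline): given an ensemble of simulated parameter shifts, compute the χ² of the observed shift vector under the simulation covariance and the fraction of simulations with a larger χ². Here the toy ensemble is uncorrelated, so the printed numbers will not reproduce the paper's quoted 15%, which accounts for the actual parameter correlations.

```python
import numpy as np

rng = np.random.default_rng(6)

params = ["tau", "A_s*exp(-2tau)", "n_s", "omega_m", "omega_b", "theta_*"]
observed_shift = np.array([-1.7, -2.2, 1.2, -2.0, 1.1, 0.9])   # in units of 1-sigma dispersion

# Toy ensemble of shifts from "simulated data sets" (uncorrelated stand-in).
sims = rng.standard_normal((5000, len(params)))

cov = np.cov(sims, rowvar=False)
inv_cov = np.linalg.inv(cov)
chi2_obs = observed_shift @ inv_cov @ observed_shift
chi2_sims = np.einsum("ij,jk,ik->i", sims, inv_cov, sims)

print(f"chi^2 of the observed shifts: {chi2_obs:.1f}")
print(f"fraction of simulations exceeding it: {np.mean(chi2_sims > chi2_obs):.2%}")
```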

