Explaining predictive models using Shapley values and non-parametric vine copulas

2021, Vol 9 (1), pp. 62-81
Author(s): Kjersti Aas, Thomas Nagler, Martin Jullum, Anders Løland

Abstract: In this paper, the goal is to explain predictions from complex machine learning models. One method that has become very popular during the last few years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features in reality are dependent, this may lead to incorrect explanations. Hence, there have recently been attempts at appropriately modelling/estimating the dependence between the features. Although the previously proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions and are able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
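
To make the quantity being estimated concrete, the sketch below computes conditional Shapley values by Monte Carlo: the contribution function v(S) = E[f(X) | X_S = x*_S] is approximated by averaging model predictions over draws from a conditional sampler. The sampler shown simply clamps the conditioning features on bootstrap resamples (i.e. the independence assumption the paper criticises); in the paper's approaches it would instead draw from a fitted vine copula. The toy model and all names are illustrative assumptions, not the authors' implementation.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)

def shapley_weight(s, m):
    """Shapley kernel weight |S|!(m-|S|-1)!/m! for a coalition of size s."""
    return math.factorial(s) * math.factorial(m - s - 1) / math.factorial(m)

def v(S, x_star, model, cond_sampler, n_mc=500):
    """Contribution function v(S) = E[f(X) | X_S = x*_S], estimated by Monte Carlo.
    cond_sampler(S, x_star, n) must return n draws of the full feature vector with
    the features in S fixed at x_star[S] -- this is the piece the paper models with
    vine copulas; here it is a user-supplied placeholder."""
    if len(S) == x_star.size:
        return model(x_star[None, :])[0]
    samples = cond_sampler(S, x_star, n_mc)
    return model(samples).mean()

def shapley_values(x_star, model, cond_sampler):
    m = x_star.size
    phi = np.zeros(m)
    for j in range(m):
        others = [k for k in range(m) if k != j]
        for size in range(m):
            for S in itertools.combinations(others, size):
                w = shapley_weight(len(S), m)
                phi[j] += w * (v(list(S) + [j], x_star, model, cond_sampler)
                               - v(list(S), x_star, model, cond_sampler))
    return phi

# Toy example: linear model and an *independence* sampler standing in for the
# dependence-aware (vine copula) conditional sampler described in the paper.
X_train = rng.normal(size=(1000, 3))
beta = np.array([1.0, -2.0, 0.5])
model = lambda X: X @ beta

def independence_sampler(S, x_star, n):
    samples = X_train[rng.integers(0, len(X_train), n)].copy()
    samples[:, S] = x_star[S]          # clamp the conditioning features
    return samples

print(shapley_values(X_train[0], model, independence_sampler))
```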

2005, Vol 30 (4), pp. 369-396
Author(s): Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze the growth of a latent trait variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data with time points specified freely over subjects. Furthermore, features useful for longitudinal analyses were included: an autoregressive error structure of degree one for the trait residuals and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


2020
Author(s): Edlin J. Guerra-Castro, Juan Carlos Cajas, Nuno Simões, Juan J Cruz-Motta, Maite Mascaró

Abstract: SSP (simulation-based sampling protocol) is an R package that uses simulation of ecological data and the dissimilarity-based multivariate standard error (MultSE) as an estimator of precision to evaluate the adequacy of different sampling efforts for studies that will test hypotheses using permutational multivariate analysis of variance. The procedure consists of simulating several extensive data matrices that mimic some of the relevant ecological features of the community of interest, using a pilot data set. For each simulated data set, several sampling efforts are repeatedly executed and the MultSE is calculated. The mean value and the 0.025 and 0.975 quantiles of the MultSE for each sampling effort across all simulated data sets are then estimated and standardized relative to the lowest sampling effort. The optimal sampling effort is identified as the one at which an increase in sampling effort no longer improves precision beyond a threshold value (e.g. 2.5%). The performance of SSP was validated using real data, and in all examples the simulated data mimicked the real data well, allowing the relationship between MultSE and n to be evaluated beyond the sample size of the pilot studies. SSP can be used to estimate sample size in a wide range of situations, from simple (e.g. a single site) to more complex (e.g. several sites across different habitats) experimental designs. The latter constitutes an important advantage, since it offers new possibilities for complex sampling designs, as has been advised for multi-scale studies in ecology.
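
For illustration, the MultSE estimator that SSP sweeps over sampling efforts can be computed directly from pairwise dissimilarities. The sketch below is a minimal reading of that procedure, not a reimplementation of the SSP package; the Bray-Curtis dissimilarity, the Poisson toy community, and the simple resampling scheme are assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

def mult_se(community, metric="braycurtis"):
    """Dissimilarity-based multivariate standard error:
    MultSE = sqrt(V / n), with pseudo-variance V = SS / (n - 1) and
    SS = (1/n) * sum of squared pairwise dissimilarities."""
    n = community.shape[0]
    d = pdist(community, metric=metric)
    ss = (d ** 2).sum() / n
    v = ss / (n - 1)
    return np.sqrt(v / n)

# Toy 'pilot' community matrix: 100 potential sampling units x 20 species (abundances).
pilot = rng.poisson(lam=2.0, size=(100, 20)).astype(float)

# Sweep the sampling effort n and summarize MultSE over repeated random subsets,
# mimicking (in a very reduced form) the precision-vs-effort curve SSP produces.
for n in range(5, 55, 5):
    reps = [mult_se(pilot[rng.choice(100, n, replace=False)]) for _ in range(200)]
    mean, lo, hi = np.mean(reps), np.quantile(reps, 0.025), np.quantile(reps, 0.975)
    print(f"n = {n:2d}  MultSE = {mean:.3f}  [{lo:.3f}, {hi:.3f}]")
```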


Author(s): J. Diebolt, M.-A. El-Aroui, V. Durbec, B. Villain

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimates. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQs) in a nonparametric spirit, e.g., Pickands' excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique is available for assessing and comparing these MEEQs when they are to be used on a given data set. This paper is a first attempt to provide such techniques. We first compare the estimates given by the main MEEQs on several simulated data sets. Then we suggest goodness-of-fit (GoF) tests to assess the MEEQs by measuring the quality of their underlying approximations. It is shown that GoF techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQs are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are applied to real data sets drawn from an industrial context where extreme quantiles are needed to define maintenance policies.
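
As a concrete example of one such MEEQ, the exponential tail (ET) method fits an exponential distribution to the excesses over a high threshold and extrapolates into the tail. The minimal sketch below illustrates that estimator; the number of exceedances k, the test distribution, and the comparison are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

def et_quantile(x, p, k=100):
    """Exponential tail (ET) estimator of the p-quantile.
    Excesses over the threshold u = (k+1)-th largest observation are modelled as
    Exponential(sigma); the quantile is obtained by extrapolating the fitted tail:
    x_p = u + sigma_hat * log(k / (n * (1 - p)))."""
    x = np.sort(x)
    n = x.size
    u = x[n - k - 1]                   # threshold: leaves k exceedances
    excesses = x[n - k:] - u
    sigma_hat = excesses.mean()        # MLE of the exponential scale
    return u + sigma_hat * np.log(k / (n * (1.0 - p)))

# Compare against the true quantile of a distribution with a known tail.
n, p = 5000, 0.999
sample = rng.standard_exponential(n) ** 1.5   # Weibull-type tail, truth is known
true_q = (-np.log(1.0 - p)) ** 1.5
print("ET estimate :", et_quantile(sample, p))
print("Empirical   :", np.quantile(sample, p))
print("True        :", true_q)
```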


2016, Vol 2016, pp. 1-10
Author(s): Qiang Yu, Hongwei Huo, Dazheng Feng

Identifying conserved patterns in DNA sequences, namely motif discovery, is an important and challenging computational task. Containing hundreds or more sequences, a high-throughput sequencing data set can improve the identification accuracy of motif discovery but demands even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm can find motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
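
The primitive being extracted, pairs of l-mers at small Hamming distance, can be illustrated with a brute-force sketch on toy sequences. PairMotifChIP's contribution is precisely to avoid this quadratic scan, so the code below only shows what is being computed, not how the paper computes it quickly.

```python
from itertools import combinations

def lmers(seq, l):
    """All l-mers of a DNA sequence with their start positions."""
    return [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)]

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def close_lmer_pairs(sequences, l, d_max):
    """Brute-force extraction of all pairs of l-mers (from different sequences)
    whose Hamming distance is at most d_max.  PairMotifChIP replaces this
    O(total_lmers^2) scan with a much faster extraction strategy."""
    all_lmers = [(s_idx, pos, mer)
                 for s_idx, seq in enumerate(sequences)
                 for pos, mer in lmers(seq, l)]
    pairs = []
    for (s1, p1, m1), (s2, p2, m2) in combinations(all_lmers, 2):
        if s1 != s2 and hamming(m1, m2) <= d_max:
            pairs.append(((s1, p1, m1), (s2, p2, m2)))
    return pairs

toy = ["ACGTACGTGACTT", "TTACGAACGTGCA", "GGGACGTGACTAC"]
for a, b in close_lmer_pairs(toy, l=8, d_max=1):
    print(a, "<->", b)
```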


2020
Author(s): Marton Soskuthy

Generalised additive mixed models (GAMMs) are increasingly popular in dynamic speech analysis, where the focus is on measurements with temporal or spatial structure such as formant, pitch or tongue contours. GAMMs provide a range of tools for dealing with the non-linear contour shapes and complex hierarchical organisation characteristic of such data sets. This, however, means that analysts are faced with non-trivial choices, many of which have a serious impact on the statistical validity of their analyses. This paper presents type I and type II error simulations to help researchers make informed decisions about modelling strategies when using GAMMs to analyse phonetic data. The simulations are based on two real data sets containing F2 and pitch contours, and a simulated data set modelled after the F2 data. They reflect typical scenarios in dynamic speech analysis. The main emphasis is on (i) dealing with dependencies within contours and higher-level units using random structures and other tools, and (ii) strategies for significance testing using GAMMs. The paper concludes with a small set of recommendations for fitting GAMMs, and provides advice on diagnosing issues and tailoring GAMMs to specific data sets. It is also accompanied by a GitHub repository including a tutorial on running type I error simulations for existing data sets: https://github.com/soskuthy/gamm_strategies.


2021, Vol 14 (12), pp. 612
Author(s): Jianan Zhu, Yang Feng

We propose a new ensemble classification algorithm, named super random subspace ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the random subspace ensemble algorithm (RaSE). The RaSE method was shown to be a flexible framework that can be coupled with any existing base classifier. However, the success of RaSE largely depends on the proper choice of the base classifier, which is unfortunately unknown in practice. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively changes the base classifier distribution as well as the subspace distribution. We show that the Super RaSE algorithm and its iterative version perform competitively on a wide range of simulated data sets and two real data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.
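
A stripped-down sketch of the central idea, jointly sampling a random subspace and a random base classifier and keeping the best-scoring pair, is given below using scikit-learn. The classifier pool, the scoring rule, and the ensemble sizes are illustrative assumptions; this does not reproduce the RaSEn implementation or the iterative reweighting.

```python
import numpy as np
from sklearn.base import clone
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

BASE_CLASSIFIERS = {                        # pool the algorithm samples from (an assumption)
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

def fit_super_rase(X, y, B1=20, B2=10, max_dim=5):
    """For each of B1 ensemble members, draw B2 random (classifier, subspace) pairs,
    keep the pair with the best cross-validated accuracy, and refit it on the full
    data.  Prediction is by majority vote over the kept members."""
    p = X.shape[1]
    members = []
    for _ in range(B1):
        best = None
        for _ in range(B2):
            name = rng.choice(list(BASE_CLASSIFIERS))
            dim = int(rng.integers(1, max_dim + 1))
            subspace = rng.choice(p, size=dim, replace=False)
            score = cross_val_score(clone(BASE_CLASSIFIERS[name]),
                                    X[:, subspace], y, cv=5).mean()
            if best is None or score > best[0]:
                best = (score, name, subspace)
        _, name, subspace = best
        members.append((clone(BASE_CLASSIFIERS[name]).fit(X[:, subspace], y), subspace))
    return members

def predict_super_rase(members, X):
    votes = np.stack([clf.predict(X[:, sub]) for clf, sub in members])
    return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote for 0/1 labels

# Sparse toy problem: only 3 of 50 features carry signal.
X = rng.normal(size=(300, 50))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300) > 0).astype(int)
ensemble = fit_super_rase(X, y)
print("training accuracy:", (predict_super_rase(ensemble, X) == y).mean())
```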


Sensors, 2020, Vol 20 (11), pp. 3270
Author(s): Baris Satar, Gokhan Soysal, Xue Jiang, Murat Efe, Thiagalingam Kirubarajan

Conventional methods such as matched filtering and the fractional lower-order statistics cross-ambiguity function, as well as recent methods such as compressed sensing and track-before-detect, are used for target detection by passive radars. Target detection with these algorithms usually assumes that the background noise is Gaussian. However, non-Gaussian impulsive noise is inherent in real-world radar problems. In this paper, a new optimization-based algorithm that uses weighted l1 and l2 norms is proposed as an alternative to the existing algorithms, whose performance degrades in the presence of impulsive noise. To determine the weights of these norms, the parameter that quantifies the impulsiveness level of the noise is estimated. The aim of the proposed algorithm is to increase the target detection performance of universal mobile telecommunication system (UMTS) based passive radars by providing higher resolution and better suppression of the sidelobes in both range and Doppler. Results obtained from both simulated data with α-stable distributed noise and real data recorded by a UMTS-based passive radar platform are presented to demonstrate the superiority of the proposed algorithm. The results show that the proposed algorithm provides more robust and accurate detection performance than the conventional methods for noise models with different impulsiveness levels.
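
One plausible reading of such a weighted-norm formulation is sketched below: a sparse range-Doppler scene is recovered by minimizing a weighted combination of l1 and l2 data-misfit terms plus a sparsity penalty, where the l1 term absorbs impulsive samples and the weights would in practice be tied to the estimated impulsiveness (e.g. the α of an α-stable noise model). The objective, the weights, and the toy measurement model are assumptions made for illustration and are not the paper's exact formulation.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)

def weighted_norm_recovery(A, y, w1, w2, lam):
    """Recover a sparse scene x from y = A @ x + impulsive noise by minimizing
        w1 * ||y - A x||_1  +  w2 * ||y - A x||_2^2  +  lam * ||x||_1.
    The l1 misfit is robust to impulsive samples, the l2 misfit handles the
    Gaussian-like bulk; w1 and w2 would be set from an estimate of the noise
    impulsiveness.  This is a generic sketch, not the paper's exact objective."""
    x = cp.Variable(A.shape[1])
    residual = y - A @ x
    objective = cp.Minimize(w1 * cp.norm1(residual)
                            + w2 * cp.sum_squares(residual)
                            + lam * cp.norm1(x))
    cp.Problem(objective).solve()
    return x.value

# Toy example: 10-sparse scene, random measurement matrix A, heavy-tailed noise.
n_meas, n_cells = 200, 400
A = rng.normal(size=(n_meas, n_cells)) / np.sqrt(n_meas)
x_true = np.zeros(n_cells)
x_true[rng.choice(n_cells, 10, replace=False)] = rng.normal(3, 1, 10)
noise = rng.standard_t(df=1.5, size=n_meas) * 0.1        # impulsive noise
y = A @ x_true + noise

x_hat = weighted_norm_recovery(A, y, w1=1.0, w2=0.2, lam=0.05)
print("largest recovered cells:", np.argsort(-np.abs(x_hat))[:10])
```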


Sensors, 2021, Vol 21 (10), pp. 3406
Author(s): Jie Jiang, Yin Zou, Lidong Chen, Yujie Fang

Precise localization and pose estimation in indoor environments are commonly employed in a wide range of applications, including robotics, augmented reality, and navigation and positioning services. Such applications can be addressed via visual-based localization using a pre-built 3D model. The increase in search space associated with large scenes can be overcome by retrieving images in advance and subsequently estimating the pose. The majority of current deep learning-based image retrieval methods require labeled data, which increases data annotation costs and complicates data acquisition. In this paper, we propose an unsupervised hierarchical indoor localization framework that integrates an unsupervised variational autoencoder (VAE) network with a visual-based Structure-from-Motion (SfM) approach in order to extract global and local features. During the localization process, global features are used for image retrieval at the level of the scene map in order to obtain candidate images, and the pose is subsequently estimated from 2D-3D matches between the query and candidate images. Only RGB images are used as input to the proposed localization system, which is both convenient and challenging. Experimental results reveal that the proposed method can localize images within 0.16 m and 4° on the 7-Scenes data sets, and 32.8% of images within 5 m and 20° on the Baidu data set. Furthermore, the proposed method achieves higher precision than advanced methods.
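
To illustrate the unsupervised global-feature idea, the sketch below defines a minimal convolutional VAE whose latent mean serves as a global image descriptor for retrieval by cosine similarity. The architecture, image size, and loss are generic assumptions and not the network described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Minimal convolutional VAE; the latent mean mu(x) is used as a global image
    descriptor for retrieval.  Sizes are illustrative only."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.enc = nn.Sequential(                     # 3 x 64 x 64 input assumed
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 128 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        x_hat = self.dec(self.fc_dec(z).view(-1, 128, 8, 8))
        recon = F.mse_loss(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                                          # negative ELBO

def retrieve(model, query, database, k=5):
    """Rank database images by cosine similarity of their latent means."""
    with torch.no_grad():
        q = model.encode(query)[0]
        db = model.encode(database)[0]
        sims = F.cosine_similarity(q, db)
        return sims.argsort(descending=True)[:k]

# Usage sketch (untrained model, random images standing in for indoor views):
model = ConvVAE()
images = torch.rand(16, 3, 64, 64)
print(retrieve(model, images[:1], images))
```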


2018, Vol 11 (2), pp. 53-67
Author(s): Ajay Kumar, Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique values of each attribute using their support and then aggregates these weights along the rows to obtain the support of every row. Next, the data object having the largest support is chosen as the initial center, followed by finding further centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
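
The selection rule described above translates into a few lines of code. The sketch below follows that description literally (value supports summed per row, then the farthest rows chosen as further centers), with a plain mismatch count assumed for the "greatest distance" step.

```python
from collections import Counter
import numpy as np

def support_based_centers(data, k):
    """Pick k initial cluster centers for categorical data.
    1. Support of each attribute value = its frequency within that attribute.
    2. Support of a row = sum of the supports of its values.
    3. First center = row with the largest support; each further center is the row
       with the greatest total mismatch distance to the centers chosen so far."""
    data = np.asarray(data, dtype=object)
    n, m = data.shape
    supports = [Counter(data[:, j]) for j in range(m)]
    row_support = np.array([sum(supports[j][row[j]] for j in range(m)) for row in data])

    centers = [int(np.argmax(row_support))]
    while len(centers) < k:
        dist_to_centers = np.array([
            sum((data[i] != data[c]).sum() for c in centers) for i in range(n)
        ])
        dist_to_centers[centers] = -1          # never re-pick a chosen center
        centers.append(int(np.argmax(dist_to_centers)))
    return data[centers]

toy = [["red", "small", "yes"],
       ["red", "large", "no"],
       ["blue", "small", "yes"],
       ["green", "large", "no"],
       ["red", "small", "no"]]
print(support_based_centers(toy, k=2))
```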


2021, pp. gr.273631.120
Author(s): Xinhao Liu, Huw A Ogilvie, Luay Nakhleh

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa, and with more taxa through a divide-and-conquer approach. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method has accuracy comparable to or better than previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation it is flexible enough to enable future implementations of all kinds of population models.

