Empirical evaluation of directional-dependence tests

2015 ◽  
Vol 39 (6) ◽  
pp. 560-569 ◽  
Author(s):  
Felix Thoemmes

Testing of directional dependence is a method to infer causal direction that has recently attracted some attention. Previous examples, e.g. by von Eye and DeShon (2012a), and extensive simulation studies by Pornprasertmanit and Little (2012) have demonstrated that under specific assumptions, directional-dependence tests can recover the true causal direction between two variables. Simulation results are important in the evaluation of any statistical method, but they are necessarily less complex than real data, which come with potential irregularities (e.g. departures from linearity, presence of confounders). In this article, we evaluate the performance of directional-dependence tests using benchmark data consisting of 65 variable pairs with known causal order. We find that between 21% and 43% of all cases are correctly classified by directional-dependence tests that rely on differences in skew, kurtosis, or a combined measure. We then examine some of the assumptions of the directional-dependence test and find that for virtually all variable pairs, some assumptions are violated. When only pairs that fulfill the assumptions are selected, the performance of all directional-dependence tests improves. We probe whether particular features of the variable pairs influence whether a test yields a correct or incorrect result, but find no strong predictors. Our findings provide a complementary picture to previously conducted simulation studies and highlight the fact that directional-dependence tests are prone to causal classification errors when key assumptions are violated. Such violations are potentially common in real data.
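
To make the skew-based idea concrete, here is a minimal sketch in Python of one such heuristic, under the assumption (common in this literature) that a linear model with normal error pulls the response toward normality, so the more skewed of the two variables is taken as the cause. The function name and the decision rule are illustrative, not the authors' exact tests.

```python
# A minimal sketch of a skewness-based directional-dependence heuristic.
# Under a linear model with a non-normal cause and normal error, the response
# is closer to normal than the cause, so the more skewed variable is taken
# as the cause. The decision rule here is illustrative only.
import numpy as np
from scipy import stats

def direction_by_skew(x, y):
    """Return 'x->y' if x is more skewed than y, else 'y->x'."""
    sx, sy = abs(stats.skew(x)), abs(stats.skew(y))
    return "x->y" if sx > sy else "y->x"

rng = np.random.default_rng(0)
x = rng.exponential(size=2000)          # skewed cause
y = 0.8 * x + rng.normal(size=2000)     # y inherits attenuated skew
print(direction_by_skew(x, y))          # expected: 'x->y'
```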

2016 ◽  
Vol 40 (1) ◽  
pp. 318-330 ◽  
Author(s):  
Amirhossein Amiri ◽  
Reza Ghashghaei ◽  
Mohammad Reza Maleki

In this paper, we investigate the misleading effects of measurement errors on the simultaneous monitoring of the multivariate process mean and variability. For this purpose, we incorporate measurement errors into a hybrid method based on the generalized likelihood ratio (GLR) and exponentially weighted moving average (EWMA) control charts. We then propose four remedial methods to reduce the effects of measurement errors on the performance of the monitoring procedure. The performance of the monitoring procedure, as well as that of the proposed remedial methods, is investigated through extensive simulation studies and a real data example.
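
The GLR/EWMA hybrid itself is not reproduced here, but the following sketch illustrates the core ingredients: a multivariate EWMA (MEWMA) statistic computed on error-contaminated observations, together with the classical "multiple measurements" remedy, in which averaging k repeated readings shrinks the error covariance by a factor of k. All names and parameter values are ours.

```python
# A minimal sketch of an MEWMA chart statistic under additive measurement
# error, with the "multiple measurements" remedy. This only illustrates the
# effect of the error; it is not the authors' GLR/EWMA hybrid.
import numpy as np

def mewma_stats(obs, sigma, lam=0.2):
    """Return the T^2 sequence of an MEWMA chart for observations `obs`,
    using the asymptotic covariance of the EWMA vector."""
    p = obs.shape[1]
    sigma_z = (lam / (2 - lam)) * sigma
    inv = np.linalg.inv(sigma_z)
    z = np.zeros(p)
    t2 = []
    for w in obs:
        z = lam * w + (1 - lam) * z
        t2.append(z @ inv @ z)
    return np.array(t2)

rng = np.random.default_rng(1)
p, n, k = 2, 200, 5
sigma_x = np.eye(p)                   # true process covariance
sigma_e = 0.5 * np.eye(p)             # measurement-error covariance
x = rng.multivariate_normal(np.zeros(p), sigma_x, size=n)
x[100:] += 1.0                        # mean shift at t = 100
# k repeated readings per item, averaged: error covariance becomes sigma_e / k
w = x + rng.multivariate_normal(np.zeros(p), sigma_e / k, size=n)
t2 = mewma_stats(w, sigma_x + sigma_e / k)
print(t2[:100].mean(), t2[100:].mean())   # the shift inflates the statistic
```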


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Abstract Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
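
The sensitivity that motivates the robust estimator is easy to demonstrate. The sketch below uses scipy's maximum-likelihood Yeo–Johnson fit and shows how a handful of outliers shifts the estimated transformation parameter; the robust estimator proposed in the paper is not reproduced.

```python
# A minimal sketch showing how a few outliers pull the maximum-likelihood
# Yeo-Johnson parameter away from the value chosen for the clean data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
clean = rng.lognormal(mean=0.0, sigma=0.6, size=1000)      # skewed data
_, lam_clean = stats.yeojohnson(clean)                     # ML fit

contaminated = np.concatenate([clean, np.full(10, 50.0)])  # a few outliers
_, lam_contam = stats.yeojohnson(contaminated)

print(f"lambda (clean):        {lam_clean:.3f}")
print(f"lambda (contaminated): {lam_contam:.3f}")          # noticeably shifted
```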


Author(s):  
Guanghao Qi ◽  
Nilanjan Chatterjee

Abstract
Background: Previous studies have often evaluated methods for Mendelian randomization (MR) analysis based on simulations that do not adequately reflect the data-generating mechanisms in genome-wide association studies (GWAS), and there are often discrepancies in the performance of MR methods between simulations and real data sets.
Methods: We use a simulation framework that generates data on full GWAS for two traits under a realistic model for effect-size distribution, coherent with the heritability, co-heritability and polygenicity typically observed for complex traits. We further use recent data generated from GWAS of 38 biomarkers in the UK Biobank and perform down-sampling to investigate trends in estimates of the causal effects of these biomarkers on the risk of type 2 diabetes (T2D).
Results: Simulation studies show that weighted mode and MRMix are the only two methods that maintain the correct type I error rate in a diverse set of scenarios. Between the two methods, MRMix tends to be more powerful for larger GWAS, whereas the opposite is true for smaller sample sizes. Among the other methods, random-effect IVW (inverse-variance weighted method), MR-Robust and MR-RAPS (robust adjusted profile score) tend to perform best in maintaining a low mean-squared error when the InSIDE assumption is satisfied, but can produce large bias when InSIDE is violated. In real-data analysis, some biomarkers showed major heterogeneity in estimates of their causal effects on the risk of T2D across the different methods, and estimates from many methods trended in one direction with increasing sample size, with patterns similar to those observed in simulation studies.
Conclusion: The relative performance of different MR methods depends heavily on the sample sizes of the underlying GWAS, the proportion of valid instruments and the validity of the InSIDE assumption. Down-sampling analysis can be used in large GWAS for the possible detection of bias in the MR methods.
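
For reference, the random-effect IVW estimator mentioned above can be written in a few lines from GWAS summary statistics. The sketch below uses simulated per-SNP effects and a multiplicative random-effect variance inflation; it is a textbook version, not the exact implementation benchmarked in the paper.

```python
# A minimal sketch of the random-effect inverse-variance weighted (IVW)
# Mendelian-randomization estimator. bx, by are per-SNP effects on exposure
# and outcome; sy are the standard errors of by. Inputs are simulated.
import numpy as np

def ivw(bx, by, sy):
    """IVW causal-effect estimate and standard error (multiplicative
    random-effect version: SE inflated when heterogeneity exceeds 1)."""
    w = bx**2 / sy**2                          # weights of the ratio estimates
    beta = np.sum(w * by / bx) / np.sum(w)
    resid = (by / bx - beta) * np.sqrt(w)      # standardized residuals
    phi = max(1.0, np.sum(resid**2) / (len(bx) - 1))
    se = np.sqrt(phi / np.sum(w))
    return beta, se

rng = np.random.default_rng(3)
m, true_beta = 50, 0.3
bx = rng.normal(0.1, 0.02, m)                  # instrument-exposure effects
sy = np.full(m, 0.01)
by = true_beta * bx + rng.normal(0, sy)        # all instruments valid here
print(ivw(bx, by, sy))                         # ~ (0.3, small SE)
```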


2021 ◽  
Vol 11 (8) ◽  
pp. 3484
Author(s):  
Martin Tabakov ◽  
Adrian Chlopowiec ◽  
Adam Chlopowiec ◽  
Adam Dlubak

In this research, we introduce a classification procedure based on rule induction and fuzzy reasoning. The classifier generalizes attribute information to handle the uncertainty that often occurs in real data. To induce fuzzy rules, we define the corresponding fuzzy information system. A transformation of the derived rules into interval type-2 fuzzy rules is provided as well. The applied fuzzification is optimized with respect to the footprint of uncertainty of the corresponding type-2 fuzzy sets. The classification process is based on Mamdani-type fuzzy inference. The proposed method was evaluated with the F-score measure on benchmark data.
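
The interval type-2 machinery is not reproduced here, but a minimal type-1 sketch of Mamdani-style inference (triangular memberships, min implication, max aggregation, centroid defuzzification) conveys the reasoning step; the rule base and membership parameters are invented for illustration.

```python
# A minimal sketch of Mamdani-style inference with triangular type-1
# memberships. The paper's interval type-2 extension is not shown.
import numpy as np

def tri(x, a, b, c):
    """Triangular membership with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

u = np.linspace(0.0, 10.0, 501)          # discretized output universe

def infer(x):
    """Two rules: IF x is LOW THEN y is LOW; IF x is HIGH THEN y is HIGH."""
    fire_low = tri(x, -0.1, 0.0, 5.0)    # firing strengths of the rules
    fire_high = tri(x, 5.0, 10.0, 10.1)
    # Mamdani: clip each consequent at its firing strength, aggregate by max
    agg = np.maximum(np.minimum(fire_low, tri(u, 0, 2, 4)),
                     np.minimum(fire_high, tri(u, 6, 8, 10)))
    return np.sum(u * agg) / np.sum(agg)  # centroid defuzzification

print(infer(2.0), infer(8.0))            # low input -> ~2, high input -> ~8
```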


2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contain outliers with varied and unknown characteristics. Fully synthetic data usually consist of outliers and regular instances with clear characteristics and thus, in principle, allow for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose a generic process for generating datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We describe three instantiations of this generic process that generate outliers with specific characteristics, such as local outliers. To validate the process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of data reconstructed in this way. Next to showcasing the workflow, this confirms the usefulness of the proposed process. In particular, the process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
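
A concept-level sketch of the core idea, under simplifying assumptions: fit a density model (here a Gaussian mixture) to real data to reconstruct regular instances, then plant "local" outliers by inflating each sampled point's deviation from its component mean. The model choice and the factor 3.0 are ours, not the paper's exact instantiations.

```python
# A minimal sketch: reconstruct regular instances from a fitted density
# model, then generate outliers with a chosen characteristic (local).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
real = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(6, 1, (300, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(real)
regular, _ = gmm.sample(600)                  # reconstructed regular instances

pts, comp = gmm.sample(20)                    # seeds for outliers
outliers = gmm.means_[comp] + 3.0 * (pts - gmm.means_[comp])  # push outward

data = np.vstack([regular, outliers])         # benchmark dataset with labels
labels = np.r_[np.zeros(len(regular)), np.ones(len(outliers))]
```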


Author(s):  
Mohadese Jahanian ◽  
Amin Ramezani ◽  
Ali Moarefianpour ◽  
Mahdi Aliari Shouredeli

One of the most significant systems that can be described by partial differential equations (PDEs) is the transmission pipeline system. To avoid accidents originating from oil and gas pipeline leakage, the exact location and quantity of a leak must be identified. The goal is leakage diagnosis based on the system model and on real data provided by transmission-line systems. The nonlinear equations of the system are derived from the continuity and momentum equations. In this paper, the extended Kalman filter (EKF) is used to detect and locate leakage and to attenuate the negative effects of measurement and process noise. In addition, a robust extended Kalman filter (REKF) is applied to compensate for the effect of parameter uncertainty. The quantity and location of the leak are estimated along the pipeline. Simulation results show that the REKF estimates the leak and its location better than the EKF. This filter is robust against process noise, measurement noise, and parameter uncertainties, and also guarantees an upper bound on the covariance of the state-estimation error. Notably, the simulation results are validated with the OLGA software.
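
The pipeline model discretized from the continuity and momentum PDEs is not reproduced here; the sketch below only shows the generic EKF predict/update cycle that such a leak estimator is built on, with the leak size and location assumed to be appended to the state vector (an augmented-state formulation). All symbols are placeholders.

```python
# A minimal sketch of one extended Kalman filter (EKF) cycle for a generic
# nonlinear system x' = f(x) + w, z = h(x) + v. The functions f, h and their
# Jacobians F, H stand in for the discretized pipeline model.
import numpy as np

def ekf_step(x, P, z, f, F, h, H, Q, R):
    """One EKF predict/update cycle: returns updated state and covariance."""
    # predict
    x_pred = f(x)
    F_k = F(x)
    P_pred = F_k @ P @ F_k.T + Q
    # update with measurement z
    H_k = H(x_pred)
    S = H_k @ P_pred @ H_k.T + R
    K = P_pred @ H_k.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H_k) @ P_pred
    return x_new, P_new
```

The REKF used in the paper replaces this gain computation with one that accounts for bounded model uncertainty, which is what yields the guaranteed covariance bound.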


Author(s):  
R. W. Toogood

Abstract A number of programs have been developed for the automatic symbolic generation of efficient computer code for the dynamic analysis of serial rigid- and flexible-link manipulators. Code for both the inverse and the direct dynamics computations can be generated. The symbolic generators allow the robot base to be given an arbitrary linear acceleration and/or angular velocity and acceleration. The efficiency of the generated code is an important consideration for simulation studies and/or implementation in control systems. This paper briefly describes the symbolic generation and simplification techniques. The added computational load due to including the base motion is discussed. Some dynamics simulation results are presented for a 3R rigid-link manipulator mounted on an oscillating base, which graphically illustrate the effect of the base movement on the dynamics.
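
As a toy version of the idea, the sketch below symbolically derives the inverse dynamics of a single rigid link whose base undergoes a constant vertical acceleration (far simpler than the paper's 3R case and its oscillating base), simplifies the expression, and emits a numeric routine, which is the essence of symbolic code generation.

```python
# A minimal sketch of symbolic generation of manipulator dynamics code:
# single link on a base with constant vertical acceleration a_b.
import sympy as sp

t = sp.symbols('t')
m, l, g, ab = sp.symbols('m l g a_b')   # mass, length, gravity, base accel.
th = sp.Function('theta')(t)

# point mass at the link tip; the base motion adds a_b*t**2/2 of height
x = l * sp.sin(th)
y = -l * sp.cos(th) + ab * t**2 / 2

T = m * (sp.diff(x, t)**2 + sp.diff(y, t)**2) / 2   # kinetic energy
V = m * g * y                                        # potential energy
L = T - V

# Euler-Lagrange equation gives the inverse dynamics (joint torque)
tau = sp.simplify(sp.diff(sp.diff(L, sp.diff(th, t)), t) - sp.diff(L, th))
print(tau)   # base acceleration enters exactly like an extra gravity term

# emit an efficient numeric routine from the simplified expression
q, qd, qdd = sp.symbols('q qd qdd')
tau_s = (tau.subs(sp.diff(th, t, 2), qdd)
            .subs(sp.diff(th, t), qd)
            .subs(th, q))
tau_fn = sp.lambdify((q, qd, qdd, m, l, g, ab), tau_s)
```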


Signals ◽  
2022 ◽  
Vol 3 (1) ◽  
pp. 1-10
Author(s):  
Md. Noor-A-Rahim ◽  
M. Omar Khyam ◽  
Apel Mahmud ◽  
Xinde Li ◽  
Dirk Pesch ◽  
...  

Long-range (LoRa) communication has attracted much attention recently due to its utility for many Internet of Things applications. However, one of the key problems of LoRa technology is that it is vulnerable to noise/interference due to the use of only up-chirp signals during modulation. In this paper, to solve this problem, we propose a modulation scheme for LoRa communication based on joint up- and down-chirps, unlike the conventional LoRa modulation scheme. A fast Fourier transform (FFT)-based demodulation scheme is devised to detect the modulated symbols. To further improve the demodulation performance, a hybrid demodulation scheme comprising FFT- and correlation-based demodulation is also proposed. The performance of the proposed scheme is evaluated through extensive simulation results. We show that, compared to the conventional LoRa modulation scheme, the proposed scheme exhibits a performance gain of over 3 dB at a bit error rate of 10⁻⁴.
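
For context, the conventional baseline the paper improves on can be sketched compactly: LoRa-style chirp modulation encodes a symbol as the starting frequency of an up-chirp, and FFT demodulation dechirps and picks the peak bin. The proposed joint up-/down-chirp scheme and hybrid demodulator build on, but are not shown in, this sketch.

```python
# A minimal sketch of conventional LoRa-style chirp modulation and
# FFT-based demodulation (dechirp, then take the FFT-bin argmax).
import numpy as np

SF = 7
N = 2**SF                        # samples per chirp; symbols take values 0..N-1
n = np.arange(N)

def modulate(sym):
    """Up-chirp whose starting frequency encodes the symbol value."""
    return np.exp(2j * np.pi * n * (sym / N + n / (2 * N)))

def demodulate(rx):
    """Multiply by the conjugate base chirp, then locate the FFT peak."""
    dechirped = rx * np.exp(-2j * np.pi * n * n / (2 * N))
    return int(np.argmax(np.abs(np.fft.fft(dechirped))))

rng = np.random.default_rng(5)
sym = 42
rx = modulate(sym) + 0.5 * (rng.normal(size=N) + 1j * rng.normal(size=N))
print(demodulate(rx))            # recovers 42 with high probability
```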


Symmetry ◽  
2021 ◽  
Vol 13 (11) ◽  
pp. 2164
Author(s):  
Héctor J. Gómez ◽  
Diego I. Gallardo ◽  
Karol I. Santoro

In this paper, we present an extension of the truncated positive normal (TPN) distribution to model positive data with high kurtosis. The new model is defined as the quotient of two random variables: a TPN-distributed variable (numerator) and a power of a standard uniform variable (denominator). The resulting model has greater kurtosis than the TPN distribution. We study some properties of the distribution, such as moments, asymmetry, and kurtosis. Parameters are estimated by the method of moments and by maximum likelihood, the latter using the expectation-maximization algorithm. We performed simulation studies to assess parameter recovery and illustrate the model with a real data application related to body weight. The computational implementation of this work is included in the tpn package for R.
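
A minimal sketch of sampling from this quotient construction, assuming the representation X = Z / U^(1/q) with Z a positive-truncated normal and U standard uniform; the exact parameterization of the paper and its tpn package may differ, and q here is chosen large enough that the kurtosis is finite.

```python
# A minimal sketch of the quotient construction: TPN numerator divided by a
# power of a standard uniform, which thickens the right tail.
import numpy as np
from scipy import stats

def rtpn_quotient(n, mu, sigma, q, rng):
    """Draw n values of X = Z / U**(1/q), with Z ~ N(mu, sigma) truncated
    to (0, inf) and U ~ Uniform(0, 1)."""
    a = (0.0 - mu) / sigma                 # lower truncation in standard units
    z = stats.truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                            size=n, random_state=rng)
    u = rng.uniform(size=n)
    return z / u**(1.0 / q)

rng = np.random.default_rng(6)
# q > 4 keeps the fourth moment (and hence the kurtosis) finite
x = rtpn_quotient(100_000, mu=1.0, sigma=1.0, q=8.0, rng=rng)
print(stats.kurtosis(x))                   # heavier tail than the plain TPN
```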


2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, using several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM
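
CAMISIM's own interface is not shown here; the following concept-level sketch only illustrates the core simulation loop that such tools share: draw a community abundance profile, allocate reads to genomes accordingly, and record the taxonomic gold standard alongside the reads. Genome sequences and names are placeholders, and the real pipeline (strain simulation, error models, gold-standard files) is far richer.

```python
# A concept-level sketch of metagenome read simulation: log-normal community
# abundances, multinomial read allocation, error-free substring "reads".
import numpy as np

rng = np.random.default_rng(7)
genomes = {f"genome_{i}": "".join(rng.choice(list("ACGT"), size=5000))
           for i in range(3)}                     # placeholder genomes

abund = rng.lognormal(mean=0.0, sigma=1.0, size=len(genomes))
abund /= abund.sum()                              # relative abundance profile

read_len, n_reads = 100, 1000
counts = rng.multinomial(n_reads, abund)          # reads per genome
reads, truth = [], []                             # truth = taxonomic gold standard
for (name, seq), c in zip(genomes.items(), counts):
    starts = rng.integers(0, len(seq) - read_len, size=c)
    reads += [seq[s:s + read_len] for s in starts]
    truth += [name] * c
```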

