Bayesian wavelet de-noising with the caravan prior

2019 ◽  
Vol 23 ◽  
pp. 947-978
Author(s):  
Shota Gugushvili ◽  
Frank van der Meulen ◽  
Moritz Schauer ◽  
Peter Spreij

According to both domain expert knowledge and empirical evidence, wavelet coefficients of real signals tend to exhibit clustering patterns, in that they contain connected regions of coefficients of similar magnitude (large or small). A wavelet de-noising approach that takes this feature of the signal into account may in practice outperform other, more vanilla methods, both in terms of estimation error and the visual appearance of the estimates. Motivated by this observation, we present a Bayesian approach to wavelet de-noising in which dependencies between neighbouring wavelet coefficients are modelled a priori via a Markov chain-based prior that we term the caravan prior. Posterior computations in our method are performed via the Gibbs sampler. Using representative synthetic and real data examples, we conduct a detailed comparison of our approach with a benchmark empirical Bayes de-noising method (due to Johnstone and Silverman). We show that the caravan prior fares well and is therefore a useful addition to the wavelet de-noising toolbox.
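For intuition, the clustering idea can be sketched with a toy spike-and-slab model whose inclusion indicators follow a two-state Markov chain, so that a large coefficient encourages its neighbours to be treated as large too. This is a minimal stand-in, not the authors' construction (the caravan prior is built from gamma Markov chains), and all names and hyperparameters below are illustrative.

```python
import numpy as np

def gibbs_denoise(w, sigma=1.0, tau=3.0, p_stay=0.8, n_iter=500, seed=0):
    """Toy Gibbs sampler for w_j = theta_j + N(0, sigma^2), where theta_j
    is 0 when gamma_j = 0 and N(0, tau^2) when gamma_j = 1, and the
    indicators gamma_j follow a symmetric two-state Markov chain with
    stay probability p_stay (this encodes clustering of coefficients)."""
    rng = np.random.default_rng(seed)
    n = len(w)
    gamma = np.zeros(n, dtype=int)
    theta_sum = np.zeros(n)

    def loglik(wj, g):
        # Marginal likelihood of w_j given gamma_j (theta integrated out).
        var = sigma**2 + (tau**2 if g else 0.0)
        return -0.5 * (np.log(2 * np.pi * var) + wj**2 / var)

    def logtrans(a, b):
        return np.log(p_stay if a == b else 1.0 - p_stay)

    for _ in range(n_iter):  # burn-in omitted for brevity
        for j in range(n):
            lp = np.zeros(2)
            for g in (0, 1):
                lp[g] = loglik(w[j], g)
                if j > 0:
                    lp[g] += logtrans(gamma[j - 1], g)
                if j < n - 1:
                    lp[g] += logtrans(g, gamma[j + 1])
            p1 = 1.0 / (1.0 + np.exp(lp[0] - lp[1]))
            gamma[j] = int(rng.random() < p1)
        # Conjugate draw of theta given gamma, accumulated for the mean.
        post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
        post_mean = post_var * w / sigma**2
        theta = np.where(gamma == 1,
                         rng.normal(post_mean, np.sqrt(post_var)), 0.0)
        theta_sum += theta
    return theta_sum / n_iter  # posterior-mean coefficient estimates
```

Applied level by level to empirical wavelet coefficients and followed by the inverse transform, this yields a de-noised signal; the actual method replaces the binary indicators with a gamma Markov chain on local variances.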

2019 ◽  
Author(s):  
David Gerard ◽  
Luís Felipe Ventorim Ferrão

Abstract
Motivation: Empirical Bayes techniques to genotype polyploid organisms usually either (i) assume technical artifacts are known a priori or (ii) estimate technical artifacts simultaneously with the prior genotype distribution. Case (i) is unappealing, as it places the onus on the researcher to estimate these artifacts or to ensure that there are no systematic biases in the data. However, as we demonstrate with a few empirical examples, case (ii) makes choosing the class of prior genotype distributions extremely important. Choosing a class that is either too flexible or too restrictive results in poor genotyping performance.
Results: We propose two classes of prior genotype distributions of intermediate flexibility: the class of proportional normal distributions and the class of unimodal distributions. We provide a complete characterization of, and optimization details for, the class of unimodal distributions. We demonstrate, using both simulated and real data, that using these classes results in superior genotyping performance.
Availability and implementation: Genotyping methods that use these priors are implemented in the updog R package, available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog. All code needed to reproduce the results of this paper is available on GitHub: https://github.com/dcgerard/reproduce_prior_sims.
Supplementary information: Supplementary data are available at Bioinformatics online.
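As a rough illustration of the empirical Bayes genotyping setup (not updog's actual model, which additionally handles allele bias, overdispersion, and sequencing error), posterior dosage probabilities follow from a binomial read-count likelihood combined with a discrete prior over dosages; the prior below is a hypothetical unimodal one.

```python
import numpy as np
from scipy.stats import binom

def genotype_posterior(x, n, ploidy, prior):
    """Posterior P(dosage = k | x reference reads out of n), k = 0..ploidy,
    under a binomial likelihood with allele frequency k / ploidy and a
    discrete prior over dosages (e.g., from a unimodal class)."""
    k = np.arange(ploidy + 1)
    lik = binom.pmf(x, n, k / ploidy)
    post = prior * lik
    return post / post.sum()

# Hypothetical unimodal prior over dosages 0..4 for a tetraploid,
# peaked at dosage 2.
prior = np.array([0.05, 0.20, 0.50, 0.20, 0.05])
print(genotype_posterior(x=18, n=40, ploidy=4, prior=prior))
```

The paper's contribution concerns how such prior classes (proportional normal, unimodal) are characterized and estimated from data across many individuals, rather than fixed as in this sketch.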


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Camilo Broc ◽  
Therese Truong ◽  
Benoit Liquet

Abstract
Background: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of the sparse group Partial Least Squares (sgPLS) that takes groups of variables into account, combined with a Lasso penalization that links all of the independent data sets. This method, called joint-sgPLS, convincingly detects signal at both the variable level and the group level.
Results: Our method has the advantage of producing a globally readable model while respecting the architecture of the data. It can outperform traditional methods and provides wider insight in terms of a priori information. We compared the performance of the proposed method to other benchmark methods on simulated data and give an example of an application to real data, with the aim of highlighting common susceptibility variants for breast and thyroid cancers.
Conclusion: The joint-sgPLS shows interesting properties for detecting signal. As an extension of PLS, the method is suited to data with a large number of variables. The Lasso penalization copes with group structures among the variables and across observation sets. Furthermore, although the method has been applied to a genetic study, its formulation is adapted to any data with a large number of variables and a known a priori group structure, in other application fields as well.
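Sparsity at the group and variable level in sgPLS-type methods is typically induced by soft-thresholding the loading vector. The snippet below is a minimal sketch of group-wise soft-thresholding (the proximal operator of the group-Lasso penalty); it is illustrative only, since the actual joint-sgPLS criterion couples several data sets through a shared penalty.

```python
import numpy as np

def group_soft_threshold(u, groups, lam):
    """Set each group of coefficients to zero unless its Euclidean norm
    exceeds lam; otherwise shrink the whole group toward zero. This is
    the proximal operator of the penalty lam * sum_g ||u_g||_2."""
    out = np.zeros_like(u)
    for idx in groups:  # each idx is an array of positions forming a group
        norm = np.linalg.norm(u[idx])
        if norm > lam:
            out[idx] = (1 - lam / norm) * u[idx]
    return out

# Example: three groups of SNPs mapped to genes; the weak middle group
# is zeroed out entirely, giving selection at the gene (group) level.
u = np.array([2.0, -1.5, 0.1, 0.05, -0.08, 1.0, 0.9])
groups = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5, 6])]
print(group_soft_threshold(u, groups, lam=0.5))
```

Groups whose joint signal is weak are removed entirely, which is what yields selection at the gene or pathway level rather than SNP by SNP.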


Author(s):  
Mohadese Jahanian ◽  
Amin Ramezani ◽  
Ali Moarefianpour ◽  
Mahdi Aliari Shouredeli

One of the most significant systems that can be described by partial differential equations (PDEs) is the transmission pipeline system. To avoid accidents originating from oil and gas pipeline leaks, the exact location and quantity of a leak must be identified. The goal is leakage diagnosis based on the system model, using real data provided by transmission line systems. The nonlinear equations of the system are derived from the continuity and momentum equations. In this paper, the extended Kalman filter (EKF) is used to detect and locate the leak and to attenuate the negative effects of measurement and process noise. In addition, a robust extended Kalman filter (REKF) is applied to compensate for the effect of parameter uncertainty. The quantity and location of the leak are estimated along the pipeline. Simulation results show that the REKF estimates the leak and its location better than the EKF. This filter is robust against process noise, measurement noise, and parameter uncertainties, and it also guarantees an upper bound on the covariance of the state estimation error. Notably, the simulation results are validated with the OLGA software.
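For reference, a generic EKF predict/update step can be sketched as follows. This is not the discretized pipeline PDE model of the paper; in leak diagnosis the state vector would be augmented with the leak location and magnitude so the filter estimates them alongside pressures and flows, and the robust variant additionally bounds the error covariance under parameter uncertainty.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """One extended Kalman filter iteration: linearize the nonlinear
    dynamics f and measurement map h around the current estimate."""
    # Predict
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```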


2016 ◽  
Vol 2016 ◽  
pp. 1-15
Author(s):  
N. Vanello ◽  
E. Ricciardi ◽  
L. Landini

Independent component analysis (ICA) of functional magnetic resonance imaging (fMRI) data can be employed as an exploratory method. The ICA model's lack of strong a priori assumptions about the signal or the noise makes the results difficult to interpret. Moreover, the statistical independence of the components is only approximated. Residual dependencies among the components can reveal informative structure in the data. A major problem is model order selection, that is, choosing the number of components to be extracted; in particular, overestimation may lead to component splitting. In this work, a method based on hierarchical clustering of ICA components extracted from fMRI datasets is investigated. The clustering algorithm uses a metric based on the mutual information between the ICs. To estimate this similarity measure, a histogram-based technique and one based on kernel density estimation are tested on simulated datasets. Simulation results indicate that the method can cluster components that are related to the same task and that result from a splitting process occurring at different model orders. Differences between the performances of the two similarity measures were found and are discussed. Preliminary results on real data are reported and show that the method can group task-related and transiently task-related components.
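A minimal sketch of the histogram-based similarity step and the subsequent hierarchical clustering is shown below; the bin count, the MI-to-distance mapping, and the linkage method are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def mutual_info_hist(x, y, bins=32):
    """Histogram (plug-in) estimate of the mutual information between
    two independent-component time courses."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))

def cluster_components(ics, n_clusters=3):
    """Cluster ICs (rows of ics) with average linkage on a dissimilarity
    derived from MI: high residual mutual information => small distance."""
    n = ics.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi = mutual_info_hist(ics[i], ics[j])
            d[i, j] = d[j, i] = 1.0 / (1.0 + mi)  # monotone MI -> distance
    cond = d[np.triu_indices(n, k=1)]  # condensed form expected by scipy
    Z = linkage(cond, method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```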


2015 ◽  
Vol 2015 ◽  
pp. 1-13
Author(s):  
Jianwei Ding ◽  
Yingbo Liu ◽  
Li Zhang ◽  
Jianmin Wang

Condition monitoring systems are widely used to monitor the working condition of equipment, generating a vast amount and variety of telemetry data in the process. The main task of surveillance is to analyze these routinely collected telemetry data in order to assess the working condition of the equipment. However, with the rapid increase in the volume of telemetry data, it is a nontrivial task to analyze all of it to understand the working condition of the equipment without any a priori knowledge. In this paper, we propose a probabilistic generative model called the working condition model (WCM), which is capable of simulating the process by which event sequence data are generated and of depicting the working condition of equipment at runtime. With the help of the WCM, we are able to analyze how event sequence data behave in different working modes and, at the same time, to detect the working mode of an event sequence (working condition diagnosis). Furthermore, we have applied the WCM to illustrative applications such as the automated detection of anomalous event sequences during equipment runtime. Our experimental results on real data sets demonstrate the effectiveness of the model.
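As a hedged stand-in for the idea (the actual WCM is a richer generative model), the sketch below scores an event sequence under per-mode first-order Markov models and diagnoses the working mode by maximum likelihood; a low maximum flags a possibly anomalous sequence.

```python
import numpy as np

def fit_markov(seqs, n_events, alpha=1.0):
    """Fit a first-order Markov model (with Laplace smoothing) to the
    event sequences observed under one working mode."""
    T = np.full((n_events, n_events), alpha)
    for s in seqs:
        for a, b in zip(s[:-1], s[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def diagnose(seq, models):
    """Return the index of the working mode whose model assigns the
    highest log-likelihood to the sequence, plus that score; a low
    maximum indicates a possibly anomalous sequence."""
    def loglik(T):
        return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))
    scores = [loglik(T) for T in models]
    return int(np.argmax(scores)), max(scores)

# Hypothetical example: two working modes over 4 event types.
normal = [[0, 1, 2, 3, 0, 1, 2, 3], [0, 1, 2, 3, 0, 1]]
faulty = [[0, 3, 3, 2, 0, 3, 3], [3, 3, 2, 0]]
models = [fit_markov(normal, 4), fit_markov(faulty, 4)]
print(diagnose([0, 1, 2, 3, 0, 1], models))
```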


Biometrika ◽  
2021 ◽  
Author(s):  
Pixu Shi ◽  
Yuchen Zhou ◽  
Anru R Zhang

Abstract
In microbiome and genomic studies, regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used, where read counts are normalized into compositions. However, zero read counts and the randomness in the covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method corrects for possible overdispersion in the sequencing data while avoiding any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
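For context, the classic log-contrast baseline that the article improves on can be sketched as below: add a pseudo-count to sidestep zeros (precisely the ad hoc imputation the proposed error-in-variable method avoids), normalize to compositions, take logs, and fit least squares with a zero-sum constraint on the coefficients. This is a baseline sketch, not the proposed estimator.

```python
import numpy as np

def log_contrast_fit(counts, y, pseudo=0.5):
    """Classic log-contrast regression baseline. counts: (n, p) read
    counts; y: (n,) phenotype. The pseudo-count handles zero counts,
    which is the ad hoc step the error-in-variable approach avoids."""
    comp = (counts + pseudo) / (counts + pseudo).sum(axis=1, keepdims=True)
    Z = np.log(comp)
    p = Z.shape[1]
    # Enforce sum(beta) = 0 by writing beta = C @ alpha, where the
    # columns of C span the zero-sum subspace.
    C = np.vstack([np.eye(p - 1), -np.ones((1, p - 1))])
    Zc = Z - Z.mean(axis=0)               # center out the intercept
    alpha, *_ = np.linalg.lstsq(Zc @ C, y - y.mean(), rcond=None)
    return C @ alpha                      # p coefficients summing to zero

# Tiny noiseless check: the fit recovers a zero-sum coefficient vector.
rng = np.random.default_rng(1)
counts = rng.poisson(20, size=(50, 4))
beta_true = np.array([1.0, -1.0, 0.5, -0.5])
y = np.log((counts + 0.5) / (counts + 0.5).sum(1, keepdims=True)) @ beta_true
print(log_contrast_fit(counts, y).round(2))   # ~ [1.0, -1.0, 0.5, -0.5]
```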


Geophysics ◽  
2004 ◽  
Vol 69 (4) ◽  
pp. 978-993 ◽  
Author(s):  
Jo Eidsvik ◽  
Per Avseth ◽  
Henning Omre ◽  
Tapan Mukerji ◽  
Gary Mavko

Reservoir characterization must be based on information from various sources. Well observations, seismic reflection times, and seismic amplitude versus offset (AVO) attributes are integrated in this study to predict the distribution of the reservoir variables, i.e., facies and fluid filling. The prediction problem is cast in a Bayesian setting. The a priori model includes spatial coupling through Markov random field assumptions and intervariable dependencies through nonlinear relations based on rock physics theory, including Gassmann's relation. The likelihood model relating observations to reservoir variables (including lithology facies and pore fluids) is based on approximations to Zoeppritz equations. The model assumptions are summarized in a Bayesian network illustrating the dependencies between the reservoir variables. The posterior model for the reservoir variables conditioned on the available observations is defined by the a priori and likelihood models. This posterior model is not analytically tractable but can be explored by Markov chain Monte Carlo (MCMC) sampling. Realizations of reservoir variables from the posterior model are used to predict the facies and fluid‐filling distribution in the reservoir. A maximum a posteriori (MAP) criterion is used in this study to predict facies and pore‐fluid distributions. The realizations are also used to present probability maps for the favorable (sand, oil) occurrence in the reservoir. Finally, the impact of seismic AVO attributes—AVO gradient, in particular—is studied. The approach is demonstrated on real data from a turbidite sedimentary system in the North Sea. AVO attributes on the interface between reservoir and cap rock are extracted from 3D seismic AVO data. The AVO gradient is shown to be valuable in reducing the ambiguity between facies and fluids in the prediction.
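The final post-processing step described above is straightforward to sketch: given MCMC realizations of the categorical facies/fluid field, the marginal probability maps are empirical class frequencies and the MAP prediction is their cell-wise argmax. The class labels below are hypothetical.

```python
import numpy as np

def posterior_maps(realizations, n_classes):
    """realizations: (n_samples, ny, nx) integer class labels drawn from
    the posterior by MCMC. Returns per-class marginal probability maps
    and the cell-wise MAP classification."""
    prob = np.stack([(realizations == c).mean(axis=0)
                     for c in range(n_classes)])   # (n_classes, ny, nx)
    map_pred = prob.argmax(axis=0)                 # MAP class per cell
    return prob, map_pred

# Hypothetical classes: 0 = shale, 1 = brine sand, 2 = oil sand.
rng = np.random.default_rng(0)
samples = rng.integers(0, 3, size=(200, 10, 10))
prob, map_pred = posterior_maps(samples, 3)
print(prob[2])  # probability map of the favorable (sand, oil) class
```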


Author(s):  
Miroslav Hudec ◽  
Miljan Vučetić ◽  
Mirko Vujošević

Data mining methods based on fuzzy logic have been developed recently and have become an increasingly important research area. In this chapter, the authors examine possibilities for discovering potentially useful knowledge from relational databases by integrating fuzzy functional dependencies and linguistic summaries. Both methods use fuzzy logic tools for data analysis and for acquiring and representing expert knowledge. Fuzzy functional dependencies can detect whether a dependency between two examined attributes exists across the whole database. If a dependency exists only between parts of the examined attributes' domains, fuzzy functional dependencies cannot detect its character; linguistic summaries are a convenient method for revealing this kind of dependency. Used in a complementary way, fuzzy functional dependencies and linguistic summaries can mine valuable information from relational databases. Mining the intensities of dependencies between database attributes can support decision making, reduce the number of attributes in databases, and help estimate missing values. The proposed approach is evaluated in case studies using real data from official statistics. Strengths and weaknesses of the described methods are discussed. At the end of the chapter, topics for further research are outlined.
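The core computation behind a linguistic summary of the form "Q records are S" (e.g., "most municipalities have low pollution") reduces, in Zadeh's calculus, to averaging the summarizer's membership degrees and passing the result through the quantifier's membership function. A minimal sketch with hypothetical trapezoidal memberships:

```python
import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function: 0 outside [a, d], 1 on [b, c]."""
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - a) / (b - a), (d - x) / (d - c)), 0.0, 1.0)

def truth_degree(values, summarizer, quantifier):
    """Truth of 'Q records are S' (Zadeh): T = mu_Q(mean_i mu_S(x_i))."""
    return float(quantifier(np.mean(summarizer(values))))

# Hypothetical memberships: attribute value 'low' and quantifier 'most'.
low = lambda x: trapezoid(x, -1.0, 0.0, 20.0, 40.0)   # fully 'low' on [0, 20]
most = lambda r: trapezoid(r, 0.5, 0.85, 1.0, 1.01)   # fully 'most' near 1
data = np.array([5, 12, 18, 25, 33, 41, 8, 15])
print(truth_degree(data, low, most))  # degree to which 'most values are low'
```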


2020 ◽  
Vol 12 (18) ◽  
pp. 2923
Author(s):  
Tengfei Zhou ◽  
Xiaojun Cheng ◽  
Peng Lin ◽  
Zhenlun Wu ◽  
Ensheng Liu

Due to environmental and human factors, as well as the instrument itself, point clouds contain many uncertainties, which directly affect data quality and the accuracy of subsequent processing such as point cloud segmentation and 3D modeling. To address this problem, this paper takes the stochastic information of the point cloud coordinates into account and, building on the scanner observation principle within the Gauss–Helmert model, develops a novel general point-based self-calibration method for terrestrial laser scanners that incorporates both five additional parameters and six exterior orientation parameters. For cases where the instrument accuracy differs from its nominal value, a variance component estimation algorithm is implemented to reweight the outliers once the residual errors of the observations are obtained. Since the proposed method is essentially a nonlinear model, the Gauss–Newton iteration method is applied to solve for the additional parameters and the exterior orientation parameters. We conducted experiments on simulated and real data and compared the results with two existing methods. The experimental results showed that the proposed method improved the point accuracy from 10⁻⁴ to 10⁻⁸ (a priori known) and 10⁻⁷ (a priori unknown), and reduced the correlation among the parameters (approximately 60% of volume). However, it is undeniable that some correlations increased instead, which is a limitation of the general method.
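The Gauss–Newton step used to solve for the additional and exterior orientation parameters has the standard least-squares form; below is a generic sketch with a forward-difference Jacobian and a deliberately simple stand-in calibration problem (two hypothetical "additional parameters": a range scale and a zero offset). The actual method works with Gauss–Helmert condition equations and variance component reweighting.

```python
import numpy as np

def gauss_newton(residual, x0, tol=1e-10, max_iter=50, eps=1e-7):
    """Minimize ||residual(x)||^2 by Gauss-Newton with a forward-difference
    Jacobian. residual maps an m-vector of parameters to an n-vector of
    misclosures (n >= m)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x)
        J = np.empty((len(r), len(x)))
        for j in range(len(x)):          # numerical Jacobian, column by column
            xp = x.copy()
            xp[j] += eps
            J[:, j] = (residual(xp) - r) / eps
        dx, *_ = np.linalg.lstsq(J, -r, rcond=None)  # least-squares step
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Toy use: fit a range scale and zero offset from observed vs. true
# distances, a stand-in for scanner additional-parameter calibration.
true_d = np.array([5.0, 10.0, 20.0, 40.0])
obs_d = 1.0002 * true_d + 0.003
res = lambda p: (p[0] * true_d + p[1]) - obs_d
print(gauss_newton(res, x0=[1.0, 0.0]))   # ~ [1.0002, 0.003]
```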

