A fingerprint of a heterogeneous data set

Author(s):  
Matteo Spallanzani ◽  
Gueorgui Mihaylov ◽  
Marco Prato ◽  
Roberto Fontana

Abstract. In this paper, we describe the fingerprint method, a technique to classify bags of mixed-type measurements. The method was designed to solve a real-world industrial problem: classifying industrial plants (individuals at a higher level of organisation) starting from the measurements collected from their production lines (individuals at a lower level of organisation). In this specific application, the categorical information attached to the numerical measurements induced simple mixture-like structures on the global multivariate distributions associated with different classes. The fingerprint method is designed to compare the mixture components of a given test bag with the corresponding mixture components associated with the different classes, identifying the most similar generating distribution. When compared to other classification algorithms applied to several synthetic data sets and the original industrial data set, the proposed classifier showed remarkable improvements in performance.
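A minimal sketch of the comparison step described above, assuming each bag is a Python dict mapping a categorical key (the mixture-component label) to its numerical measurements, and using a symmetrized Kullback-Leibler divergence between fitted Gaussians as the per-component similarity; the function names and the Gaussian/KL choices are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def gauss_kl(mu0, s0, mu1, s1):
    """KL divergence between 1D Gaussians N(mu0, s0^2) and N(mu1, s1^2)."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

def fingerprint_classify(test_bag, class_bags):
    """Assign the test bag to the class whose per-category mixture
    components are, on average, most similar to the test bag's."""
    scores = {}
    for label, ref_bag in class_bags.items():
        shared = set(test_bag) & set(ref_bag)   # common categorical keys
        total = 0.0
        for cat in shared:
            x = np.asarray(test_bag[cat], float)
            y = np.asarray(ref_bag[cat], float)
            m0, s0 = x.mean(), x.std() + 1e-9
            m1, s1 = y.mean(), y.std() + 1e-9
            # symmetrized KL between the two fitted components
            total += gauss_kl(m0, s0, m1, s1) + gauss_kl(m1, s1, m0, s0)
        scores[label] = total / max(len(shared), 1)
    return min(scores, key=scores.get)          # most similar class
```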

2014 ◽  
Vol 7 (3) ◽  
pp. 781-797 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. The EPA PMF (Environmental Protection Agency positive matrix factorization) version 5.0 and the underlying multilinear engine-executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
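The classical bootstrap (BS) idea can be illustrated compactly. The sketch below resamples the rows (samples) of a nonnegative data matrix and refits a factor model each time, reporting the spread of the factor profiles; scikit-learn's NMF stands in for the PMF/ME-2 engine, and plain row resampling is a simplification of EPA PMF's block bootstrap, so this is an assumption-laden illustration rather than the tool's actual procedure:

```python
import numpy as np
from sklearn.decomposition import NMF

def bootstrap_profiles(X, n_factors=3, n_boot=100, seed=0):
    """Classical bootstrap (BS): refit the factor model on resampled
    samples of X (nonnegative, samples x species) and report the mean
    and spread of the factor profiles."""
    rng = np.random.default_rng(seed)
    base = NMF(n_components=n_factors, init="nndsvda", max_iter=500).fit(X)
    runs = []
    for _ in range(n_boot):
        rows = rng.integers(0, X.shape[0], X.shape[0])  # resample rows
        H = NMF(n_components=n_factors, init="nndsvda",
                max_iter=500).fit(X[rows]).components_
        # align each bootstrap factor with its best-matching base factor
        match = [int(np.argmax([abs(np.corrcoef(h, b)[0, 1])
                                for b in base.components_])) for h in H]
        runs.append(H[np.argsort(match)])
    runs = np.array(runs)
    return runs.mean(axis=0), runs.std(axis=0)  # BS estimate and spread
```

DISP, by contrast, displaces individual factor elements and tracks how far each can move before the fit degrades, probing rotational ambiguity; BS-DISP applies the displacement analysis to each bootstrap resample.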


Author(s):  
Danlei Xu ◽  
Lan Du ◽  
Hongwei Liu ◽  
Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, in which a set of nonlinear mappings of the original data is applied as a pre-processing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct a nonlinear classification boundary but also realize feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings, respectively. We derive a variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability of our method and its classification accuracy, which is comparable with that of several existing classifiers.
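The overall pipeline can be approximated with off-the-shelf tools. In the sketch below, random tanh projections stand in for the paper's set of nonlinear mappings, and an L1-penalized logistic regression stands in for the sparsity-promoting Gaussian-Gamma and Beta-process priors with VB inference; both substitutions are simplifications meant only to show the structure (map, then select sparsely), not the authors' model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nonlinear_map(X, n_maps=200, seed=0):
    """Random nonlinear mappings of the original features: random
    projections passed through tanh (a stand-in pre-processing step)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_maps))
    b = rng.uniform(-1.0, 1.0, n_maps)
    return np.tanh(X @ W + b)

def sparse_select_and_classify(X, y):
    """L1 logistic regression over original + mapped features: the
    nonzero weights realize feature selection in both spaces at once."""
    Z = np.hstack([X, nonlinear_map(X)])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(Z, y)
    selected = np.flatnonzero(clf.coef_[0])   # indices of retained features
    return clf, selected
```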


Geophysics ◽  
2017 ◽  
Vol 82 (3) ◽  
pp. R199-R217 ◽  
Author(s):  
Xintao Chai ◽  
Shangxu Wang ◽  
Genyang Tang

Seismic data are nonstationary due to subsurface anelastic attenuation and dispersion effects. These effects, also referred to as the earth's Q-filtering effects, can diminish seismic resolution. We previously developed a method of nonstationary sparse reflectivity inversion (NSRI) for resolution enhancement, which avoids the intrinsic instability associated with inverse Q filtering and generates superior Q compensation results. Applying NSRI to data sets that contain multiples (addressing surface-related multiples only) requires a demultiple preprocessing step because NSRI cannot distinguish primaries from multiples and will treat them as interference convolved with incorrect Q values. However, multiples contain information about subsurface properties. To use the information carried by multiples, we adapt NSRI to nonstationary seismic data with surface-related multiples using the feedback model and NSRI theory. Consequently, not only are the benefits of NSRI (e.g., circumventing the intrinsic instability associated with inverse Q filtering) extended, but multiples are also considered. Our method is limited to a 1D implementation. Theoretical and numerical analyses verify that, given a wavelet, the input Q values primarily affect the inverted reflectivities and exert little effect on the estimated multiples; i.e., multiple estimation need not consider Q-filtering effects explicitly. However, there are benefits for NSRI in considering multiples: the periodicity and amplitude of the multiples imply the position of the reflectivities and the amplitude of the wavelet, and multiples assist in overcoming the scaling and shifting ambiguities of conventional problems in which multiples are not considered. Experiments using a 1D algorithm on a synthetic data set, the publicly available Pluto 1.5 data set, and a marine data set support these findings and reveal the stability, capabilities, and limitations of the proposed method.
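The core inversion can be sketched as sparse regression against a nonstationary convolution matrix whose columns carry the wavelet attenuated by a constant-Q earth filter. The sketch below keeps only the amplitude part of the Q filter and uses a generic Lasso solver, omitting dispersion and the multiple-estimation terms, so it is a reduced illustration of the NSRI idea rather than the published algorithm:

```python
import numpy as np
from sklearn.linear_model import Lasso

def nonstationary_matrix(wavelet, n, dt, q):
    """Column j holds the wavelet delayed to sample j and attenuated by
    the constant-Q filter exp(-pi * f * t / Q) at travel time t = j*dt
    (amplitude decay only; dispersion omitted for brevity)."""
    freqs = np.fft.rfftfreq(n, dt)
    w_pad = np.zeros(n)
    w_pad[:len(wavelet)] = wavelet            # assumes len(wavelet) <= n
    Wf = np.fft.rfft(w_pad)
    G = np.zeros((n, n))
    for j in range(n):
        atten = np.exp(-np.pi * freqs * j * dt / q)
        G[:, j] = np.roll(np.fft.irfft(Wf * atten, n), j)
    return G

def nsri_sketch(trace, wavelet, dt, q, sparsity=1e-3):
    """Sparse reflectivity inversion: min ||G r - d||^2 + lambda*||r||_1,
    sidestepping the instability of explicit inverse Q filtering."""
    G = nonstationary_matrix(wavelet, len(trace), dt, q)
    model = Lasso(alpha=sparsity, max_iter=5000, fit_intercept=False)
    model.fit(G, trace)
    return model.coef_                        # Q-compensated reflectivity
```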


Geophysics ◽  
2019 ◽  
Vol 84 (5) ◽  
pp. E293-E299
Author(s):  
Jorlivan L. Correa ◽  
Paulo T. L. Menezes

Synthetic data provided by geoelectric earth models are a powerful tool for evaluating, a priori, the effectiveness of a controlled-source electromagnetic (CSEM) workflow. Marlim R3D (MR3D) is an open-source, complex, and realistic geoelectric model for CSEM simulations of the postsalt turbiditic reservoirs at the Brazilian offshore margin. We have developed a 3D CSEM finite-difference time-domain forward study to generate the full-azimuth CSEM data set for the MR3D earth model. To that end, we designed a full-azimuth survey with 45 towlines striking the north–south and east–west directions over a total of 500 receivers evenly spaced at 1 km intervals along the rugged seafloor of the MR3D model. To correctly represent the thin, disconnected, and complex geometries of the studied reservoirs, we built a finely discretized mesh of [Formula: see text] cells, leading to a large mesh with a total of approximately 90 million cells. We computed the six electromagnetic field components (Ex, Ey, Ez, Hx, Hy, and Hz) at six frequencies in the range of 0.125–1.25 Hz. To mimic the noise in real CSEM data, we added multiplicative noise with a 1% standard deviation to the data. Both CSEM data sets (noise free and noise added), with inline and broadside geometries, are distributed for research or commercial use, under a Creative Commons license, on the Zenodo platform.
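The stated noise model is simple to reproduce when working with the distributed files; the sketch below assumes the field components are loaded as NumPy arrays (the variable names are hypothetical, not part of the MR3D distribution):

```python
import numpy as np

def add_multiplicative_noise(field, std=0.01, seed=42):
    """Contaminate a noise-free CSEM field component with multiplicative
    Gaussian noise of 1% standard deviation, as described for MR3D."""
    rng = np.random.default_rng(seed)
    return field * (1.0 + rng.normal(0.0, std, size=field.shape))

# e.g., noisy_Ex = add_multiplicative_noise(Ex)  # Ex: array of Ex samples
```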


Geophysics ◽  
2014 ◽  
Vol 79 (4) ◽  
pp. EN77-EN90 ◽  
Author(s):  
Paolo Bergamo ◽  
Laura Valentina Socco

Surface-wave (SW) techniques are mainly used to retrieve 1D velocity models and are therefore characterized by a 1D approach, which might prove unsatisfactory when relevant 2D effects are present in the investigated subsurface. In the case of sharp and sudden lateral heterogeneities in the subsurface, a strategy to tackle this limitation is to estimate the location of the discontinuities and to separately process seismic traces belonging to quasi-1D subsurface portions. We have directed our attention to methods aimed at locating discontinuities by identifying anomalies in SW propagation and attenuation. The considered methods are autospectrum computation and the attenuation analysis of Rayleigh waves (AARW). These methods were developed for purposes and/or scales of analysis different from those of this work, which aims at detecting and characterizing sharp subvertical discontinuities in the shallow subsurface. We applied both methods to two data sets: synthetic data from a finite-element-method simulation and a field data set acquired over a fault system, both presenting an abrupt lateral variation perpendicularly crossing the acquisition line. We also extended the AARW method to the detection of sharp discontinuities from large and multifold data sets, and we tested these novel procedures on the field case. The two methods proved effective for detecting the discontinuity, portraying propagation phenomena linked to the presence of the heterogeneity, such as the interference between incident and reflected wavetrains and the energy concentration, with subsequent decay, at the fault location. The procedures we developed for processing multifold seismic data sets proved to be reliable tools for locating and characterizing subvertical sharp heterogeneities.
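The autospectrum-based screening lends itself to a compact sketch: compute a power spectrum per receiver and track band-limited energy along the line, where a concentration followed by an abrupt decay flags a candidate discontinuity. The array layout and frequency band below are illustrative assumptions:

```python
import numpy as np

def autospectra(traces, dt):
    """Per-receiver power spectra for a (n_receivers, n_samples) gather."""
    spectra = np.abs(np.fft.rfft(traces, axis=1)) ** 2
    freqs = np.fft.rfftfreq(traces.shape[1], dt)
    return freqs, spectra

def band_energy_profile(traces, dt, fmin=5.0, fmax=30.0):
    """Band-limited SW energy versus receiver position: interference and
    energy concentration, then decay, suggest a subvertical discontinuity."""
    freqs, spectra = autospectra(traces, dt)
    band = (freqs >= fmin) & (freqs <= fmax)
    return spectra[:, band].sum(axis=1)
```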


Geophysics ◽  
2017 ◽  
Vol 82 (1) ◽  
pp. G1-G21 ◽  
Author(s):  
William J. Titus ◽  
Sarah J. Titus ◽  
Joshua R. Davis

We apply a Bayesian Markov chain Monte Carlo formalism to the gravity inversion of a single localized 2D subsurface object. The object is modeled as a polygon described by five parameters: the number of vertices, a density contrast, a shape-limiting factor, and the width and depth of an encompassing container. We first constrain these parameters with an interactive forward model and explicit geologic information. Then, we generate an approximate probability distribution of polygons for a given set of parameter values. From these, we determine statistical distributions such as the variance between the observed and model fields, the area, the center of area, and the occupancy probability (the probability that a spatial point lies within the subsurface object). We introduce replica exchange to mitigate trapping in local optima and to compute model probabilities and their uncertainties. We apply our techniques to synthetic data sets and a natural data set collected across the Rio Grande Gorge Bridge in New Mexico. On the basis of our examples, we find that the occupancy probability is useful in visualizing the results, giving a “hazy” cross section of the object. We also find that the role of the container is important in making predictions about the subsurface object.
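Replica exchange (parallel tempering) is the remedy the authors name for trapping in local optima; a generic sketch for an arbitrary log-posterior follows, with the polygon parameterization abstracted into a real vector and the temperatures and step size chosen illustratively:

```python
import numpy as np

def replica_exchange(log_post, x0, temps=(1.0, 2.0, 4.0, 8.0),
                     n_iter=10000, step=0.1, seed=0):
    """Run one Metropolis chain per temperature; periodically attempt to
    swap neighboring chains so hot chains ferry states past local optima."""
    rng = np.random.default_rng(seed)
    xs = [np.array(x0, float) for _ in temps]
    lps = [log_post(x) for x in xs]
    samples = []
    for _ in range(n_iter):
        for k, T in enumerate(temps):             # within-chain updates
            prop = xs[k] + rng.normal(0.0, step, xs[k].shape)
            lp = log_post(prop)
            if np.log(rng.random()) < (lp - lps[k]) / T:
                xs[k], lps[k] = prop, lp
        k = int(rng.integers(0, len(temps) - 1))  # swap attempt
        accept = (1/temps[k] - 1/temps[k+1]) * (lps[k+1] - lps[k])
        if np.log(rng.random()) < accept:
            xs[k], xs[k+1] = xs[k+1], xs[k]
            lps[k], lps[k+1] = lps[k+1], lps[k]
        samples.append(xs[0].copy())              # keep the T = 1 chain
    return np.array(samples)
```

From such samples, the occupancy probability at a spatial point is simply the fraction of sampled polygons that contain that point.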


2015 ◽  
Vol 19 (12) ◽  
pp. 4747-4764 ◽  
Author(s):  
F. Alshawaf ◽  
B. Fersch ◽  
S. Hinz ◽  
H. Kunstmann ◽  
M. Mayer ◽  
...  

Abstract. Data fusion aims at integrating multiple data sources that can be redundant or complementary to produce complete, accurate information about the parameter of interest. In this work, data fusion of precipitable water vapor (PWV) estimated from remote sensing observations and data from the Weather Research and Forecasting (WRF) modeling system is applied to provide complete, high-quality grids of PWV. Our goal is to correctly infer PWV on spatially continuous, highly resolved grids from heterogeneous data sets. This is done by a geostatistical data fusion approach based on the method of fixed-rank kriging. The first data set contains absolute maps of atmospheric PWV produced by combining observations from Global Navigation Satellite Systems (GNSS) and Interferometric Synthetic Aperture Radar (InSAR). These PWV maps have a high spatial density and millimeter accuracy; however, data are missing in regions of low coherence (e.g., forests and vegetated areas). The PWV maps simulated by the WRF model represent the second data set. The model maps are available for wide areas, but they have a coarse spatial resolution and still limited accuracy. The PWV maps inferred by the data fusion, at any spatial resolution, show better quality than those inferred from a single data set. In addition, with the fixed-rank kriging method the computational burden is significantly lower than for ordinary kriging.
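Fixed-rank kriging earns its computational advantage by representing the field with r basis functions, so the solve is an r x r system instead of the n x n system of ordinary kriging. A minimal sketch, assuming Gaussian radial basis functions, a unit prior on the coefficients, and a single pooled noise variance (in a fusion setting, one would stack the GNSS/InSAR and WRF points and give each source its own noise variance):

```python
import numpy as np

def fixed_rank_krige(coords, values, centers, scale=1.0, noise_var=1.0):
    """Fit PWV ~ S(x) @ eta with r << n basis functions; the posterior
    mean of eta needs only an r x r solve, unlike ordinary kriging."""
    def S(x):  # Gaussian radial basis functions at query points x
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / scale**2)
    Sx = S(coords)                                 # (n, r) design matrix
    r = centers.shape[0]
    M = np.eye(r) + Sx.T @ Sx / noise_var          # r x r system
    eta = np.linalg.solve(M, Sx.T @ values / noise_var)
    return lambda xq: S(np.asarray(xq)) @ eta      # predictor on any grid
```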


Geophysics ◽  
2016 ◽  
Vol 81 (3) ◽  
pp. V213-V225 ◽  
Author(s):  
Shaohuan Zu ◽  
Hui Zhou ◽  
Yangkang Chen ◽  
Shan Qu ◽  
Xiaofeng Zou ◽  
...  

We have designed a periodically varying code that avoids the problem of local coherency and makes the interference distribute uniformly over a given range; hence, it is better at suppressing incoherent interference (blending noise) and preserving coherent useful signals than a random dithering code. We have also devised a new form of the iterative method to remove the interference generated by simultaneous-source acquisition. In each iteration, we estimate the interference using the blending operator following the proposed formula and then subtract the interference from the pseudodeblended data. To further eliminate the incoherent interference and constrain the inversion, the data are then transformed to an auxiliary sparse domain where a thresholding operator is applied. During the iterations, the threshold decreases from its largest value to zero following an exponential function. The exponentially decreasing threshold aims to gradually pass the deblended data to a more acceptable model subspace. Two numerically blended synthetic data sets and one numerically blended practical field data set from an ocean-bottom cable demonstrate the usefulness of our proposed method and the better performance of the periodically varying code over the traditional random dithering code.
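The iteration described above reduces to a few lines. In the sketch below, a caller-supplied `blend_op` models the acquisition's blending operator (its exact form follows the paper's formula and is not reproduced here), and a 2D FFT stands in for the auxiliary sparse transform; both are labeled assumptions:

```python
import numpy as np

def deblend(pseudo, blend_op, n_iter=50):
    """Iterative deblending: estimate interference with the blending
    operator, subtract it from the pseudodeblended data, then shrink in
    a sparse domain with an exponentially decreasing threshold."""
    d = np.zeros_like(pseudo)
    for k in range(n_iter):
        residual = pseudo - blend_op(d)        # remove predicted crosstalk
        coeffs = np.fft.fft2(residual)
        # threshold decays from the largest coefficient toward zero
        tau = np.abs(coeffs).max() * np.exp(-8.0 * k / max(n_iter - 1, 1))
        coeffs[np.abs(coeffs) < tau] = 0.0     # hard thresholding
        d = np.real(np.fft.ifft2(coeffs))
    return d
```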


Geophysics ◽  
2017 ◽  
Vol 82 (2) ◽  
pp. Q1-Q12 ◽  
Author(s):  
Carlos Alberto da Costa Filho ◽  
Giovanni Angelo Meles ◽  
Andrew Curtis

Conventional seismic processing aims to create data that contain only primary reflections, whereas real seismic recordings also contain multiples. As such, it is desirable to predict, identify, and attenuate multiples in seismic data. This task is more difficult in elastic (solid) media because mode conversions create families of internal multiples that are not present in the acoustic case. We have developed a method to predict prestack internal multiples in general elastic media based on the Marchenko method and convolutional interferometry. It can be used to identify multiples directly in prestack data or migrated sections, as well as to attenuate internal multiples by adaptively subtracting them from the original data set. We demonstrate the method on two synthetic data sets, the first composed of horizontal density layers with constant velocities and the second containing horizontal and vertical density and velocity variations. The full elastic method is computationally expensive and ideally uses data components that are not usually recorded. We therefore tested an acoustic approximation to the method on the synthetic elastic data from the second model and found that, although the spatial resolution of the resulting image is reduced by this approximation, it provides images with relatively fewer artifacts. We conclude that in most cases where cost is a factor and some resolution can be sacrificed, it may be sufficient to apply the acoustic version of this demultiple method.
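The adaptive-subtraction step at the end of the workflow is commonly implemented with a short least-squares matching filter; a single-trace sketch follows, with the filter length and damping chosen illustratively (the paper's own subtraction scheme may differ):

```python
import numpy as np

def adaptive_subtract(data, predicted, filt_len=21, eps=1e-3):
    """Estimate a matching filter f minimizing ||data - f * predicted||^2
    and subtract the matched multiples from the recorded trace."""
    n = len(data)
    P = np.zeros((n, filt_len))                 # convolution matrix
    for j in range(filt_len):
        P[j:, j] = predicted[:n - j]
    A = P.T @ P + eps * np.eye(filt_len)        # damped normal equations
    f = np.linalg.solve(A, P.T @ data)
    return data - P @ f                         # primaries estimate
```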

