Toward Cumulative Cognitive Science: A Comparison of Meta-Analysis, Mega-Analysis, and Hybrid Approaches

Open Mind ◽  
2021 ◽  
pp. 1-20
Author(s):  
Ezequiel Koile ◽  
Alejandrina Cristia

Abstract There is increasing interest in cumulative approaches to science, in which, instead of analyzing the results of individual papers separately, we integrate information qualitatively or quantitatively. One such approach is meta-analysis, which has over 50 years of literature supporting its usefulness and is becoming more common in cognitive science. However, changes in technical possibilities brought about by the widespread use of Python and R make it easier to fit more complex models and even simulate missing data. Here we recommend the use of mega-analyses (based on the aggregation of data sets collected by independent researchers) and hybrid meta-/mega-analytic approaches for cases where raw data are available for some studies. We illustrate the three approaches using a rich test-retest data set of infants' speech processing as well as synthetic data. We discuss the advantages and disadvantages of the three approaches from the viewpoint of a cognitive scientist contemplating their use, as well as the limitations of this article, to be addressed in future work.
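As a minimal illustration of the quantitative integration this abstract refers to, a random-effects meta-analysis can be fit in a few lines of Python. This is a generic DerSimonian-Laird sketch, not the authors' code; the effect sizes and variances are placeholders:

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects meta-analysis via the DerSimonian-Laird estimator."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances                          # fixed-effect weights
    theta_fe = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
    q = np.sum(w * (effects - theta_fe) ** 2)    # Cochran's Q heterogeneity statistic
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    w_re = 1.0 / (variances + tau2)              # random-effects weights
    theta_re = np.sum(w_re * effects) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return theta_re, se_re, tau2
```

A mega-analysis would instead pool the raw per-participant data into one mixed-effects model; the hybrid approach combines both when raw data exist for only some studies.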

2014 ◽  
Vol 7 (3) ◽  
pp. 781-797 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. The EPA PMF (Environmental Protection Agency positive matrix factorization) version 5.0 and the underlying multilinear engine-executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
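The classical bootstrap (BS) idea can be sketched in Python: resample the observation rows with replacement, refit the factor model, and collect the spread of the factor profiles. This is a simplified stand-in, using plain multiplicative-update nonnegative factorization rather than uncertainty-weighted PMF, and it omits the factor re-matching across bootstrap runs that EPA PMF performs to handle permutation ambiguity:

```python
import numpy as np

def nmf(x, k, n_iter=200, seed=0):
    """Tiny multiplicative-update nonnegative factorization (PMF stand-in)."""
    rng = np.random.default_rng(seed)
    n, m = x.shape
    w = rng.random((n, k)) + 0.1
    h = rng.random((k, m)) + 0.1
    eps = 1e-12
    for _ in range(n_iter):
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h

def bootstrap_profiles(x, k, n_boot=20, seed=0):
    """Classical bootstrap: resample rows, refit, collect factor profiles."""
    rng = np.random.default_rng(seed)
    profiles = []
    for _ in range(n_boot):
        idx = rng.integers(0, x.shape[0], size=x.shape[0])
        _, h = nmf(x[idx], k, seed=int(rng.integers(1 << 30)))
        profiles.append(h)
    return np.array(profiles)
```

The variability of the collected profiles approximates the uncertainty due to random errors; DISP additionally probes rotational ambiguity by displacing fitted factor elements, which this sketch does not attempt.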


Author(s):  
Danlei Xu ◽  
Lan Du ◽  
Hongwei Liu ◽  
Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, in which a set of nonlinear mappings of the original data is performed as a preprocessing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct a nonlinear classification boundary but also realize feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings, respectively. We derive the variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability and comparable classification accuracy of our method compared with other existing classifiers.
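The overall pipeline (nonlinear mappings of the inputs, then a sparsity-promoting linear classifier) can be illustrated with a much simpler analogue: random tanh mappings appended to the raw features, followed by L1-penalized logistic regression fit by proximal gradient. This is not the paper's VB algorithm with Gaussian-Gamma and Beta process priors; all data and parameters below are synthetic placeholders:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator; drives small coefficients exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_logistic(z, y, lam=0.05, lr=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient (ISTA)."""
    w = np.zeros(z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-z @ w))
        w = soft(w - lr * z.T @ (p - y) / len(y), lr * lam)
    return w

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 20))
y = (x[:, 0] > 0).astype(float)           # only feature 0 is informative
r = rng.standard_normal((20, 15))
z = np.hstack([x, np.tanh(x @ r)])        # original features + nonlinear mappings
w = sparse_logistic(z, y)
acc = np.mean(((z @ w) > 0) == (y > 0.5))
```

The L1 penalty plays the role of the sparsity-promoting priors here: most weights on irrelevant raw features and mappings end up at or near zero, while the informative feature is retained.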


Geophysics ◽  
2017 ◽  
Vol 82 (3) ◽  
pp. R199-R217 ◽  
Author(s):  
Xintao Chai ◽  
Shangxu Wang ◽  
Genyang Tang

Seismic data are nonstationary due to subsurface anelastic attenuation and dispersion effects. These effects, also referred to as the earth's Q-filtering effects, can diminish seismic resolution. We previously developed a method of nonstationary sparse reflectivity inversion (NSRI) for resolution enhancement, which avoids the intrinsic instability associated with inverse Q filtering and generates superior Q-compensation results. Applying NSRI to data sets that contain multiples (addressing surface-related multiples only) requires a demultiple preprocessing step because NSRI cannot distinguish primaries from multiples and will treat them as interference convolved with incorrect Q values. However, multiples contain information about subsurface properties. To use the information carried by multiples, we adapt NSRI, with the feedback model and NSRI theory, to the context of nonstationary seismic data with surface-related multiples. Consequently, not only are the benefits of NSRI (e.g., circumventing the intrinsic instability associated with inverse Q filtering) extended, but multiples are also taken into account. Our method is limited to a 1D implementation. Theoretical and numerical analyses verify that, given a wavelet, the input Q values primarily affect the inverted reflectivities and exert little effect on the estimated multiples; i.e., multiple estimation need not consider Q-filtering effects explicitly. Nevertheless, there are benefits to NSRI considering multiples: the periodicity and amplitude of the multiples imply the position of the reflectivities and the amplitude of the wavelet, so multiples help overcome the scaling and shifting ambiguities of conventional problems in which multiples are not considered. Experiments using a 1D algorithm on a synthetic data set, the publicly available Pluto 1.5 data set, and a marine data set support these findings and reveal the stability, capabilities, and limitations of the proposed method.
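The forward effect being undone here can be sketched with a constant-Q, amplitude-only model: each reflection's wavelet is scaled by exp(-pi * f * t / Q) at its traveltime. This is a hedged illustration of why later arrivals lose resolution, not the NSRI inversion itself, and it ignores the dispersion (phase) part of Q filtering; the reflectivity times, dominant frequency, and Q value are arbitrary:

```python
import numpy as np

def ricker(t, f):
    """Ricker wavelet of dominant frequency f (Hz)."""
    a = (np.pi * f * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def attenuated_trace(times, reflectivity, t_axis, q, f_dom=30.0):
    """Sum of wavelets with amplitudes decayed by exp(-pi * f_dom * t / Q),
    the leading-order amplitude effect of the earth's Q filtering."""
    trace = np.zeros_like(t_axis)
    for t0, r in zip(times, reflectivity):
        amp = r * np.exp(-np.pi * f_dom * t0 / q)
        trace += amp * ricker(t_axis - t0, f_dom)
    return trace

t = np.arange(0.0, 2.0, 0.002)
trace = attenuated_trace([0.4, 1.6], [1.0, 1.0], t, q=60.0)
```

Two equal reflectors thus produce unequal events; inverting for the reflectivity without the instability of directly boosting the attenuated band is the motivation for NSRI.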


Kybernetes ◽  
2019 ◽  
Vol 48 (9) ◽  
pp. 2006-2029
Author(s):  
Hongshan Xiao ◽  
Yu Wang

Purpose Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decisions, credit scoring and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance. Design/methodology/approach A measurement is first developed for measuring and identifying any significant heterogeneity that exists in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For a data set with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification. Findings The proposed approach has two main advantages over previous methods. The first advantage lies in feature transformation using orthogonal factor analysis, which results in new features without redundancy and irrelevance. The second advantage rests on partitioning samples to capture the feature space heterogeneity reflected by differences in factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmark data sets. Research limitations/implications The measurement should be used to guide the heterogeneity elimination process, which is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue. Practical implications Measuring and eliminating the feature space heterogeneity possibly existing in the data are important for accurate classification. This study provides a systematic approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques to real-world problems. Originality/value A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with the feature space heterogeneity and improve the classification accuracy.
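The "factor analysis then clustering" step can be sketched as follows. As a hedged simplification, orthogonal scores are obtained via SVD (a PCA stand-in for the paper's orthogonal factor analysis) and the sample partitioning uses plain k-means; the two-group synthetic data are placeholders:

```python
import numpy as np

def factor_scores(x, k):
    """Orthogonal scores via SVD of the centered data (PCA stand-in)."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :k] * s[:k]

def kmeans(points, k=2, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(d, axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels
```

Each resulting partition is then intended to be internally homogeneous in feature space, so that a separate classifier can be trained per partition.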


Geophysics ◽  
2019 ◽  
Vol 84 (5) ◽  
pp. E293-E299
Author(s):  
Jorlivan L. Correa ◽  
Paulo T. L. Menezes

Synthetic data provided by geoelectric earth models are a powerful tool to evaluate a priori the effectiveness of a controlled-source electromagnetic (CSEM) workflow. Marlim R3D (MR3D) is an open-source, complex, and realistic geoelectric model for CSEM simulations of the postsalt turbiditic reservoirs at the Brazilian offshore margin. We have developed a 3D CSEM finite-difference time-domain forward study to generate the full-azimuth CSEM data set for the MR3D earth model. To that end, we designed a full-azimuth survey with 45 towlines striking in the north–south and east–west directions over a total of 500 receivers evenly spaced at 1 km intervals along the rugged seafloor of the MR3D model. To correctly represent the thin, disconnected, and complex geometries of the studied reservoirs, we built a finely discretized mesh of [Formula: see text] cells, leading to a large mesh with a total of approximately 90 million cells. We computed the six electromagnetic field components (Ex, Ey, Ez, Hx, Hy, and Hz) at six frequencies in the range of 0.125–1.25 Hz. To mimic noise in real CSEM data, we added multiplicative noise with a 1% standard deviation to the data. Both CSEM data sets (noise free and noise added), with inline and broadside geometries, are distributed for research or commercial use, under the Creative Commons license, on the Zenodo platform.
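The multiplicative-noise step described above is a one-liner in practice: each sample is scaled by a factor drawn from a normal distribution with mean 1 and standard deviation 0.01. The amplitude range below is an arbitrary stand-in for field values, not taken from the MR3D data set:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(1e-14, 1e-10, size=100_000)   # stand-in field amplitudes
# 1% multiplicative noise: relative error is ~N(0, 0.01) per sample
noisy = clean * (1.0 + 0.01 * rng.standard_normal(clean.shape))
```

Multiplicative (rather than additive) noise keeps the relative error level constant across the many orders of magnitude that CSEM amplitudes span with offset.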


Author(s):  
André M. Carrington ◽  
Paul W. Fieguth ◽  
Hammad Qazi ◽  
Andreas Holzinger ◽  
Helen H. Chen ◽  
...  

Abstract Background In classification and diagnostic testing, the receiver operating characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. However, only part of the ROC curve and AUC are informative when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. However, these alternatives cannot be as fully interpreted as the AUC, in part because they ignore some information about actual negatives. Methods We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data, as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic, respectively, are validated as equal to each other, and are validated as equal in summation to whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided. Results Results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures. Conclusions The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed. These measures may be used with any data set, but this paper focuses on imbalanced data with low prevalence. Future work Future work with our proposed measures may demonstrate their value for imbalanced data with high prevalence; compare them to other measures not based on areas; and combine them with other ROC measures and techniques.
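The whole-measure identity that underlies the paper's partial versions, namely that the trapezoidal area under the ROC curve equals the c statistic (the concordance probability), is easy to verify numerically. This sketch covers only the full measures, not the paper's concordant partial AUC or partial c statistic, and assumes continuous scores (with tied scores the per-sample trapezoid below would need threshold grouping):

```python
import numpy as np

def auc_trapezoid(scores, labels):
    """Area under the ROC curve by the trapezoidal rule."""
    order = np.argsort(-scores)
    lab = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(lab) / lab.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - lab) / (1 - lab).sum()])
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))

def c_statistic(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```

The paper's contribution is to define partial analogues of both quantities, restricted to a region of the ROC plot, that preserve this kind of equality.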


Geophysics ◽  
2014 ◽  
Vol 79 (4) ◽  
pp. EN77-EN90 ◽  
Author(s):  
Paolo Bergamo ◽  
Laura Valentina Socco

Surface-wave (SW) techniques are mainly used to retrieve 1D velocity models and are therefore characterized by a 1D approach, which might prove unsatisfactory when relevant 2D effects are present in the investigated subsurface. In the case of sharp and sudden lateral heterogeneities in the subsurface, a strategy to tackle this limitation is to estimate the location of the discontinuities and to separately process seismic traces belonging to quasi-1D subsurface portions. We have directed our attention to methods aimed at locating discontinuities by identifying anomalies in SW propagation and attenuation. The considered methods are autospectrum computation and the attenuation analysis of Rayleigh waves (AARW). These methods were developed for purposes and/or scales of analysis different from those of this work, which aims at detecting and characterizing sharp subvertical discontinuities in the shallow subsurface. We applied both methods to two data sets, synthetic data from a finite-element simulation and a field data set acquired over a fault system, both presenting an abrupt lateral variation perpendicularly crossing the acquisition line. We also extended the AARW method to the detection of sharp discontinuities from large and multifold data sets, and we tested these novel procedures on the field case. The two methods proved effective for the detection of the discontinuity, portraying propagation phenomena linked to the presence of the heterogeneity, such as interference between incident and reflected wavetrains, and energy concentration as well as subsequent decay at the fault location. The procedures we developed for processing multifold seismic data sets proved to be reliable tools for locating and characterizing subvertical sharp heterogeneities.
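The "energy decay at the fault location" diagnostic can be caricatured in a few lines: compute per-trace RMS energy along the line and look for the largest jump between adjacent traces. This is a toy sketch with synthetic amplitudes, far simpler than the autospectrum or AARW analyses; the fault position and decay factor are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_traces, n_samples = 60, 400
fault_at = 30                                    # hypothetical discontinuity position
data = rng.standard_normal((n_traces, n_samples)) * 0.02
data[:fault_at] += np.sin(np.linspace(0, 40, n_samples))        # strong SW energy
data[fault_at:] += 0.3 * np.sin(np.linspace(0, 40, n_samples))  # decayed past fault

energy = np.sqrt(np.mean(data ** 2, axis=1))     # per-trace RMS energy
jump = int(np.argmax(np.abs(np.diff(energy)))) + 1  # trace index of largest drop
```

Real data require the frequency-dependent treatment the abstract describes, since interference between incident and reflected wavetrains also perturbs the energy profile near the discontinuity.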


BMJ Open ◽  
2016 ◽  
Vol 6 (10) ◽  
pp. e011784 ◽  
Author(s):  
Anisa Rowhani-Farid ◽  
Adrian G Barnett

Objective To quantify data sharing trends and data sharing policy compliance at the British Medical Journal (BMJ) by analysing the rate of data sharing practices, and to investigate attitudes and examine barriers towards data sharing. Design Observational study. Setting The BMJ research archive. Participants 160 randomly sampled BMJ research articles from 2009 to 2015, excluding meta-analyses and systematic reviews. Main outcome measures Percentages of research articles that indicated the availability of their raw data sets in their data sharing statements, and of those that easily made their data sets available on request. Results 3 articles contained the data in the article. 50 of the 157 remaining articles (32%) indicated the availability of their data sets. 12 used publicly available data, and the remaining 38 were sent email requests to access their data sets. Only 1 publicly available data set could be accessed, and only 6 of the 38 shared their data via email. Thus, only 7 of 157 research articles (4.5%, 95% CI 1.8% to 9%) shared their data sets. For the 21 clinical trials bound by the BMJ data sharing policy, the per cent shared was 24% (8% to 47%). Conclusions Despite the BMJ's strong data sharing policy, sharing rates are low. Possible explanations for the low data sharing rates could be: the wording of the BMJ data sharing policy, which leaves room for individual interpretation and possible loopholes; that our email requests ended up in researchers' spam folders; and that researchers are not rewarded for sharing their data. It might be time for a more effective data sharing policy and better incentives for health and medical researchers to share their data.
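The headline figure (7 of 157 shared, 4.5%) is a binomial proportion with a confidence interval. A stdlib sketch using the Wilson score interval is below; note this is an approximation, and the exact (Clopper-Pearson style) interval the paper reports, 1.8% to 9%, differs slightly from it:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion at ~95% confidence."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(7, 157)
```

With such a small number of successes, the interval is strongly asymmetric around the point estimate, which is why exact rather than normal-approximation intervals are preferred at these rates.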


Geophysics ◽  
2017 ◽  
Vol 82 (1) ◽  
pp. G1-G21 ◽  
Author(s):  
William J. Titus ◽  
Sarah J. Titus ◽  
Joshua R. Davis

We apply a Bayesian Markov chain Monte Carlo formalism to the gravity inversion of a single localized 2D subsurface object. The object is modeled as a polygon described by five parameters: the number of vertices, a density contrast, a shape-limiting factor, and the width and depth of an encompassing container. We first constrain these parameters with an interactive forward model and explicit geologic information. Then, we generate an approximate probability distribution of polygons for a given set of parameter values. From these, we determine statistical distributions such as the variance between the observed and model fields, the area, the center of area, and the occupancy probability (the probability that a spatial point lies within the subsurface object). We introduce replica exchange to mitigate trapping in local optima and to compute model probabilities and their uncertainties. We apply our techniques to synthetic data sets and a natural data set collected across the Rio Grande Gorge Bridge in New Mexico. On the basis of our examples, we find that the occupancy probability is useful in visualizing the results, giving a “hazy” cross section of the object. We also find that the role of the container is important in making predictions about the subsurface object.
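The replica-exchange device used to escape local optima can be illustrated on a toy problem. This is a generic parallel-tempering sketch on a 1D bimodal target, not the authors' polygon sampler; the temperature ladder and target are invented for illustration:

```python
import math
import random

def logp(x):
    """Bimodal toy target: modes near x = +/-2 separated by a barrier at 0."""
    return -((x * x - 4.0) ** 2) / 2.0

def replica_exchange(n_iter=20000, temps=(1.0, 3.0, 9.0), seed=0):
    """Metropolis chains at several temperatures with adjacent-pair swaps."""
    rng = random.Random(seed)
    xs = [0.5 for _ in temps]
    samples = []
    for _ in range(n_iter):
        # Metropolis update within each replica (hotter chains move farther)
        for i, t in enumerate(temps):
            prop = xs[i] + rng.gauss(0.0, math.sqrt(t))
            delta = (logp(prop) - logp(xs[i])) / t
            if delta >= 0 or rng.random() < math.exp(delta):
                xs[i] = prop
        # attempt a swap between a random adjacent temperature pair
        i = rng.randrange(len(temps) - 1)
        a = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (logp(xs[i + 1]) - logp(xs[i]))
        if a >= 0 or rng.random() < math.exp(a):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
        samples.append(xs[0])            # keep only the cold (target) chain
    return samples
```

A single cold chain started in one mode rarely crosses the barrier, whereas the hot replicas cross freely and hand well-mixed states down the ladder, which is exactly the trapping-mitigation role replica exchange plays in the gravity inversion.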


Geophysics ◽  
2016 ◽  
Vol 81 (3) ◽  
pp. V213-V225 ◽  
Author(s):  
Shaohuan Zu ◽  
Hui Zhou ◽  
Yangkang Chen ◽  
Shan Qu ◽  
Xiaofeng Zou ◽  
...  

We have designed a periodically varying code that avoids the problem of local coherency and makes the interference distribute uniformly in a given range; hence, it is better at suppressing incoherent interference (blending noise) and preserving coherent useful signals than a random dithering code. We have also devised a new form of the iterative method to remove interference generated by simultaneous-source acquisition. In each iteration, we estimate the interference using the blending operator following the proposed formula and then subtract the interference from the pseudodeblended data. To further eliminate the incoherent interference and constrain the inversion, the data are then transformed to an auxiliary sparse domain, where a thresholding operator is applied. During the iterations, the threshold is decreased from its largest value to zero following an exponential function. The exponentially decreasing threshold aims to gradually pass the deblended data to a more acceptable model subspace. Two numerically blended synthetic data sets and one numerically blended practical field data set from an ocean-bottom cable demonstrate the usefulness of our proposed method and the better performance of the periodically varying code over the traditional random dithering code.
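The two ingredients of the thresholding step, a threshold that decays exponentially across iterations and a shrinkage operator applied in the sparse domain, can be sketched as follows. The schedule endpoints are placeholder values, and a soft-thresholding operator is shown as one common choice (the paper does not commit to soft vs. hard thresholding here):

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink sparse-domain coefficients; values below tau are zeroed out."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def exp_schedule(t_max, t_min, n_iter):
    """Threshold decaying exponentially from t_max toward (near) zero."""
    k = np.arange(n_iter)
    return t_max * (t_min / t_max) ** (k / (n_iter - 1))
```

Early iterations with a large threshold keep only the strongest coherent events, and as the threshold decays, progressively weaker signal is admitted while the estimated interference is subtracted each pass.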

