The Multiverse of Methods: Extending the Multiverse Analysis to Address Data-Collection Decisions

2020 ◽  
Vol 15 (5) ◽  
pp. 1158-1177
Author(s):  
Jenna A. Harder

When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data-cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying this approach to 19 studies on shooting decisions to demonstrate its usefulness, and conclude with a further discussion of the method's limitations and applications.
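
As a rough illustration of the classic multiverse loop described above, the following Python sketch runs the same focal analysis over every combination of two hypothetical data decisions. The column names, decision grid, and helper functions are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of a classic multiverse analysis loop (the column names,
# decision grid, and focal model below are illustrative assumptions, not the
# paper's code).
import itertools
import pandas as pd
import statsmodels.formula.api as smf

# Each entry in the grid is one defensible data decision.
exclusion_rules = ["none", "drop_fast_rt", "drop_failed_check"]
coding_schemes = ["binary", "continuous"]

def build_dataset(raw, exclusion, coding):
    """Apply one combination of data decisions to the raw data."""
    df = raw.copy()
    if exclusion == "drop_fast_rt":
        df = df[df["rt"] >= 200]
    elif exclusion == "drop_failed_check":
        df = df[df["attention_check"] == 1]
    df["predictor"] = ((df["predictor_raw"] > 0).astype(int)
                       if coding == "binary" else df["predictor_raw"])
    return df

def run_focal_analysis(df):
    """The same focal analysis, run on every data set in the multiverse."""
    fit = smf.ols("outcome ~ predictor", data=df).fit()
    return fit.params["predictor"], fit.pvalues["predictor"]

def multiverse(raw):
    rows = []
    for exclusion, coding in itertools.product(exclusion_rules, coding_schemes):
        est, p = run_focal_analysis(build_dataset(raw, exclusion, coding))
        rows.append({"exclusion": exclusion, "coding": coding,
                     "estimate": est, "p_value": p})
    return pd.DataFrame(rows)
```

In the data-collection extension proposed here, the outer loop would instead run over the 19 real study data sets, so that each "universe" corresponds to a study with a particular data-collection method.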

2021 ◽  
Vol 4 (1) ◽  
pp. 251524592092800
Author(s):  
Erin M. Buchanan ◽  
Sarah E. Crain ◽  
Ari L. Cunningham ◽  
Hannah R. Johnson ◽  
Hannah Stash ◽  
...  

As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand their data sets’ contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a data set. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search-engine indexing to reach a broader audience of interested parties. This Tutorial first explains terminology and standards relevant to data dictionaries and codebooks. Accompanying information on OSF presents a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared data set accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we discuss freely available Web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable.
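
As a minimal sketch of the kind of machine-readable metadata a data dictionary records (the variable names and descriptions below are hypothetical, and the Tutorial's own guided workflow on OSF relies on agreed-upon standards and dedicated web applications), a simple codebook can be generated directly from a tabular data set:

```python
# A minimal sketch of building a machine-readable data dictionary for a
# tabular data set (column names and descriptions are hypothetical).
import json
import pandas as pd

df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "age": [23, 31, 27],
    "condition": ["control", "treatment", "control"],
})

descriptions = {
    "participant_id": "Anonymous participant identifier",
    "age": "Self-reported age in years",
    "condition": "Experimental condition assignment",
}

codebook = [
    {
        "variable": col,
        "type": str(df[col].dtype),
        "description": descriptions.get(col, ""),
        "missing": int(df[col].isna().sum()),
        "example": df[col].dropna().iloc[0],
    }
    for col in df.columns
]

with open("codebook.json", "w") as f:
    json.dump(codebook, f, indent=2, default=str)
```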


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. A data object having the largest support is chosen as the initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
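
A rough Python sketch of the support-based selection steps described above follows; the exact support weighting, distance measure, and tie handling in the proposed algorithm may differ from this simplified version.

```python
# A rough sketch of support-based initial center selection for categorical
# data, following the steps described in the abstract (the paper's exact
# support weighting, distance measure, and tie handling may differ).
import numpy as np
import pandas as pd

def row_support(df):
    """Support of a category = its relative frequency within the attribute;
    support of a row = sum of the supports of its category values."""
    support = np.zeros(len(df))
    for col in df.columns:
        freq = df[col].value_counts(normalize=True)
        support += df[col].map(freq).to_numpy()
    return support

def initial_centers(df, k):
    """Pick the highest-support object first, then objects farthest (in total
    simple-matching distance) from the centers already chosen."""
    support = row_support(df)
    chosen = [int(support.argmax())]
    while len(chosen) < k:
        dists = df.apply(
            lambda r: sum((r != df.iloc[i]).sum() for i in chosen), axis=1
        ).to_numpy()
        dists[chosen] = -1        # never re-pick an already chosen object
        chosen.append(int(dists.argmax()))
    return df.iloc[chosen]
```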


2017 ◽  
Vol 9 (1) ◽  
pp. 211-220 ◽  
Author(s):  
Amelie Driemel ◽  
Eberhard Fahrbach ◽  
Gerd Rohardt ◽  
Agnieszka Beszczynska-Möller ◽  
Antje Boetius ◽  
...  

Abstract. Measuring temperature and salinity profiles in the world's oceans is crucial to understanding ocean dynamics and its influence on the heat budget, the water cycle, the marine environment and on our climate. Since 1983 the German research vessel and icebreaker Polarstern has been the platform for numerous CTD (conductivity, temperature, depth instrument) deployments in the Arctic and the Antarctic. We report on a unique data collection spanning 33 years of polar CTD data. In total, 131 data sets (1 data set per cruise leg) containing data from 10 063 CTD casts are now freely available at doi:10.1594/PANGAEA.860066. During this long period, five CTD types with different characteristics and accuracies have been used. Therefore, the instruments and processing procedures (sensor calibration, data validation, etc.) are described in detail. This compilation is special not only with regard to the quantity but also the quality of the data – the latter indicated for each data set using defined quality codes. The complete data collection includes a number of repeated sections for which the quality code can be used to investigate and evaluate long-term changes. Beginning in 2010, the salinity measurements presented here are of the highest quality possible in this field owing to the introduction of the OPTIMARE Precision Salinometer.


2018 ◽  
Author(s):  
Andreas Wartel ◽  
Patrik Lindenfors ◽  
Johan Lind

Abstract. Primate brains differ in size and architecture. Hypotheses to explain this variation are numerous and many tests have been carried out. However, after body size has been accounted for, there is little left to explain. The proposed explanatory variables for the residual variation are many and covary, both with each other and with body size. Further, the data sets used in analyses have been small, especially in light of the many proposed predictors. Here we report the complete list of models that results from exhaustively combining six commonly used predictors of brain and neocortex size. This provides an overview of how the output from standard statistical analyses changes when the inclusion of different predictors is altered. By using both the most commonly tested brain data set and a new, larger data set, we show that the choice of included variables fundamentally changes the conclusions as to what drives primate brain evolution. Our analyses thus reveal why studies have had trouble replicating earlier results and instead have come to such different conclusions. Although our results are somewhat disheartening, they highlight the importance of scientific rigor when trying to answer difficult questions. It is our position that there is currently no empirical justification to highlight any of the adaptive hypotheses we have examined here as the main determinant of primate brain evolution.
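
As a sketch of what exhaustively combining predictors looks like in practice (the predictor names below are placeholders, and the original analyses rely on phylogenetic comparative methods rather than ordinary least squares), all subsets of six candidate predictors can be fit and compared after controlling for body size:

```python
# A minimal sketch of fitting all subsets of six candidate predictors of
# (log) brain size (predictor names are placeholders; the paper's analyses
# use phylogenetic comparative methods, not plain OLS).
from itertools import combinations
import pandas as pd
import statsmodels.formula.api as smf

predictors = ["group_size", "diet", "home_range", "activity_period",
              "terrestriality", "social_system"]

def all_subset_models(data, response="log_brain", control="log_body"):
    results = []
    for r in range(len(predictors) + 1):
        for subset in combinations(predictors, r):
            rhs = " + ".join((control,) + subset)   # always control for body size
            fit = smf.ols(f"{response} ~ {rhs}", data=data).fit()
            results.append({"predictors": subset, "aic": fit.aic,
                            "r2_adj": fit.rsquared_adj})
    return pd.DataFrame(results).sort_values("aic")
```

With six predictors this yields 64 models, making it easy to see how the estimate for any one predictor shifts as the others are added or removed.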


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Suleman Nasiru

The need to develop generalizations of existing statistical distributions to make them more flexible in modeling real data sets is vital in parametric statistical modeling and inference. Thus, this study develops a new class of distributions called the extended odd Fréchet family of distributions for modifying existing standard distributions. Two special models, named the extended odd Fréchet Nadarajah-Haghighi and extended odd Fréchet Weibull distributions, are proposed using the developed family. The densities and the hazard rate functions of the two special distributions exhibit different kinds of monotonic and nonmonotonic shapes. The maximum likelihood method is used to develop estimators for the parameters of the new class of distributions. The application of the special distributions is illustrated by means of a real data set. The results revealed that the special distributions developed from the new family can provide a reasonable parametric fit to the given data set compared to other existing distributions.
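
As an illustration of the maximum-likelihood estimation step only (the extended odd Fréchet densities themselves are not reproduced here; a standard Weibull model stands in purely to show the mechanics), parameters can be estimated by numerically minimizing the negative log-likelihood:

```python
# A minimal sketch of maximum-likelihood estimation for a candidate lifetime
# distribution; a standard Weibull stands in for the extended odd Fréchet
# models, whose densities are not reproduced here.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def neg_log_likelihood(params, x):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf                       # keep the search in the valid region
    return -np.sum(weibull_min.logpdf(x, c=shape, scale=scale))

def fit_mle(x, start=(1.0, 1.0)):
    res = minimize(neg_log_likelihood, start, args=(x,), method="Nelder-Mead")
    aic = 2 * len(res.x) + 2 * res.fun      # AIC = 2k - 2 log L
    return res.x, aic

# Example with simulated data standing in for the real data set
x = weibull_min.rvs(c=1.5, scale=2.0, size=200, random_state=0)
params, aic = fit_mle(x)
```

Competing families can then be compared on the same data set through criteria such as AIC, as done for the real-data illustration in the paper.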


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition increases, it is increasingly important to have benchmark systems and data sets on which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.


1994 ◽  
Vol 1 (2/3) ◽  
pp. 182-190 ◽  
Author(s):  
M. Eneva

Abstract. Using finite data sets and study volumes of limited size may result in significant spurious effects when estimating the scaling properties of various physical processes. These effects are examined with an example featuring the spatial distribution of induced seismic activity in Creighton Mine (northern Ontario, Canada). The events studied in the present work occurred during a three-month period, March-May 1992, within a volume of approximately 400 × 400 × 180 m³. Two sets of microearthquake locations are studied: Data Set 1 (14,338 events) and Data Set 2 (1654 events). Data Set 1 includes the more accurately located events and amounts to about 30 per cent of all recorded data. Data Set 2 represents a portion of the first data set, formed by the most accurately located and strongest microearthquakes. The spatial distribution of events in the two data sets is examined for scaling behaviour using the method of generalized correlation integrals featuring various moments q. From these, generalized correlation dimensions are estimated using the slope method. Similar estimates are made for randomly generated point sets using the same numbers of events and the same study volumes as for the real data. Uniform and monofractal random distributions are used for these simulations. In addition, samples from the real data are randomly extracted and their dimension spectra are examined as well. The spectra for the uniform and monofractal random generations show spurious multifractality due only to the use of finite numbers of data points and the limited size of the study volume. Comparing these with the spectra of dimensions for Data Set 1 and Data Set 2 allows us to estimate the bias likely to be present in the estimates for the real data. The strong multifractality suggested by the spectrum for Data Set 2 appears to be largely spurious; the spatial distribution, while different from uniform, could originate from a monofractal process. The spatial distribution of microearthquakes in Data Set 1 is either monofractal as well, or only weakly multifractal. In all similar studies, comparisons of results from real data and simulated point sets may help distinguish between genuine and artificial multifractality, without necessarily resorting to large numbers of data points.
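
As a minimal sketch of the slope-method estimation of generalized correlation dimensions (one common form of the generalized correlation integral is used below; the paper's exact formulation, moment orders q, and fitting ranges may differ), a dimension spectrum can be computed for a point set and compared with that of a uniform random set of the same size and study volume:

```python
# A minimal sketch of estimating generalized correlation dimensions D_q by
# the slope method, for comparison of real and randomly generated point sets.
import numpy as np
from scipy.spatial.distance import cdist

def dimension_spectrum(points, qs, radii):
    """Estimate D_q (q != 1) as the log-log slope of C_q(r) versus r."""
    d = cdist(points, points)
    np.fill_diagonal(d, np.inf)              # exclude self-pairs
    dims = {}
    for q in qs:
        cq = []
        for r in radii:
            p_i = (d < r).mean(axis=1)       # fraction of neighbours within r
            cq.append(np.mean(p_i ** (q - 1)) ** (1.0 / (q - 1)))
        cq = np.array(cq)
        mask = cq > 0                        # keep radii with nonzero counts
        slope, _ = np.polyfit(np.log(radii[mask]), np.log(cq[mask]), 1)
        dims[q] = slope
    return dims

# Example: a uniform random point set matching Data Set 2 in size and volume
rng = np.random.default_rng(0)
pts = rng.uniform([0, 0, 0], [400, 400, 180], size=(1654, 3))
radii = np.logspace(0.5, 2.0, 15)
print(dimension_spectrum(pts, qs=[2, 3, 4], radii=radii))
```

Deviations of the simulated spectrum from a flat line indicate the spurious multifractality introduced by finite sample size and limited study volume alone.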


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: an "autoregressive error degree one" structure for the trait residuals and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


Geophysics ◽  
2015 ◽  
Vol 80 (2) ◽  
pp. H13-H22 ◽  
Author(s):  
Saulo S. Martins ◽  
Jandyr M. Travassos

Most data acquisition in ground-penetrating radar is done along fixed-offset profiles, in which velocity is known only at isolated points in the survey area, at the locations of variable-offset gathers such as a common midpoint. We have constructed sparse, heavily aliased, variable-offset gathers from several fixed-offset, collinear profiles. We interpolated those gathers to produce properly sampled counterparts, thus pushing the data beyond aliasing. The interpolation methodology estimates nonstationary, adaptive filter coefficients at all trace locations, including the positions corresponding to the missing traces, which are filled with zeroed traces. This is followed by an inversion problem that uses the previously estimated filter coefficients to insert the new, interpolated traces between the original ones. We extended this two-step strategy by using filter coefficients from a denser variable-offset gather to interpolate the missing traces on a few independently constructed gathers. We applied the methodology to synthetic and real data sets, the latter acquired in the interior of the Antarctic continent. The variable-offset interpolated data opened the door to prestack processing, making feasible the production of a prestack time-migrated section and a 2D velocity model for the entire profile. Although we used a data set obtained in Antarctica, there is no reason the same methodology could not be applied elsewhere.


2016 ◽  
Author(s):  
Dorothee C. E. Bakker ◽  
Benjamin Pfeil ◽  
Camilla S. Landa ◽  
Nicolas Metzl ◽  
Kevin M. O'Brien ◽  
...  

Abstract. The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.5 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.4 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. The quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) "Living Data" publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).

