The Multiverse of Methods: Extending the Multiverse Analysis to Address Data-Collection Decisions

2020 ◽  
Vol 15 (5) ◽  
pp. 1158-1177
Author(s):  
Jenna A. Harder

When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data-cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying this approach to 19 studies on shooting decisions to demonstrate its usefulness, and conclude with a further discussion of the method's limitations and applications.
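
As a rough illustration of the classic multiverse loop described above, the following Python sketch runs the same focal analysis over every combination of two hypothetical data decisions. The column names, decision grid, and helper functions are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of a classic multiverse analysis loop (the column names,
# decision grid, and focal model below are illustrative assumptions, not the
# paper's code).
import itertools
import pandas as pd
import statsmodels.formula.api as smf

# Each entry in the grid is one defensible data decision.
exclusion_rules = ["none", "drop_fast_rt", "drop_failed_check"]
coding_schemes = ["binary", "continuous"]

def build_dataset(raw, exclusion, coding):
    """Apply one combination of data decisions to the raw data."""
    df = raw.copy()
    if exclusion == "drop_fast_rt":
        df = df[df["rt"] >= 200]
    elif exclusion == "drop_failed_check":
        df = df[df["attention_check"] == 1]
    df["predictor"] = ((df["predictor_raw"] > 0).astype(int)
                       if coding == "binary" else df["predictor_raw"])
    return df

def run_focal_analysis(df):
    """The same focal analysis, run on every data set in the multiverse."""
    fit = smf.ols("outcome ~ predictor", data=df).fit()
    return fit.params["predictor"], fit.pvalues["predictor"]

def multiverse(raw):
    rows = []
    for exclusion, coding in itertools.product(exclusion_rules, coding_schemes):
        est, p = run_focal_analysis(build_dataset(raw, exclusion, coding))
        rows.append({"exclusion": exclusion, "coding": coding,
                     "estimate": est, "p_value": p})
    return pd.DataFrame(rows)
```

In the data-collection extension proposed here, the outer loop would instead run over the 19 real study data sets, so that each "universe" corresponds to a study with a particular data-collection method.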

2021 ◽  
Vol 4 (1) ◽  
pp. 251524592092800
Author(s):  
Erin M. Buchanan ◽  
Sarah E. Crain ◽  
Ari L. Cunningham ◽  
Hannah R. Johnson ◽  
Hannah Stash ◽  
...  

As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand their data sets’ contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a data set. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search-engine indexing to reach a broader audience of interested parties. This Tutorial first explains terminology and standards relevant to data dictionaries and codebooks. Accompanying information on OSF presents a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared data set accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we discuss freely available Web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable.
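
As a minimal sketch of the kind of machine-readable metadata a data dictionary records (the variable names and descriptions below are hypothetical, and the Tutorial's own guided workflow on OSF relies on agreed-upon standards and dedicated web applications), a simple codebook can be generated directly from a tabular data set:

```python
# A minimal sketch of building a machine-readable data dictionary for a
# tabular data set (column names and descriptions are hypothetical).
import json
import pandas as pd

df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "age": [23, 31, 27],
    "condition": ["control", "treatment", "control"],
})

descriptions = {
    "participant_id": "Anonymous participant identifier",
    "age": "Self-reported age in years",
    "condition": "Experimental condition assignment",
}

codebook = [
    {
        "variable": col,
        "type": str(df[col].dtype),
        "description": descriptions.get(col, ""),
        "missing": int(df[col].isna().sum()),
        "example": df[col].dropna().iloc[0],
    }
    for col in df.columns
]

with open("codebook.json", "w") as f:
    json.dump(codebook, f, indent=2, default=str)
```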


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. A data object having the largest support is chosen as the initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
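
A rough Python sketch of the support-based selection steps described above follows; the exact support weighting, distance measure, and tie handling in the proposed algorithm may differ from this simplified version.

```python
# A rough sketch of support-based initial center selection for categorical
# data, following the steps described in the abstract (the paper's exact
# support weighting, distance measure, and tie handling may differ).
import numpy as np
import pandas as pd

def row_support(df):
    """Support of a category = its relative frequency within the attribute;
    support of a row = sum of the supports of its category values."""
    support = np.zeros(len(df))
    for col in df.columns:
        freq = df[col].value_counts(normalize=True)
        support += df[col].map(freq).to_numpy()
    return support

def initial_centers(df, k):
    """Pick the highest-support object first, then objects farthest (in total
    simple-matching distance) from the centers already chosen."""
    support = row_support(df)
    chosen = [int(support.argmax())]
    while len(chosen) < k:
        dists = df.apply(
            lambda r: sum((r != df.iloc[i]).sum() for i in chosen), axis=1
        ).to_numpy()
        dists[chosen] = -1        # never re-pick an already chosen object
        chosen.append(int(dists.argmax()))
    return df.iloc[chosen]
```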


2017 ◽  
Vol 9 (1) ◽  
pp. 211-220 ◽  
Author(s):  
Amelie Driemel ◽  
Eberhard Fahrbach ◽  
Gerd Rohardt ◽  
Agnieszka Beszczynska-Möller ◽  
Antje Boetius ◽  
...  

Abstract. Measuring temperature and salinity profiles in the world's oceans is crucial to understanding ocean dynamics and its influence on the heat budget, the water cycle, the marine environment and on our climate. Since 1983 the German research vessel and icebreaker Polarstern has been the platform for numerous CTD (conductivity, temperature, depth instrument) deployments in the Arctic and the Antarctic. We report on a unique data collection spanning 33 years of polar CTD data. In total, 131 data sets (1 data set per cruise leg) containing data from 10 063 CTD casts are now freely available at doi:10.1594/PANGAEA.860066. During this long period, five CTD types with different characteristics and accuracies have been used. Therefore, the instruments and processing procedures (sensor calibration, data validation, etc.) are described in detail. This compilation is special not only with regard to the quantity but also the quality of the data – the latter indicated for each data set using defined quality codes. The complete data collection includes a number of repeated sections for which the quality code can be used to investigate and evaluate long-term changes. Beginning in 2010, the salinity measurements presented here are of the highest quality possible in this field owing to the introduction of the OPTIMARE Precision Salinometer.


2018 ◽  
Author(s):  
Andreas Wartel ◽  
Patrik Lindenfors ◽  
Johan Lind

Abstract. Primate brains differ in size and architecture. Hypotheses to explain this variation are numerous and many tests have been carried out. However, after body size has been accounted for, there is little left to explain. The proposed explanatory variables for the residual variation are many and covary, both with each other and with body size. Further, the data sets used in analyses have been small, especially in light of the many proposed predictors. Here we report the complete list of models that results from exhaustively combining six commonly used predictors of brain and neocortex size. This provides an overview of how the output from standard statistical analyses changes when the inclusion of different predictors is altered. By using both the most commonly tested brain data set and a new, larger data set, we show that the choice of included variables fundamentally changes the conclusions as to what drives primate brain evolution. Our analyses thus reveal why studies have had trouble replicating earlier results and instead have come to such different conclusions. Although our results are somewhat disheartening, they highlight the importance of scientific rigor when trying to answer difficult questions. It is our position that there is currently no empirical justification to highlight any of the adaptive hypotheses we have examined here as the main determinant of primate brain evolution.
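
As a sketch of what exhaustively combining predictors looks like in practice (the predictor names below are placeholders, and the original analyses rely on phylogenetic comparative methods rather than ordinary least squares), all subsets of six candidate predictors can be fit and compared after controlling for body size:

```python
# A minimal sketch of fitting all subsets of six candidate predictors of
# (log) brain size (predictor names are placeholders; the paper's analyses
# use phylogenetic comparative methods, not plain OLS).
from itertools import combinations
import pandas as pd
import statsmodels.formula.api as smf

predictors = ["group_size", "diet", "home_range", "activity_period",
              "terrestriality", "social_system"]

def all_subset_models(data, response="log_brain", control="log_body"):
    results = []
    for r in range(len(predictors) + 1):
        for subset in combinations(predictors, r):
            rhs = " + ".join((control,) + subset)   # always control for body size
            fit = smf.ols(f"{response} ~ {rhs}", data=data).fit()
            results.append({"predictors": subset, "aic": fit.aic,
                            "r2_adj": fit.rsquared_adj})
    return pd.DataFrame(results).sort_values("aic")
```

With six predictors this yields 64 models, making it easy to see how the estimate for any one predictor shifts as the others are added or removed.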


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Suleman Nasiru

The need to develop generalizations of existing statistical distributions to make them more flexible in modeling real data sets is vital in parametric statistical modeling and inference. Thus, this study develops a new class of distributions called the extended odd Fréchet family of distributions for modifying existing standard distributions. Two special models, named the extended odd Fréchet Nadarajah-Haghighi and extended odd Fréchet Weibull distributions, are proposed using the developed family. The densities and the hazard rate functions of the two special distributions exhibit different kinds of monotonic and nonmonotonic shapes. The maximum likelihood method is used to develop estimators for the parameters of the new class of distributions. The application of the special distributions is illustrated by means of a real data set. The results revealed that the special distributions developed from the new family can provide a reasonable parametric fit to the given data set compared to other existing distributions.
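
As an illustration of the maximum-likelihood estimation step only (the extended odd Fréchet densities themselves are not reproduced here; a standard Weibull model stands in purely to show the mechanics), parameters can be estimated by numerically minimizing the negative log-likelihood:

```python
# A minimal sketch of maximum-likelihood estimation for a candidate lifetime
# distribution; a standard Weibull stands in for the extended odd Fréchet
# models, whose densities are not reproduced here.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def neg_log_likelihood(params, x):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf                       # keep the search in the valid region
    return -np.sum(weibull_min.logpdf(x, c=shape, scale=scale))

def fit_mle(x, start=(1.0, 1.0)):
    res = minimize(neg_log_likelihood, start, args=(x,), method="Nelder-Mead")
    aic = 2 * len(res.x) + 2 * res.fun      # AIC = 2k - 2 log L
    return res.x, aic

# Example with simulated data standing in for the real data set
x = weibull_min.rvs(c=1.5, scale=2.0, size=200, random_state=0)
params, aic = fit_mle(x)
```

Competing families can then be compared on the same data set through criteria such as AIC, as done for the real-data illustration in the paper.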


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition increases, it is increasingly important to have benchmark systems and data sets on which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.


1994 ◽  
Vol 1 (2/3) ◽  
pp. 182-190 ◽  
Author(s):  
M. Eneva

Abstract. Using finite data sets and study volumes of limited size may result in significant spurious effects when estimating the scaling properties of various physical processes. These effects are examined with an example featuring the spatial distribution of induced seismic activity in Creighton Mine (northern Ontario, Canada). The events studied in the present work occurred during a three-month period, March-May 1992, within a volume of approximately 400 × 400 × 180 m³. Two sets of microearthquake locations are studied: Data Set 1 (14,338 events) and Data Set 2 (1654 events). Data Set 1 includes the more accurately located events and amounts to about 30 per cent of all recorded data. Data Set 2 represents a portion of the first data set, formed by the most accurately located and strongest microearthquakes. The spatial distribution of events in the two data sets is examined for scaling behaviour using the method of generalized correlation integrals featuring various moments q. From these, generalized correlation dimensions are estimated using the slope method. Similar estimates are made for randomly generated point sets using the same numbers of events and the same study volumes as for the real data. Uniform and monofractal random distributions are used for these simulations. In addition, samples from the real data are randomly extracted and their dimension spectra are examined as well. The spectra for the uniform and monofractal random generations show spurious multifractality due only to the use of finite numbers of data points and the limited size of the study volume. Comparing these with the spectra of dimensions for Data Set 1 and Data Set 2 allows us to estimate the bias likely to be present in the estimates for the real data. The strong multifractality suggested by the spectrum for Data Set 2 appears to be largely spurious; the spatial distribution, while different from uniform, could originate from a monofractal process. The spatial distribution of microearthquakes in Data Set 1 is either monofractal as well, or only weakly multifractal. In all similar studies, comparisons of results from real data and simulated point sets may help distinguish between genuine and artificial multifractality, without necessarily resorting to large numbers of data points.
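
As a minimal sketch of the slope-method estimation of generalized correlation dimensions (one common form of the generalized correlation integral is used below; the paper's exact formulation, moment orders q, and fitting ranges may differ), a dimension spectrum can be computed for a point set and compared with that of a uniform random set of the same size and study volume:

```python
# A minimal sketch of estimating generalized correlation dimensions D_q by
# the slope method, for comparison of real and randomly generated point sets.
import numpy as np
from scipy.spatial.distance import cdist

def dimension_spectrum(points, qs, radii):
    """Estimate D_q (q != 1) as the log-log slope of C_q(r) versus r."""
    d = cdist(points, points)
    np.fill_diagonal(d, np.inf)              # exclude self-pairs
    dims = {}
    for q in qs:
        cq = []
        for r in radii:
            p_i = (d < r).mean(axis=1)       # fraction of neighbours within r
            cq.append(np.mean(p_i ** (q - 1)) ** (1.0 / (q - 1)))
        cq = np.array(cq)
        mask = cq > 0                        # keep radii with nonzero counts
        slope, _ = np.polyfit(np.log(radii[mask]), np.log(cq[mask]), 1)
        dims[q] = slope
    return dims

# Example: a uniform random point set matching Data Set 2 in size and volume
rng = np.random.default_rng(0)
pts = rng.uniform([0, 0, 0], [400, 400, 180], size=(1654, 3))
radii = np.logspace(0.5, 2.0, 15)
print(dimension_spectrum(pts, qs=[2, 3, 4], radii=radii))
```

Deviations of the simulated spectrum from a flat line indicate the spurious multifractality introduced by finite sample size and limited study volume alone.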


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: an "autoregressive error degree one" structure for the trait residuals and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


Geophysics ◽  
2015 ◽  
Vol 80 (2) ◽  
pp. H13-H22 ◽  
Author(s):  
Saulo S. Martins ◽  
Jandyr M. Travassos

Most data acquisition in ground-penetrating radar is done along fixed-offset profiles, in which velocity is known only at isolated points in the survey area, at the locations of variable-offset gathers such as a common midpoint. We have constructed sparse, heavily aliased, variable-offset gathers from several fixed-offset, collinear profiles. We interpolated those gathers to produce properly sampled counterparts, thus pushing the data beyond aliasing. The interpolation methodology estimates nonstationary, adaptive filter coefficients at all trace locations, including the positions corresponding to the missing traces, which are filled with zeroed traces. This is followed by an inversion problem that uses the previously estimated filter coefficients to insert the new, interpolated traces between the original ones. We extended this two-step strategy by using filter coefficients from a denser variable-offset gather to interpolate the missing traces on a few independently constructed gathers. We applied the methodology to synthetic and real data sets, the latter acquired in the interior of the Antarctic continent. The variable-offset interpolated data opened the door to prestack processing, making feasible the production of a prestack time-migrated section and a 2D velocity model for the entire profile. Although we used a data set obtained in Antarctica, there is no reason the same methodology could not be applied elsewhere.


2016 ◽  
Author(s):  
Dorothee C. E. Bakker ◽  
Benjamin Pfeil ◽  
Camilla S. Landa ◽  
Nicolas Metzl ◽  
Kevin M. O'Brien ◽  
...  

Abstract. The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.5 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.4 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. The quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) "Living Data" publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).

