Getting Started Creating Data Dictionaries: How to Create a Shareable Data Set

2021 ◽  
Vol 4 (1) ◽  
pp. 251524592092800
Author(s):  
Erin M. Buchanan ◽  
Sarah E. Crain ◽  
Ari L. Cunningham ◽  
Hannah R. Johnson ◽  
Hannah Stash ◽  
...  

As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand their data sets’ contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a data set. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search-engine indexing to reach a broader audience of interested parties. This Tutorial first explains terminology and standards relevant to data dictionaries and codebooks. Accompanying information on OSF presents a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared data set accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we discuss freely available Web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable.

2019 ◽  
Author(s):  
Erin Michelle Buchanan ◽  
Sarah E Crain ◽  
Ari L. Cunningham ◽  
Hannah Rose Johnson ◽  
Hannah Elyse Stash ◽  
...  

As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand its contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a dataset. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search engine indexing to reach a broader audience of interested parties. This tutorial first explains the terminology and standards surrounding data dictionaries and codebooks. We then present a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared dataset accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we explain how to use freely available web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable (FAIR; Wilkinson et al., 2016).
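A skeleton codebook of the kind this tutorial describes can be drafted programmatically. Below is a minimal sketch in Python, assuming a Qualtrics CSV export; the file names and the idea of leaving descriptions blank for the researcher to complete are illustrative, not the tutorial's actual workflow.

```python
import pandas as pd

# Load the source data, e.g., a Qualtrics export (file name is hypothetical).
data = pd.read_csv("survey.csv")

# Build a skeleton data dictionary: one row of metadata per variable.
codebook = pd.DataFrame({
    "variable": data.columns,
    "type": [str(t) for t in data.dtypes],
    "n_missing": data.isna().sum().to_numpy(),
    "n_unique": data.nunique().to_numpy(),
    "description": "",  # to be completed by the researcher
})

# Saved alongside the data set (e.g., on OSF), this file carries the metadata
# that helps make the data findable and reusable.
codebook.to_csv("survey_codebook.csv", index=False)
```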


2017 ◽  
Vol 9 (1) ◽  
pp. 211-220 ◽  
Author(s):  
Amelie Driemel ◽  
Eberhard Fahrbach ◽  
Gerd Rohardt ◽  
Agnieszka Beszczynska-Möller ◽  
Antje Boetius ◽  
...  

Abstract. Measuring temperature and salinity profiles in the world's oceans is crucial to understanding ocean dynamics and their influence on the heat budget, the water cycle, the marine environment and our climate. Since 1983 the German research vessel and icebreaker Polarstern has been the platform of numerous CTD (conductivity, temperature, depth instrument) deployments in the Arctic and the Antarctic. We report on a unique data collection spanning 33 years of polar CTD data. In total 131 data sets (1 data set per cruise leg) containing data from 10 063 CTD casts are now freely available at doi:10.1594/PANGAEA.860066. Over this long period five CTD types with different characteristics and accuracies have been used; the instruments and processing procedures (sensor calibration, data validation, etc.) are therefore described in detail. This compilation is special not only with regard to the quantity but also the quality of the data – the latter indicated for each data set using defined quality codes. The complete data collection includes a number of repeated sections for which the quality code can be used to investigate and evaluate long-term changes. Beginning in 2010, the salinity measurements presented here are of the highest quality possible in this field owing to the introduction of the OPTIMARE Precision Salinometer.
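For readers working with such a compilation, the per-record quality codes lend themselves to simple programmatic filtering. A minimal sketch, assuming the tab-delimited export has already been stripped of its metadata header; the file name, column name, and code values are assumptions, not the archive's actual schema.

```python
import pandas as pd

# File and column names are hypothetical; real PANGAEA exports define their own.
casts = pd.read_csv("polarstern_ctd.tab", sep="\t")

# Keep only records whose quality code marks them as good, e.g., before
# comparing repeated sections for long-term changes.
good = casts[casts["quality_code"] == 1]
print(f"{len(good)} of {len(casts)} records pass the quality filter")
```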


Author(s):  
Mark F. St. John ◽  
Woodrow Gustafson ◽  
April Martin ◽  
Ronald A. Moore ◽  
Christopher A. Korkos

Enterprises share a wide variety of data with different partners. Tracking the risks and benefits of this data sharing is important for avoiding unwarranted risks of data exploitation. Data sharing risk can be characterized as a combination of trust in data sharing partners not to exploit shared data and the sensitivity, or potential for harm, of the data. Data sharing benefits can be characterized as the value likely to accrue to the enterprise from sharing the data by making the enterprise's objectives more likely to succeed. We developed a risk visualization concept called a risk surface to support users monitoring for high risks and poor risk-benefit trade-offs. The risk surface design was evaluated in two focus group studies conducted with human factors professionals. Across the two studies, the design was improved and ultimately rated as highly useful. A risk surface needs to 1) convey which data, as joined data sets, are shared with which partners, 2) convey the degree of risk due to sharing that data, 3) convey the benefits of the data sharing and the trade-off between risk and benefits, and 4) be easy to scan at scale, since enterprises are likely to share many different types of data with many different partners.
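The trust/sensitivity/benefit framing can be made concrete with a toy scoring rule. This is a sketch only: the 1-5 scales, the product rule for risk, and the trade-off threshold are assumptions for illustration, not the paper's actual model.

```python
from dataclasses import dataclass

@dataclass
class Share:
    partner: str
    dataset: str
    trust: int        # 1 (low trust) .. 5 (high trust) -- assumed scale
    sensitivity: int  # 1 (harmless) .. 5 (severe potential harm)
    benefit: int      # 1 (little value) .. 5 (mission-critical)

def risk(s: Share) -> int:
    # Low trust combined with high sensitivity drives risk up (assumed rule).
    return (6 - s.trust) * s.sensitivity

shares = [
    Share("Partner A", "customer emails", trust=2, sensitivity=5, benefit=2),
    Share("Partner B", "aggregate sales", trust=4, sensitivity=1, benefit=4),
]

# Scan for high risks and poor risk-benefit trade-offs across all partners.
for s in shares:
    flag = "REVIEW" if risk(s) > 3 * s.benefit else "ok"
    print(f"{s.partner:10s} {s.dataset:16s} risk={risk(s):2d} benefit={s.benefit} {flag}")
```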


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition increases, it is ever more important to have benchmark systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.
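Benchmarking against such annotated data typically starts by joining raw sensor events to their ground-truth activity intervals. A minimal sketch, with entirely hypothetical file names and column layout (the released E-care@home data sets define their own schema):

```python
import pandas as pd

# Hypothetical schema: sensor events with timestamps, and labeled activity
# intervals with start/end times and an activity name.
events = pd.read_csv("ecare_home_sensors.csv", parse_dates=["timestamp"])
labels = pd.read_csv("ecare_home_labels.csv", parse_dates=["start", "end"])

def activity_at(ts):
    # Attach the ground-truth activity covering this timestamp, if any.
    hit = labels[(labels["start"] <= ts) & (ts < labels["end"])]
    return hit["activity"].iloc[0] if len(hit) else "unlabeled"

events["activity"] = events["timestamp"].map(activity_at)
print(events["activity"].value_counts())
```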


BMJ Open ◽  
2016 ◽  
Vol 6 (10) ◽  
pp. e011784 ◽  
Author(s):  
Anisa Rowhani-Farid ◽  
Adrian G Barnett

Objective: To quantify data sharing trends and data sharing policy compliance at the British Medical Journal (BMJ) by analysing the rate of data sharing practices, and to investigate attitudes and examine barriers towards data sharing. Design: Observational study. Setting: The BMJ research archive. Participants: 160 randomly sampled BMJ research articles from 2009 to 2015, excluding meta-analyses and systematic reviews. Main outcome measures: Percentages of research articles that indicated the availability of their raw data sets in their data sharing statements, and of those that readily made their data sets available on request. Results: Three articles contained the data within the article itself. Of the remaining 157 articles, 50 (32%) indicated the availability of their data sets; 12 of these used publicly available data, and email requests were sent for the remaining 38. Only 1 publicly available data set could be accessed, and only 6 of the 38 authors contacted shared their data via email, so only 7 of 157 research articles, 4.5% (95% CI 1.8% to 9%), shared their data sets. For the 21 clinical trials bound by the BMJ data sharing policy, the percentage shared was 24% (8% to 47%). Conclusions: Despite the BMJ's strong data sharing policy, sharing rates are low. Possible explanations for the low rates include the wording of the BMJ data sharing policy, which leaves room for individual interpretation and possible loopholes; email requests ending up in researchers' spam folders; and the lack of rewards for researchers who share their data. It might be time for a more effective data sharing policy and better incentives for health and medical researchers to share their data.
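The headline estimate can be verified directly from the reported counts. A minimal check using SciPy's exact binomial confidence interval (this computation is a reader's sanity check, not part of the original study's analysis):

```python
from scipy.stats import binomtest

# 7 of 157 articles shared their data sets.
result = binomtest(k=7, n=157)
ci = result.proportion_ci(confidence_level=0.95)
print(f"rate = {7/157:.1%}, 95% CI {ci.low:.1%} to {ci.high:.1%}")
# -> roughly 4.5%, 1.8% to 9.0%, matching the abstract.
```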


2016 ◽  
Author(s):  
Dorothee C. E. Bakker ◽  
Benjamin Pfeil ◽  
Camilla S. Landa ◽  
Nicolas Metzl ◽  
Kevin M. O'Brien ◽  
...  

Abstract. The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.5 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.4 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. Quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) "Living Data" publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).
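Automated range checking of the kind applied at upload can be illustrated with a few lines of code. The bounds, file name, and column name below are assumptions for illustration; they do not reproduce SOCAT's actual checks.

```python
import pandas as pd

# Hypothetical plausibility bounds for surface water fCO2 (microatmospheres).
FCO2_MIN, FCO2_MAX = 0.0, 1000.0

ds = pd.read_csv("socat_upload.csv")  # hypothetical upload file
out_of_range = ds[(ds["fCO2"] < FCO2_MIN) | (ds["fCO2"] > FCO2_MAX)]
if len(out_of_range):
    print(f"{len(out_of_range)} fCO2 values fail the range check")
```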


Author(s):  
B. Dukai ◽  
R. Peters ◽  
S. Vitalis ◽  
J. van Liempt ◽  
J. Stoter

Abstract. Fully automated reconstruction of high-detail building models on a national scale is challenging. It raises a set of problems that are seldom found when processing smaller areas, such as single cities. Often there is no reference (ground truth) available to evaluate the quality of the reconstructed models. Therefore, only relative quality metrics are computed, comparing the models to the source data sets. In this paper we present a set of relative quality metrics that we use for assessing the quality of 3D building models that were reconstructed in a fully automated process, at Levels of Detail 1.2, 1.3, and 2.2, for the whole of the Netherlands. The source data sets for the reconstruction are the Dutch Building and Address Register (BAG) and the National Height Model (AHN). The quality assessment is done by comparing the building models to these two data sources. The work presented in this paper lays the foundation for future research on the quality control and management of automated building reconstruction. Additionally, it serves as an important step in our ongoing effort toward a fully automated method for reconstructing high-detail, high-quality building models.
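A relative quality metric in the spirit described here compares the reconstructed models back to the source data rather than to ground truth. A minimal sketch; the per-building height arrays and the choice of RMSE are illustrative assumptions, not the paper's actual metrics.

```python
import numpy as np

# Hypothetical per-building heights: one from the reconstructed model,
# one aggregated from the source point cloud (e.g., AHN points per footprint).
model_heights = np.array([3.1, 6.8, 9.2, 4.5])
source_heights = np.array([3.0, 7.0, 9.5, 4.4])

# Root-mean-square deviation of the model from its own source data.
rmse = np.sqrt(np.mean((model_heights - source_heights) ** 2))
print(f"RMSE vs. source data: {rmse:.2f} m")
```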


Data ◽  
2019 ◽  
Vol 4 (1) ◽  
pp. 26 ◽  
Author(s):  
Collin Gros ◽  
Jeremy Straub

Facial recognition, as well as other types of human recognition, has found uses in identification, security, and learning about behavior, among other areas. Because of the high cost of data collection for training purposes, logistical challenges, and other impediments, mirroring images has frequently been used to increase the size of data sets. However, while these larger data sets have been shown to be beneficial, their benefit relative to collecting additional similar data has not been assessed. This paper presents a data set collected and prepared for this and related research purposes. The data set includes both non-occluded and occluded data for mirroring assessment.
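Mirroring itself is a one-line operation: each image yields a horizontally flipped copy, doubling the data set. A minimal sketch (file names are hypothetical):

```python
import numpy as np
from PIL import Image

# Load an image as an array of shape (height, width, channels).
img = np.asarray(Image.open("face_0001.png"))

# Reverse the column (width) axis to mirror the image horizontally;
# copy() makes the array contiguous for saving.
mirrored = img[:, ::-1].copy()
Image.fromarray(mirrored).save("face_0001_mirrored.png")
```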


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e18725-e18725
Author(s):  
Ravit Geva ◽  
Barliz Waissengrin ◽  
Dan Mirelman ◽  
Felix Bokstein ◽  
Deborah T. Blumenthal ◽  
...  

e18725 Background: Healthcare data sharing is important for the creation of diverse and large data sets, supporting clinical decision making, and accelerating efficient research to improve patient outcomes. This is especially vital in the case of real-world data analysis. However, stakeholders are reluctant to share their data without ensuring patients' privacy and proper protection of their data sets and the ways they are being used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so that analytics results are obtained without revealing the raw data. The aim of this study is to demonstrate the accuracy of analytics results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients' survival data, following two different treatment interventions, including 623 patients and 24 variables, amounting to 14,952 items of data, was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared with an accuracy goal of two decimal places. Results: For all variables analyzed, the differences between the raw-data and homomorphically encrypted results were within the predetermined accuracy goal; these results, together with the practical efficiency of the encrypted computation as measured by run time, are presented in the table. Conclusions: This study demonstrates that data encrypted with homomorphic encryption can be statistically analyzed with a precision of at least two decimal places, allowing clinical conclusions to be drawn safely while preserving patients' privacy and protecting data owners' data assets. Homomorphic encryption allows efficient computation on encrypted data to be performed non-interactively and without requiring decryption during computation time. Utilizing the technology will empower large-scale cross-institution and cross-stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]
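The core idea of computing on ciphertexts can be shown with a much simpler scheme than the study's. The study used leveled homomorphic encryption in the PALISADE C++ library; the sketch below instead uses the additively homomorphic Paillier scheme via the python-paillier (`phe`) package, purely to illustrate the principle. The data values are hypothetical.

```python
from phe import paillier  # python-paillier: additively homomorphic scheme,
                          # a deliberate stand-in for PALISADE's leveled HE.

public_key, private_key = paillier.generate_paillier_keypair()

# Hypothetical raw values, e.g., survival months for three patients.
survival_months = [14, 23, 9]
encrypted = [public_key.encrypt(m) for m in survival_months]

# The analyst sums the ciphertexts without ever seeing the raw data.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the key holder can decrypt the aggregate result.
print(private_key.decrypt(encrypted_total))  # -> 46
```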


2016 ◽  
Vol 12 (2) ◽  
pp. 182-203 ◽  
Author(s):  
Joke H. van Velzen

This mixed methods study had two purposes: to investigate (a) the realistic meaning of awareness and understanding as the underlying constructs of general knowledge of the learning process and (b) a procedure for data consolidation. The participants were 11th-grade high school and first-year university students. Integrated data collection and data transformation yielded positive but small correlations between awareness and understanding. A comparison of the newly created combined and integrated data sets showed that the integrated data set yielded the expected statistically significant outcome, which was in line with the participants' developmental difference. This study can contribute to mixed methods research because it proposes a procedure for data consolidation and a new research design.
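Data consolidation of this kind typically merges quantitized qualitative codes with quantitative scores before correlating them. A minimal sketch; the variable names, values, and the simple merge-then-correlate procedure are assumptions for illustration, not the study's actual design.

```python
import pandas as pd

# Hypothetical quantitized interview codes (awareness) and test scores
# (understanding) for the same participants.
qual = pd.DataFrame({"student": [1, 2, 3],
                     "awareness": [2, 3, 1]})
quant = pd.DataFrame({"student": [1, 2, 3],
                      "understanding": [55, 70, 40]})

# Consolidate into one integrated data set, then correlate the constructs.
merged = qual.merge(quant, on="student")
print(merged["awareness"].corr(merged["understanding"]))
```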

