Using a linked table-based structure to encode self-describing multiparameter spatiotemporal data

FACETS ◽  
2018 ◽  
Vol 3 (1) ◽  
pp. 326-337 ◽  
Author(s):  
Dewey W. Dunnington ◽  
Ian S. Spooner

Multiparameter data with both spatial and temporal components are critical to advancing the state of environmental science. These data and data collected in the future are most useful when compared with each other and analyzed together, which is often inhibited by inconsistent data formats and a lack of structured documentation provided by researchers and (or) data repositories. In this paper we describe a linked table-based structure that encodes multiparameter spatiotemporal data and their documentation that is both flexible (able to store a wide variety of data sets) and usable (can easily be viewed, edited, and converted to plottable formats). The format is a collection of five tables (Data, Locations, Params, Data Sets, and Columns), on which restrictions are placed to ensure data are represented consistently from multiple sources. These tables can be stored in a variety of ways including spreadsheet files, comma-separated value (CSV) files, JavaScript object notation (JSON) files, databases, or objects in a software environment such as R or Python. A toolkit for users of R statistical software was also developed to facilitate converting data to and from the data format. We have used this format to combine data from multiple sources with minimal metadata loss and to effectively archive and communicate the results of spatiotemporal studies. We believe that this format and associated discussion of data and data storage will facilitate increased synergies between past, present, and future data sets in the environmental science community.
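By way of illustration, the sketch below lays out the five linked tables described in the abstract using pandas; the table names (Data, Locations, Params, Data Sets, Columns) follow the paper, but the specific columns, keys, and values are assumptions for demonstration rather than the published specification.

```python
# Minimal sketch of the five-table layout described above (pandas assumed);
# field names beyond the table names themselves are illustrative assumptions.
import pandas as pd

locations = pd.DataFrame({"location": ["LAKE-1"], "lat": [44.64], "lon": [-63.58]})
params = pd.DataFrame({"param": ["temp"], "label": ["Water temperature"], "unit": ["degC"]})
datasets = pd.DataFrame({"dataset": ["survey_2018"], "doi": [None]})
columns = pd.DataFrame({"dataset": ["survey_2018"], "table": ["data"],
                        "column": ["value"], "type": ["double"]})
data = pd.DataFrame({"dataset": ["survey_2018"] * 2,
                     "location": ["LAKE-1"] * 2,
                     "param": ["temp"] * 2,
                     "date": pd.to_datetime(["2018-06-01", "2018-07-01"]),
                     "value": [14.2, 18.9]})

# The shared keys (dataset, location, param) let the tables be joined back
# into a single plottable table when needed.
plottable = data.merge(params, on="param").merge(locations, on="location")
print(plottable)

# Each table can be stored as its own CSV file, spreadsheet sheet, or database table.
for name, tbl in {"data": data, "locations": locations, "params": params,
                  "datasets": datasets, "columns": columns}.items():
    tbl.to_csv(f"{name}.csv", index=False)
```

Keeping the metadata (locations, parameters, column documentation) in their own tables rather than embedded in file headers is what allows data from multiple sources to be combined without losing documentation.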

2012 ◽  
pp. 862-880
Author(s):  
Russ Miller ◽  
Charles Weeks

Grids represent an emerging technology that allows geographically- and organizationally-distributed resources (e.g., computer systems, data repositories, sensors, imaging systems, and so forth) to be linked in a fashion that is transparent to the user. The New York State Grid (NYS Grid) is an integrated computational and data grid that provides access to a wide variety of resources to users from around the world. NYS Grid can be accessed via a Web portal, where the users have access to their data sets and applications, but do not need to be made aware of the details of the data storage or computational devices that are specifically employed in solving their problems. Grid-enabled versions of the SnB and BnP programs, which implement the Shake-and-Bake method of molecular structure (SnB) and substructure (BnP) determination, respectively, have been deployed on NYS Grid. Further, through the Grid Portal, SnB has been run simultaneously on all computational resources on NYS Grid as well as on more than 1100 of the over 3000 processors available through the Open Science Grid.


2020 ◽  
Author(s):  
Lesley Wyborn

Internationally, Earth and environmental science datasets have the potential to contribute significantly to resolving major societal challenges such as those outlined in the United Nations 2030 Sustainable Development Goals (SDGs). By 2030, we know that leading-edge computational infrastructures will be exascale (repositories, supercomputers, cloud, etc.) and that these will facilitate realistic resolution of research challenges at scales and resolutions that cannot be undertaken today. Hence, by 2030, the capability of Earth and environmental science researchers to make valued contributions will depend on developing a global capacity to integrate data online from multiple distributed, heterogeneous repositories. Are we on the right path to achieve this?

Today, online data repositories are a growing part of the research infrastructure ecosystem: their number and diversity have been slowly increasing over recent years to meet demands that traditional institutional or other generic repositories can no longer satisfy. Although more specialised repositories are available (e.g., those for petascale-volume data sets and domain-specific long-tail, complex data sets), funding for these specialised repositories is rarely long term.

Through initiatives such as the Commitment Statement from the Coalition for Publishing Data in the Earth and Space Sciences, publishers are now requiring that datasets supporting a publication be curated and stored in a 'trustworthy' repository that can provide a DOI and a landing page for that dataset and, if possible, can also provide some domain quality assurance to ensure that data sets are not only Findable and Accessible, but also Interoperable and Reusable. But the demand for suitable domain expertise to provide the "I" and the "R" far exceeds what is available. As a last resort, frustrated researchers are simply depositing the datasets that support their publications into generic repositories such as Figshare and Zenodo, which simply store the data file: rarely are domain-specific QA/QC procedures applied to the data.

These generic repositories do ensure that data are not sitting on inaccessible personal C-drives and USB drives, but the content is rarely interoperable. Interoperability can only be achieved by repositories that have the domain expertise to curate the data properly and ensure that the data meet minimum community standards and specifications that will enable online aggregation into global reference sets. In addition, most researchers are only depositing the files that support a particular publication, and because these files can be highly processed and generalised, they are difficult to reuse outside the context of the specific research publication.

To achieve the ambition of Earth and environmental science datasets being reusable and interoperable and making a major contribution to the SDGs by 2030, today we need:

1. More effort and coordination in the development of international community standards to enable technical, semantic, and legal interoperability of datasets;
2. To ensure that publicly funded research data are also available without further manipulation or conversion to facilitate their broader reuse in scientific research, particularly as by 2030 we will also have greater computational capacity to analyse data at scales and resolutions currently not achievable.


2020 ◽  
Author(s):  
Chris Pankratz ◽  
Thomas Baltzer ◽  
Greg Lucas ◽  
James Craft ◽  
Thomas Berger ◽  
...  

The Space Weather Technology, Research and Education Center (SWx TREC) is a center of excellence in cross-disciplinary research, technology, innovation, and education, intended to meet evolving space weather research and forecasting needs. SWx TREC facilitates research advances, innovative missions, and data and computing technologies that directly support the needs of the SWx community to advance understanding and support closure of the Research-to-Operations (R2O) and Operations-to-Research (O2R) loop. Improving our understanding and prediction of space weather requires coupled research and operations. SWx TREC is working to provide new research models, applications, and data for use in operational environments, improving the R2O pipeline. Advancement in the fundamental scientific understanding of space weather processes is also vital, requiring that researchers have convenient and effective access to a wide variety of data sets and models from multiple sources. The space weather research community, as with many scientific communities, must access data from dispersed and often uncoordinated data repositories to acquire the data necessary for the analysis and modeling efforts that advance our understanding of solar influences and space physics in the Earth's environment. The University of Colorado (CU) is a leading institution in both producing data products and advancing the state of scientific understanding of space weather processes, and we are now hosting both an interoperable data portal providing streamlined, centralized, and event-based access to a wide variety of disparate data sets and a community-accessible, cloud-based testbed environment to support development, testing, transition, and use of new models, visualizations, algorithms, and forecast products. In this presentation, we will describe our community-accessible testbed environment and demonstrate the Space Weather Data Portal.


2020 ◽  
Vol 10 ◽  
pp. 12
Author(s):  
Asti Bhatt ◽  
Todd Valentic ◽  
Ashton Reimer ◽  
Leslie Lamarche ◽  
Pablo Reyes ◽  
...  

The Reproducible Software Environment (Resen) is an open-source software tool enabling computationally reproducible scientific results in the geospace science community. Resen was developed as part of a larger project called the Integrated Geoscience Observatory (InGeO), which aims to help geospace researchers bring together diverse datasets from disparate instruments and data repositories, with software tools contributed by instrument providers and community members. The main goals of InGeO are to remove barriers to accessing, processing, and visualizing geospatially resolved data from multiple sources using methodologies and tools that are reproducible. The architecture of Resen combines two mainstream open-source software tools, Docker and JupyterHub, to produce a software environment that not only facilitates computationally reproducible research results but also facilitates effective collaboration among researchers. In this technical paper, we discuss some challenges in performing reproducible science and a potential solution via Resen, which is demonstrated using a case study of a geospace event. Finally, we discuss how the use of mainstream, open-source technologies appears to provide a more sustainable path towards enabling reproducible science than proprietary, closed-source software.
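The mechanism Resen builds on can be illustrated with a short sketch: launching a pinned container image that serves a Jupyter environment. The snippet below uses the Docker Python SDK and is only an illustration of that Docker + Jupyter pattern, not Resen's own commands or API; the image tag and host paths are hypothetical.

```python
# Illustrative sketch of the Docker mechanism underlying tools like Resen;
# NOT Resen's actual interface. Image tag and paths are placeholders.
import docker

client = docker.from_env()

# Pinning an exact image tag is what makes the environment reproducible:
# anyone who pulls the same tag gets the same software stack.
container = client.containers.run(
    "jupyter/scipy-notebook:2023-10-20",  # hypothetical pinned tag
    ports={"8888/tcp": 8888},
    volumes={"/home/me/ingeo-work": {"bind": "/home/jovyan/work", "mode": "rw"}},
    detach=True,
)
print(f"Jupyter container started: {container.short_id}")
```

Resen wraps this kind of workflow behind a simpler interface and bundles community-contributed geospace tools into the shared image, so collaborators work against identical environments.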


Author(s):  
Jessica Galo ◽  
Tim Choi ◽  
Maria Kim-Bautista

Introduction: Research increasingly involves linking data from multiple sources, including data collected by researchers. This creates complexity because data providers often have differing policies and requirements for data access. Harmonization of processes requires resources, especially as new data providers are added, and needs to be prioritized appropriately.
Objectives: Our objectives were to (1) understand the challenges encountered by researchers interested in collecting data and/or linking multiple data sets; and (2) outline and evaluate Population Data (PopData) BC's efforts to harmonize documentation and processes to address these challenges. With this information, we aim to better support research and streamline the data access request process.
Approach: We compared data access timelines of projects that did and did not utilize harmonized templates, including consent forms, data access request forms, and research agreements. We then identified the challenges arising from non-harmonized requirements, including their number and complexity, and developed priorities for action.
Results: While existing consent form templates provided the ethics board-required language to support the collection of researcher-collected data, they lacked the text requirements of the administrative data stewards/providers. These text deficiencies slow down the data access request process, affect data provider workflow, and can be associated with researcher costs to re-consent. To address these gaps, harmonized consent templates were developed and finalized in November 2017. These templates included the data steward text requirements on governance, data sets, data transfer, data storage, and withdrawal. Non-harmonized data access request forms and research agreements varied in format and detail and resulted in coordination challenges and delays. A harmonized form was developed to capture key information required by all stakeholders. Research agreement harmonization discussions are underway. Impact evaluation is ongoing.
Conclusion/Implications: The complexity of multi-stakeholder dataset research need not extend to the data access process. Coordinated requirements and harmonized documentation reduce the burden on all stakeholders, including researchers, ethics boards, and data stewards, and improve project timelines.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hossein Ahmadvand ◽  
Fouzhan Foroutan ◽  
Mahmood Fathy

Data variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and the uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU consumption. This issue has been overlooked in previous works. To overcome this problem, in the present work we used Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this end, we consider two types of deadlines as our constraint. Before applying the DVFS technique to compute nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we used a set of data sets and applications. The experimental results show that our proposed approach outperforms the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to a 15% improvement in energy consumption.
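The core idea of deadline-aware frequency selection can be sketched as follows. This is an illustrative simplification rather than the paper's DV-DVFS algorithm: it assumes execution time scales roughly inversely with CPU frequency, and the function and frequency list are hypothetical.

```python
# Illustrative sketch of deadline-aware frequency selection (not the paper's
# exact DV-DVFS algorithm). Assumes execution time is roughly inversely
# proportional to CPU frequency, a common simplification.

def pick_frequency(est_time_at_fmax: float, deadline: float,
                   f_max: float, available_freqs: list[float]) -> float:
    """Return the lowest available frequency that still meets the deadline."""
    if est_time_at_fmax > deadline:
        return f_max  # even full speed cannot meet the deadline; run flat out
    f_needed = f_max * est_time_at_fmax / deadline
    # Choose the smallest supported frequency at or above the requirement.
    candidates = [f for f in sorted(available_freqs) if f >= f_needed]
    return candidates[0] if candidates else f_max

# Example: a task estimated at 40 s at 3.0 GHz with a 60 s deadline
# can be run at 2.0 GHz, saving energy while still finishing on time.
print(pick_frequency(40.0, 60.0, 3.0, [1.2, 1.6, 2.0, 2.4, 3.0]))
```

Running at the lowest frequency that still meets the deadline is what yields the energy savings: voltage scales down with frequency, so power drops faster than execution time grows.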


2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure the possibility of reproducing the results and comparing them with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study has analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following demanding functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities, and (4) providing protection mechanisms for licensing issues and user rights. To demonstrate the introduced functionality, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at the URL https://rdata.4spam.group to facilitate understanding of this study.


2021 ◽  
pp. 2001181
Author(s):  
Jia‐Qin Yang ◽  
Ye Zhou ◽  
Su‐Ting Han

2021 ◽  
pp. 000276422110216
Author(s):  
Jasmine Lorenzini ◽  
Hanspeter Kriesi ◽  
Peter Makarov ◽  
Bruno Wüest

Protest event analysis (PEA) is a key method for studying social movements, making it possible to systematically analyze protest events over time and space. However, the manual coding of protest events is time-consuming and resource-intensive. Recently, advances in automated approaches offer opportunities to code multiple sources and create large data sets that span many countries and years. However, too often the procedures used are not discussed in detail and, therefore, researchers have a limited capacity to assess the validity and reliability of the data. In addition, many researchers have highlighted biases associated with the study of protest events that are reported in the news. In this study, we ask how social scientists can build on electronic news databases and computational tools to create reliable PEA data that cover a large number of countries over a long period of time. We provide a detailed description of our semiautomated approach and we offer an extensive discussion of potential biases associated with the study of protest events identified in international news sources.

