Cloud-Native Repositories for Big Scientific Data

Author(s): Ryan Abernathey, Tom Augspurger, Anderson Banihirwe, Charles C. Blackmon-Luca, Timothy J. Crone, ...

Scientific data have traditionally been distributed via downloads from data servers to local computers. This way of working suffers from limitations as scientific datasets grow toward the petabyte scale. A “cloud-native data repository,” as defined in this paper, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready, cloud-optimized (ARCO) formats and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved to realize cloud computing’s full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.
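To make the access pattern concrete, below is a minimal sketch of the ARCO workflow the abstract describes: opening a chunked Zarr store directly from object storage with xarray and computing in parallel with Dask. The store URL and variable name are hypothetical; real Pangeo catalogs publish concrete store paths.

```python
# Minimal sketch of Pangeo-style ARCO data access. Requires xarray,
# dask[distributed], zarr, and gcsfs (for gs:// URLs).
import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster here; in the cloud this would be a
                   # Kubernetes- or HPC-backed cluster spun up on demand

# Hypothetical store path; a real catalog entry would supply the URL.
ds = xr.open_zarr("gs://example-bucket/sst.zarr", consolidated=True)

# Lazy, chunk-wise reduction: nothing is transferred until .compute(),
# and then each worker fetches only the chunks it needs.
monthly_mean = ds["sst"].groupby("time.month").mean()
print(monthly_mean.compute())
```

Because the data never pass through a download step, throughput is limited by object-store bandwidth and worker count rather than by a single server, which is how rates beyond 10 GB/s become reachable.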

2018, Vol. 42(1), pp. 124-142
Author(s): Youngseek Kim, Seungahn Nah

Purpose: The purpose of this paper is to examine how data reuse experience, attitudinal beliefs, social norms, and resource factors influence internet researchers to share data with other researchers outside their teams.

Design/methodology/approach: An online survey was conducted to examine the extent to which data reuse experience, attitudinal beliefs, social norms, and resource factors predicted internet researchers’ data sharing intentions and behaviors. The theorized model was tested using structural equation modeling on a total of 201 survey responses from the Association of Internet Researchers mailing list.

Findings: Results show that data reuse experience significantly influenced participants’ perception of benefit from data sharing and their norm of data sharing. Belief structures regarding data sharing, including perceived career benefit, perceived career risk, and perceived effort, had significant associations with attitude toward data sharing, leading internet researchers to have greater data sharing intentions and behaviors. The results also reveal that researchers’ norms for data sharing had a direct effect on data sharing intention. Furthermore, while the perceived availability of data repositories did not have a positive impact on data sharing intention, it had a significant, direct, positive impact on researchers’ data sharing behaviors.

Research limitations/implications: This study validated its novel theorized model based on the theory of planned behavior (TPB). The study showed a holistic picture of how different data sharing factors, including data reuse experience, attitudinal beliefs, social norms, and data repositories, influence internet researchers’ data sharing intentions and behaviors.

Practical implications: Data reuse experience, attitude toward and norm of data sharing, and the availability of data repositories had either direct or indirect influence on internet researchers’ data sharing behaviors. Thus, professional associations, funding agencies, and academic institutions alike should promote academic cultures that value data sharing in order to create a virtuous cycle of reciprocity and encourage researchers to hold positive attitudes toward, and norms of, data sharing; these cultures should be strengthened by strong support for data repositories.

Originality/value: In line with prior scholarship on scientific data sharing, this study of internet researchers offers a map of scientific data sharing intentions and behaviors by examining the impacts of data reuse experience, attitudinal beliefs, social norms, and data repositories together.
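For readers unfamiliar with the method, here is a hedged sketch of how a path model of the shape described above could be estimated in Python with the semopy package. The paths are paraphrased from the abstract, each construct is assumed to have been reduced to an observed composite score, and the column names and survey file are hypothetical; this is not the authors' actual model specification.

```python
# Illustrative path-analysis sketch using lavaan-style syntax ("~" denotes
# a regression path). Requires pandas and semopy.
import pandas as pd
from semopy import Model

model_desc = """
perceived_benefit ~ reuse_experience
norm ~ reuse_experience
attitude ~ perceived_benefit + perceived_risk + perceived_effort
intention ~ attitude + norm + repo_availability
behavior ~ intention + repo_availability
"""

# Hypothetical file with one composite-score column per construct.
data = pd.read_csv("survey_responses.csv")

model = Model(model_desc)
model.fit(data)
print(model.inspect())  # path coefficients, standard errors, p-values
```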


2021, Vol. 10(3)
Author(s): Fernando Rios, Chun Ly

Objective: To increase data quality and ensure compliance with appropriate policies, many institutional data repositories curate the data deposited into their systems. Here, we present our experience as an academic library implementing and managing a semi-automated, cloud-based data curation workflow for a recently launched institutional data repository. Based on our experiences, we then present management observations intended for data repository managers and technical staff looking to move some or all of their curation services to the cloud.

Methods: We implemented tooling for our curation workflow in a service-oriented manner, making significant use of our data repository platform’s application programming interface (API). With an eye toward sustainability, a guiding development philosophy has been to automate processes following industry best practices while avoiding solutions with high resource needs (e.g., maintenance) and minimizing the risk of becoming locked in to specific tooling.

Results: The initial barrier to implementing a data curation workflow in the cloud was high compared to on-premises curation, mainly due to the need to develop in-house cloud expertise. However, compared to the cost of on-premises servers and storage, infrastructure costs have been substantially lower. Furthermore, in our particular case, once the foundation had been established, the cloud approach increased our agility, allowing us to quickly automate our workflow as needed.

Conclusions: Workflow automation has put us on a path toward scaling the service, and a cloud-based approach has helped reduce initial costs. However, because cloud-based workflows and automation come with maintenance overhead, it is important to build tooling that follows software development best practices and can be decoupled from curation workflows to avoid lock-in.
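As an illustration of the service-oriented, API-driven approach the abstract describes, here is a minimal sketch of an automated curation check run against a repository REST API. The base URL, endpoints, and field names are hypothetical; every platform defines its own API, so this shows the pattern, not the authors' implementation.

```python
# Hypothetical semi-automated curation step: list pending deposits and run
# checks a curator would otherwise perform by hand. Requires requests.
import requests

API = "https://repository.example.edu/api"  # hypothetical endpoint
TOKEN = "..."                               # curator credential

def pending_deposits():
    """Fetch deposits awaiting curation (hypothetical endpoint/shape)."""
    r = requests.get(f"{API}/deposits?status=pending",
                     headers={"Authorization": f"Bearer {TOKEN}"})
    r.raise_for_status()
    return r.json()

def run_checks(deposit):
    """Automated policy checks; flag problems for a human curator."""
    problems = []
    if not deposit.get("description"):
        problems.append("missing description")
    if not any(f["name"].lower().startswith("readme")
               for f in deposit.get("files", [])):
        problems.append("no README file")
    return problems

for dep in pending_deposits():
    issues = run_checks(dep)
    print(dep["id"], "OK" if not issues else issues)
```

Keeping such checks in small, API-facing scripts, rather than wiring them into the platform itself, is what allows the tooling to be decoupled from the workflow and swapped out if the platform changes.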


2020, Vol. 22(5/6), pp. 389-411
Author(s): Muhammad Al-Abdullah, Izzat Alsmadi, Ruwaida AlAbdullah, Bernie Farkas

Purpose: The paper posits that a solution for businesses seeking privacy-friendly data repositories for their customers’ data is to change from the traditional centralized repository to a trusted, decentralized one. Blockchain is a technology that provides such a data repository. However, the European Union’s General Data Protection Regulation (GDPR) assumed a centralized data repository, and it is commonly argued that blockchain technology is therefore unusable for GDPR compliance. This paper aims to posit a framework for adopting a blockchain that follows the GDPR.

Design/methodology/approach: The paper uses Levy and Ellis’ narrative review of literature methodology, which is based on the constructivist theory posited by Lincoln and Guba. Using five information systems and computer science databases, the researchers searched for studies using the keywords GDPR and blockchain, applying a forward and backward search technique. The search identified a corpus of 416 candidate studies, from which the researchers applied pre-established criteria to select 39 studies. The researchers mined this corpus for concepts, which they clustered into themes. Using the accepted computer science practice of privacy by design, the researchers combined the clustered themes into the paper’s posited framework.

Findings: The paper posits a framework that provides architectural tactics for designing a blockchain that follows the GDPR to enhance privacy. The framework explicitly addresses the challenges of GDPR compliance using the previously unimagined decentralized storage of personal data. The framework addresses the blockchain–GDPR tension by establishing trust between a business and its customers vis-à-vis storing customers’ data. The trust is established through blockchain’s capability of providing the customer with private keys and control over their data, e.g. processing and access.

Research limitations/implications: The paper provides a framework that demonstrates that blockchain technology can be designed for use in GDPR-compliant solutions. In using the framework, a blockchain-based solution provides the ability to audit and monitor privacy measures, demonstrates a legal justification for processing activities, incorporates a data privacy policy, provides a map for data processing, and ensures security and privacy awareness among all actors. The research is limited to a focus on blockchain–GDPR compliance; future research is needed to investigate the use of the framework in specific domains.

Practical implications: The paper posits a framework that identifies the strategies and tactics necessary for GDPR compliance. Practitioners need to complement the framework with rigorous privacy risk management, i.e. conducting a privacy risk analysis, identifying strategies and tactics to address such risks, and preparing a privacy impact assessment that enhances the accountability and transparency of a blockchain.

Originality/value: With the increasingly strategic use of data by businesses and the countervailing growth of data privacy regulation, alternative technologies could provide businesses with a means to nurture trust with their customers regarding collected data. However, it is commonly assumed that the decentralized approach of blockchain technology cannot be applied to this business need. This paper posits a framework that enables a blockchain to be designed that follows the GDPR, thereby providing an alternative for businesses to collect customers’ data while ensuring their customers’ trust.
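One widely discussed tactic in this design space is to keep personal data encrypted off-chain, commit only a hash to the immutable ledger, and let the customer hold the key, so that destroying the key ("crypto-shredding") approximates the GDPR right to erasure. Below is a minimal sketch of that pattern; the class and field names are illustrative and not drawn from the paper's framework.

```python
# Sketch of the off-chain-storage / on-chain-commitment pattern.
# Requires the cryptography package.
import hashlib
from cryptography.fernet import Fernet

class CustomerRecord:
    def __init__(self, personal_data: bytes):
        self.key = Fernet.generate_key()   # held by the customer, not the business
        # Ciphertext lives in mutable off-chain storage.
        self.ciphertext = Fernet(self.key).encrypt(personal_data)

    def onchain_commitment(self) -> str:
        # Only this digest is written to the ledger: tamper-evident,
        # but it contains no recoverable personal data.
        return hashlib.sha256(self.ciphertext).hexdigest()

    def read(self) -> bytes:
        # Access requires the customer's key, giving them control.
        return Fernet(self.key).decrypt(self.ciphertext)

    def erase(self):
        # Crypto-shredding: without the key, both the off-chain ciphertext
        # and the on-chain digest become permanently unreadable.
        self.key = None

rec = CustomerRecord(b"alice@example.com")
print(rec.onchain_commitment())
print(rec.read())
rec.erase()  # the ledger entry remains, but the data is now irrecoverable
```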


2017
Author(s): Yulia Kolesnikova, Adam Lathrop, Bree Norlander, An Yan

Few research studies have quantitatively analyzed the metadata elements associated with scientific data reuse. Using metadata and dataset download rates from the National Snow and Ice Data Center (NSIDC), we address whether there are key indicators in data repository metadata that show a statistically significant correlation with a dataset’s download count, and whether we can predict data reuse using machine learning techniques. We used the download rate by unique IP addresses for individual datasets as our dependent variable and as a proxy for data reuse. Our analysis shows that the following metadata elements in NSIDC datasets are positively correlated with download rates: year of citation, number of data formats, number of contributors, number of platforms, number of spatial coverage areas, number of locations, and number of keywords. Our results are applicable to researchers and professionals working with data and add to the small body of work addressing metadata best practices for increasing the discoverability of data.
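As a hedged sketch of the kind of prediction task the abstract describes, the snippet below fits a regression model on numeric metadata features to predict unique-IP download counts. The feature names follow the abstract, but the CSV export and column names are hypothetical, and the model choice is illustrative rather than the authors'.

```python
# Illustrative download-count prediction from metadata features.
# Requires pandas and scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

features = ["citation_year", "n_formats", "n_contributors", "n_platforms",
            "n_spatial_coverages", "n_locations", "n_keywords"]

df = pd.read_csv("nsidc_metadata.csv")  # hypothetical metadata export

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["unique_ip_downloads"], random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out R^2 indicates how much of the variation in downloads the
# metadata features alone can explain.
print("R^2:", r2_score(y_test, model.predict(X_test)))
```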


2021
Author(s): Shelley Stall, Helen Glaves, Brooks Hanson, Kerstin Lehnert, Erin Robinson, ...

The Earth, space, and environmental sciences have made significant progress in awareness and implementation of policy and practice around the sharing of data, software, and samples. Specifically, the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS, https://copdess.org/) brings together data repositories and journals to discuss and address common challenges in support of more transparent and discoverable research and its supporting data. Since the inception of COPDESS in 2014 and the completion of the Enabling FAIR Data Project in 2019, work has continued on improving availability statements for data and software, as well as the corresponding citations.

As the broad research community continues to make progress on data and software management and sharing, COPDESS is focused on several key efforts. These include 1) supporting authors in identifying the most appropriate data repository for preservation, 2) validating that all manuscripts have data and software availability statements, 3) ensuring data and software citations are properly included and linked to the publication to support credit, and 4) encouraging adoption of best practices.

We will review the status of these current efforts around data and software sharing, the important role that repositories and researchers play in ensuring that automated credit and attribution elements are in place, and the recent publications on software citation guidance from the FORCE11 Software Citation Implementation Working Group.


2017, Vol. 1(2), pp. 115-123
Author(s): Yi Shen

Currently, we are witnessing the emergence and abundance of many different data repositories and archival systems for scientific data discovery, use, and analysis. With the burgeoning of available data-sharing platforms, this study addresses how scientists working in the fields of natural resources and environmental sciences navigate these diverse data sources, what their concerns and value propositions are toward multiple data discovery channels, and, most importantly, how they perceive the characteristics and compare the functionalities of different types of data repository systems. Through user community research with domain scientists on their data use dynamics and insights, this study provides strategies and discusses ideas on how to leverage these different platforms. Furthermore, it proposes a novel, top-down approach to the processes of searching, browsing, and visualizing for the dynamic exploration of environmental data.


2021, Vol. 11(1)
Author(s): Lisa-Marie Ohle, David Ellenberger, Peter Flachenecker, Tim Friede, Judith Haas, ...

In 2001, the German Multiple Sclerosis Society, facing a lack of data, founded the German MS Registry (GMSR) as a long-term data repository for MS healthcare research. Through a network of participating neurological centres from different healthcare sectors across Germany, the GMSR provides observational real-world data on long-term disease progression, sociodemographic factors, treatment, and the healthcare status of people with MS. This paper aims to illustrate the framework of the GMSR. The registry’s structure, design, and data quality processes, as well as its collaborations, are presented, and its dataset, status, and results are discussed. As of 8 January 2021, 187 centres from different healthcare sectors participate in the GMSR. Following upgrades to its infrastructure and dataset specification in 2014, more than 196,000 visits have been recorded, relating to more than 33,000 persons with MS (PwMS). The GMSR enables monitoring of PwMS in Germany, supports scientific research projects, and collaborates with national and international MS data repositories and initiatives. With its recent pharmacovigilance extension, it aligns with European Medicines Agency (EMA) recommendations and helps to ensure early detection of therapy-related safety signals.


Author(s):  
Johannes Hubert Stigler ◽  
Elisabeth Steiner

Research data repositories and data centres are becoming increasingly important infrastructures in academic research. This article introduces GAMS, a research data repository for the humanities, covering everything from its system architecture to its preservation and content policies. The challenges facing data centres and repositories, along with general and domain-specific approaches and solutions, are outlined. Special emphasis lies on the sustainability and long-term perspective of such infrastructures, not only at the technical but above all at the organisational and financial level.


2021, Vol. 16(1), p. 21
Author(s): Chung-Yi Hou, Matthew S. Mayernik

For research data repositories, web interfaces are usually the primary, if not the only, method that data users have to interact with repository systems. Data users often search, discover, understand, access, and sometimes use data directly through repository web interfaces. Given that sub-par user interfaces can reduce users’ ability to locate, obtain, and use data, it is important to consider how repositories’ web interfaces can be evaluated and improved to ensure useful and successful user interactions. This paper discusses how usability assessment techniques are being applied to improve the functioning of data repository interfaces at the National Center for Atmospheric Research (NCAR). At NCAR, a new suite of data system tools, collectively called the NCAR Digital Asset Services Hub (DASH), is being developed. Usability evaluation techniques have been used throughout the NCAR DASH design and implementation cycles to ensure that the systems work well together for the intended user base. By applying user studies, paper prototyping, competitive analysis, journey mapping, and heuristic evaluation, the NCAR DASH Search and Repository experiences provide examples of how data systems can benefit from usability principles and techniques. Integrating usability principles and techniques into repository system design and implementation workflows helps to optimize the systems’ overall user experience.


2017, Vol. 12(1), pp. 88-105
Author(s): Sünje Dallmeier-Tiessen, Varsha Khodiyar, Fiona Murphy, Amy Nurnberger, Lisa Raymond, ...

The data curation community has long encouraged researchers to document collected research data during the active stages of the research workflow, to provide robust metadata earlier, and to support research data publication and preservation. Data documentation with robust metadata is one of a number of steps in effective data publication. Data publication is the process of making digital research objects ‘FAIR’, i.e. findable, accessible, interoperable, and reusable: attributes increasingly expected by research communities, funders, and society. Research data publishing workflows are the means to that end. Currently, however, much published research data remains inconsistently and inadequately documented by researchers. Documenting data closer in time to data collection would help mitigate the high cost that repositories associate with the ingest process. More effective data publication and sharing should in principle result from early interactions between researchers and their selected data repository. This paper describes a short study undertaken by members of the Research Data Alliance (RDA) and World Data System (WDS) working group on Publishing Data Workflows. We present a collection of recent examples of data publication workflows that connect data repositories and publishing platforms with research activity ‘upstream’ of the ingest process. We re-articulate previous recommendations of the working group to account for the varied upstream service components and platforms that support the flow of contextual and provenance information downstream. These workflows should be open and loosely coupled to support interoperability, including with preservation and publication environments. Our recommendations aim to stimulate further work on researchers’ views of data publishing and the extent to which available services and infrastructure facilitate the publication of FAIR data. We also aim to stimulate further dialogue about, and definition of, the roles and responsibilities of research data services and platform providers for the ‘FAIRness’ of research data publication workflows themselves.

