Curation of Laboratory Experimental Data as Part of the Overall Data Lifecycle

2008 ◽  
Vol 3 (1) ◽  
pp. 44-62 ◽  
Author(s):  
Jeremy Frey

The explosion in the production of scientific data in recent years is placing strains upon conventional systems supporting integration, analysis, interpretation and dissemination of data and thus constraining the whole scientific process. Support for handling large quantities of diverse information can be provided by e-Science methodologies and the cyber-infrastructure that enables collaborative handling of such data. Regard needs to be taken of the whole process involved in scientific discovery. This includes the consideration of the requirements of the users and consumers further down the information chain and what they might ideally prefer to impose on the generators of those data. As the degree of digital capture in the laboratory increases, it is possible to improve the automatic acquisition of the ‘context of the data’ as well as the data themselves. This process provides an opportunity for the data creators to ensure that many of the problems they often encounter in later stages are avoided. We wish to elevate curation to an operation to be considered by the laboratory scientist as part of good laboratory practice, not a procedure of concern merely to the few specialising in archival processes. Designing curation into experiments is an effective solution to the provision of high-quality metadata that leads to better, more re-usable data and to better science.
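To make the idea of designing curation into the experiment concrete, the following is a minimal sketch (not taken from the paper) of capturing the 'context of the data' automatically at acquisition time by writing a JSON metadata sidecar next to each raw data file. The function name and metadata fields are illustrative assumptions, not a published schema.

```python
# Minimal sketch: record acquisition context alongside the raw data at capture time.
# All field names here are illustrative assumptions, not a published schema.
import json
import getpass
import platform
from datetime import datetime, timezone
from pathlib import Path

def save_with_context(raw_bytes, out_dir, instrument, operator=None):
    """Write raw data plus a JSON sidecar capturing who/what/when."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    data_path = out / f"run_{stamp}.dat"
    data_path.write_bytes(raw_bytes)

    context = {
        "acquired_utc": stamp,
        "instrument": instrument,
        "operator": operator or getpass.getuser(),
        "host": platform.node(),
        "file": data_path.name,
        "size_bytes": len(raw_bytes),
    }
    sidecar = data_path.with_name(data_path.stem + ".meta.json")
    sidecar.write_text(json.dumps(context, indent=2))
    return data_path

if __name__ == "__main__":
    save_with_context(b"\x00\x01\x02", "lab_archive", instrument="UV-Vis-01")
```

Capturing this context at the moment of acquisition, rather than reconstructing it later, is the kind of low-cost habit the abstract argues should count as good laboratory practice.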

2014 ◽  
Vol 22 (2) ◽  
pp. 173-185 ◽  
Author(s):  
Eli Dart ◽  
Lauren Rotman ◽  
Brian Tierney ◽  
Mary Hester ◽  
Jason Zurawski

The ever-increasing scale of scientific data has become a significant challenge for researchers who rely on networks to interact with remote computing systems and transfer results to collaborators worldwide. Despite the availability of high-capacity connections, scientists struggle with inadequate cyberinfrastructure that cripples data transfer performance and impedes scientific progress. The Science DMZ paradigm comprises a proven set of network design patterns that collectively address these problems for scientists. We explain the Science DMZ model, including the network architecture, system configuration, cybersecurity, and performance tools that together create an optimized network environment for science. We describe use cases from universities, supercomputing centers, and research laboratories, highlighting the effectiveness of the Science DMZ model in diverse operational settings. In all, the Science DMZ model is a solid platform that supports any science workflow and flexibly accommodates emerging network technologies. As a result, the Science DMZ vastly improves collaboration, accelerating scientific discovery.
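As an illustration of the kind of end-to-end performance checking that Science DMZ deployments automate with dedicated measurement hosts (for example via perfSONAR), here is a self-contained, memory-to-memory TCP throughput probe. It is a generic sketch run over localhost for demonstration, not part of the Science DMZ reference design.

```python
# Generic TCP throughput probe (illustrative only, run over localhost here).
import socket
import threading
import time

CHUNK = 1024 * 1024          # 1 MiB send buffer
DURATION = 2.0               # seconds to transmit

def sink(server_sock):
    """Accept one connection and discard everything it sends."""
    conn, _ = server_sock.accept()
    with conn:
        while conn.recv(65536):
            pass

def measure(host, port):
    """Send data for DURATION seconds and return achieved throughput in Gbit/s."""
    payload = b"\x00" * CHUNK
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port)) as s:
        while time.monotonic() - start < DURATION:
            s.sendall(payload)
            sent += len(payload)
    elapsed = time.monotonic() - start
    return sent * 8 / elapsed / 1e9

if __name__ == "__main__":
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))       # ephemeral port
    srv.listen(1)
    port = srv.getsockname()[1]
    threading.Thread(target=sink, args=(srv,), daemon=True).start()
    print(f"approx. throughput: {measure('127.0.0.1', port):.2f} Gbit/s")
```

In a real deployment such probes run between dedicated test hosts at each end of the path, so that poor transfer performance can be attributed to the network rather than to end-system tuning.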


Author(s):  
Francesco Gagliardi

The author introduces a machine learning system for cluster analysis to take on the problem of syndrome discovery in the clinical domain. A syndrome is a set of typical clinical features (a prototype) that appear together often enough to suggest that they may represent a single, unknown disease. The discovery of syndromes and the formation of the related taxonomy are therefore the critical early phase of the process of scientific discovery in the medical domain. The proposed system discovers syndromes following Eleanor Rosch's prototype theory of how the human mind categorizes and forms taxonomies, with the dual aims of understanding how humans perform these activities and of automating or assisting the process of scientific discovery. The implemented system can be considered a scientific discovery support system, as it can discover unknown syndromes to the benefit of subsequent clinical practice and research activities.
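A hedged illustration of the underlying idea, not the author's system: the sketch below applies a generic centroid-based clusterer (scikit-learn KMeans) to synthetic clinical features and reads each cluster prototype as a candidate syndrome.

```python
# Illustrative stand-in only: a generic clusterer over synthetic clinical signs,
# with cluster prototypes (centroids) interpreted as candidate syndromes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic binary feature matrix: rows = patients, columns = clinical signs.
signs = ["fever", "rash", "joint_pain", "fatigue", "cough"]
group_a = rng.random((40, 5)) < [0.9, 0.8, 0.1, 0.6, 0.1]   # one latent "syndrome"
group_b = rng.random((40, 5)) < [0.2, 0.1, 0.9, 0.7, 0.8]   # another
X = np.vstack([group_a, group_b]).astype(float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each centroid is a prototype: the typical profile of signs for that cluster.
for k, centroid in enumerate(model.cluster_centers_):
    typical = [s for s, p in zip(signs, centroid) if p > 0.5]
    print(f"candidate syndrome {k}: typical signs = {typical}")
```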


2021 ◽  
Author(s):  
Josh Moore ◽  
Chris Allan ◽  
Sebastien Besson ◽  
Jean-marie Burel ◽  
Erin Diel ◽  
...  

Biological imaging is one of the most innovative fields in the modern biological sciences. New imaging modalities, probes, and analysis tools appear every few months and often prove decisive for enabling new directions in scientific discovery. One feature of this dynamic field is the need to capture new types of data and data structures. While there is a strong drive to make scientific data Findable, Accessible, Interoperable and Reusable (FAIR, 1), the rapid rate of innovation in imaging impedes the unification and adoption of standardized data formats. Despite this, the opportunities for sharing and integrating bioimaging data and, in particular, linking these data to other "omics" datasets have never been greater; therefore, to every extent possible, increasing the "FAIRness" of bioimaging data is critical for maximizing scientific value, as well as for promoting openness and integrity. In the absence of a common, FAIR format, two approaches have emerged to provide access to bioimaging data: translation and conversion. On-the-fly translation produces a transient representation of bioimage metadata and binary data but must be repeated on each use. In contrast, conversion produces a permanent copy of the data, ideally in an open format that makes the data more accessible and improves performance and parallelization in reads and writes. Both approaches have been implemented successfully in the bioimaging community, but both have limitations. At cloud scale, those shortcomings limit scientific analysis and the sharing of results. We introduce here next-generation file formats (NGFF) as a solution to these challenges.
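To illustrate the conversion approach in the simplest terms, the sketch below writes an image volume once into a chunked Zarr store. It uses plain Zarr and an invented attribute, not the full OME-NGFF layout or its metadata conventions.

```python
# Minimal sketch of "conversion": write image data once into a chunked, cloud-friendly
# store. Plain Zarr only; the OME-NGFF layout adds specific metadata conventions on top.
import numpy as np
import zarr

# Stand-in for a 3D microscopy volume (z, y, x).
image = np.random.randint(0, 2**16, size=(64, 512, 512), dtype=np.uint16)

# Chunked storage lets readers fetch only the tiles they need, in parallel.
z = zarr.open(
    "converted_image.zarr",
    mode="w",
    shape=image.shape,
    chunks=(16, 128, 128),
    dtype=image.dtype,
)
z[:] = image
z.attrs["axes"] = ["z", "y", "x"]      # illustrative metadata, not the NGFF schema

print(z.shape, z.chunks)
```

The design choice to favour many small chunks is what improves parallel reads and writes at cloud scale, at the cost of producing a permanent second copy of the data.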


2011 ◽  
Vol 12 (1) ◽  
Author(s):  
Elizabeth K Nelson ◽  
Britt Piehler ◽  
Josh Eckels ◽  
Adam Rauch ◽  
Matthew Bellew ◽  
...  

2017 ◽  
Author(s):  
Chengyu Liu ◽  
Wei Wang

Machine learning algorithms such as linear regression, SVMs, and neural networks have played an increasingly important role in the process of scientific discovery. However, none of them is both interpretable and accurate on nonlinear datasets. Here we present contextual regression, a method that joins these two desirable properties together using a hybrid architecture of a neural network embedding and a dot product layer. We demonstrate its high prediction accuracy and sensitivity through the task of predictive feature selection on a simulated dataset and the application of predicting open chromatin sites in the human genome. On the simulated data, our method achieved high-fidelity recovery of feature contributions under random noise levels of up to ±200%. On the open chromatin dataset, the application of our method not only outperformed the state-of-the-art method in terms of accuracy, but also unveiled two previously unreported open chromatin-related histone marks. Our method fills the gap of accurate and interpretable nonlinear modeling in scientific data mining tasks.
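The following sketch is inferred from the abstract rather than taken from the authors' code: an embedding network predicts one weight per input feature, and the prediction is the dot product of those weights with the features, so the returned weights expose per-feature contributions. All layer sizes and the toy training data are assumptions.

```python
# Sketch of the contextual-regression idea (assumed architecture, not the released code):
# an embedding network outputs per-feature weights; prediction = dot product with input.
import torch
import torch.nn as nn

class ContextualRegression(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        # Embedding network: maps each sample to a vector of feature weights.
        self.context_net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x):
        weights = self.context_net(x)                    # context: one weight per feature
        y_hat = (weights * x).sum(dim=1, keepdim=True)   # dot-product layer
        return y_hat, weights                            # weights expose contributions

if __name__ == "__main__":
    torch.manual_seed(0)
    X = torch.randn(256, 10)
    y = (X[:, 0] * 2 - X[:, 3]).unsqueeze(1)             # toy target
    model = ContextualRegression(10)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        pred, _ = model(X)
        loss = nn.functional.mse_loss(pred, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final MSE: {loss.item():.4f}")
```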


2017 ◽  
Vol 1 (2) ◽  
pp. 32-44
Author(s):  
Jiao Li ◽  
Si Zheng ◽  
Hongyu Kang ◽  
Zhen Hou ◽  
Qing Qian

Purpose: In the open science era, it is typical to share project-generated scientific data by depositing it in an open and accessible database, while the corresponding scientific publications are preserved in a digital library archive. It is challenging to identify the data usage mentioned in the literature and to associate it with its source. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis.

Design/methodology/approach: We focused on identifying articles using the TCGA dataset and constructing linkages between the articles and the specific TCGA dataset. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used the TCGA data in their studies, and we summarized the key features of the benchmark set. Third, the key features were applied to the remaining PMC full-text articles.

Findings: The number of publications that use TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. Additionally, we found that the critical areas of focus in the studies using TCGA data were glioblastoma multiforme, lung cancer, and breast cancer, and that data from the RNA-sequencing (RNA-seq) platform are the most frequently preferred.

Research limitations: The current workflow to identify articles that truly used TCGA data is labor-intensive. An automatic method is expected to improve performance.

Practical implications: This study will help cancer genomics researchers determine the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery.

Originality/value: Few studies have been conducted to investigate data usage by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC, and we created links between the full-text articles and the source data.
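A rough sketch of the collection step only; the search term and retrieval limit are assumptions, not the authors' exact strategy. It queries the NCBI E-utilities esearch endpoint for PMC articles that mention TCGA.

```python
# Illustrative collection step: find PMC articles mentioning TCGA via NCBI E-utilities.
# The query term below is an assumption, not the authors' exact search strategy.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "pmc",
    "term": '"The Cancer Genome Atlas" OR TCGA',
    "retmode": "json",
    "retmax": 100,
}
resp = requests.get(ESEARCH, params=params, timeout=30)
resp.raise_for_status()
result = resp.json()["esearchresult"]

print("total hits:", result["count"])
print("first PMC IDs:", result["idlist"][:10])
```

The harder, labor-intensive step the abstract describes, deciding which of these hits truly used TCGA data rather than merely citing the project, is not shown here.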


2020 ◽  
Author(s):  
Alexandra Kokkinaki ◽  
Justin Buck ◽  
Emma Slater ◽  
Julie Collins ◽  
Raymond Cramer ◽  
...  

Ocean data are expensive to collect. Data reuse saves time and accelerates the pace of scientific discovery. For data to be reusable, the FAIR principles reassert the need for rich metadata and documentation that meet relevant community standards and provide information about provenance.

Approaches to sensor observations are often inadequate at meeting FAIR: they are prescriptive, with a limited set of attributes, and make little or no provision for the important metadata about sensor observations that arise later in the data lifecycle.

As part of the EU ENVRIplus project, our work aimed at capturing the delayed-mode data curation process taking place at the National Oceanography Centre's British Oceanographic Data Centre (BODC). Our solution uses unique URIs, OGC SWE standards, and controlled vocabularies, commencing with the originator's submitted input and ending with the archived and published dataset.

The BODC delayed-mode process is an example of a physical system composed of several components, such as sensors and other computational processes, for example an algorithm to compute salinity or absolute winds. All components are described in SensorML, identified by unique URIs, and associated with the relevant data streams, which in turn are exposed on the web via ERDDAP using unique URIs.

In this paper we share our experience in using OGC standards and ERDDAP to model the above-mentioned process and publish the associated datasets in a unified way. The benefits attained allow greater automation of data transfer, easy access to large volumes of data from a chosen sensor, more precise capture of data provenance, and standardization, and they pave the way towards greater FAIRness of the sensor data and metadata, with a focus on delayed-mode processing.
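The snippet below illustrates the generic ERDDAP tabledap request pattern by which each published data stream becomes reachable at a unique URI. The server URL, dataset ID, and variable names are hypothetical placeholders, not BODC endpoints.

```python
# Illustrative only: server URL, dataset ID, and variable names are hypothetical.
# The pattern shown is the generic ERDDAP tabledap request style.
import requests

ERDDAP_BASE = "https://example.org/erddap"          # placeholder server
DATASET_ID = "example_sensor_datastream"            # placeholder dataset

# tabledap query: select variables and constrain by time
# (in practice the '>' in the constraint may need percent-encoding as %3E).
url = (
    f"{ERDDAP_BASE}/tabledap/{DATASET_ID}.csv"
    "?time,sea_water_temperature,sea_water_practical_salinity"
    "&time>=2020-01-01T00:00:00Z"
)

resp = requests.get(url, timeout=60)
resp.raise_for_status()
print(resp.text.splitlines()[:3])   # header rows: variable names and units
```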


2004 ◽  
Vol 02 (02) ◽  
pp. 375-411 ◽  
Author(s):  
ZOÉ LACROIX ◽  
LOUIQA RASCHID ◽  
BARBARA A. ECKMAN

Today, scientific data are inevitably digitized, stored in a wide variety of formats, and accessible over the Internet. Scientific discovery increasingly involves accessing multiple heterogeneous data sources, integrating the results of complex queries, and applying further analysis and visualization applications in order to collect datasets of interest. Building a scientific integration platform to support these critical tasks requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web, as well as data that are locally materialized in warehouses or generated by software. The lack of efficiency of existing approaches can significantly affect the process, with lengthy delays while accessing critical resources or with the failure of the system to report any results. Some queries take so much time to be answered that their results are returned via email, making their integration with other results a tedious task. This paper presents several issues that need to be addressed to provide seamless and efficient integration of biomolecular data. Identified challenges include: capturing and representing the various domain-specific computational capabilities supported by a source, including sequence or text search engines and traditional query processing; developing a methodology to acquire and represent semantic knowledge and metadata about source contents, overlap in source contents, and access costs; and developing cost- and semantics-based decision support tools to select sources and capabilities and to generate efficient query evaluation plans.
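As a toy illustration of the cost- and capability-aware source selection the paper calls for (not the paper's own algorithm), the sketch below picks, among sources that support a required capability, the one with the lowest estimated access cost.

```python
# Toy sketch: choose the cheapest source that supports the required capability,
# the kind of decision a cost- and capability-aware query planner automates.
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    capabilities: set          # e.g. {"sequence_search", "keyword_search"}
    est_cost_seconds: float    # rough access-cost estimate

SOURCES = [
    Source("local_warehouse", {"keyword_search"}, 0.5),
    Source("web_flatfile_mirror", {"keyword_search", "sequence_search"}, 20.0),
    Source("remote_blast_service", {"sequence_search"}, 8.0),
]

def select_source(required):
    candidates = [s for s in SOURCES if required in s.capabilities]
    if not candidates:
        raise ValueError(f"no source supports {required!r}")
    return min(candidates, key=lambda s: s.est_cost_seconds)

print(select_source("sequence_search").name)   # -> remote_blast_service
```

A full planner would also weigh semantic overlap between sources and combine capabilities across several sources; this sketch shows only the single-source cost decision.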


2017 ◽  
Vol 11 (2) ◽  
pp. 87-103 ◽  
Author(s):  
Chung-Yi Hou ◽  
Matthew Mayernik

As scientific research and development become more collaborative, the diversity of skills and expertise involved in producing scientific data is expanding as well. Since recognition of contribution has significant academic and professional impact for participants in scientific projects, it is important to integrate attribution and acknowledgement of scientific contributions into the research and data lifecycle. However, defining and clarifying contributions and the relationship of specific individuals and organizations can be challenging, especially when balancing the needs and interests of diverse partners. Designing an implementation method for attributing scientific contributions within complex projects that allows ease of use and integration with existing documentation formats is another crucial consideration. To provide a versatile mechanism for organizing, documenting, and storing contributions to different types of scientific projects and their related products, an attribution and acknowledgement matrix and XML schema have been created as part of the Attribution and Acknowledgement Content Framework (AACF). Leveraging taxonomies of contribution roles and types that have been developed and published previously, the authors consolidated 16 contribution types that can be considered and used when crediting team members' contributions. Using these contribution types, specific information regarding the contributing organizations and individuals can be documented using the AACF. This paper provides the background and motivations for creating the current version of the AACF Matrix and Schema, followed by demonstrations of the process and the results of using the Matrix and the Schema to record the contribution information of different sample datasets. The paper concludes by highlighting the key feedback and features to be examined in order to improve the next revisions of the Matrix and the Schema.
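A hypothetical sketch of what an AACF-style record might look like: the element and attribute names below are invented for illustration and are not the published schema. It simply shows contributor roles being serialized to a machine-readable XML record.

```python
# Hypothetical example: element/attribute names are illustrative, not the AACF schema.
# Shows the general shape of recording who contributed what to a dataset as XML.
import xml.etree.ElementTree as ET

contributions = [
    {"name": "A. Researcher", "org": "Example University", "role": "data_collection"},
    {"name": "B. Engineer", "org": "Example Lab", "role": "software_development"},
]

root = ET.Element("attribution", {"dataset": "example-dataset-v1"})
for c in contributions:
    contrib = ET.SubElement(root, "contributor", {"organization": c["org"]})
    ET.SubElement(contrib, "name").text = c["name"]
    ET.SubElement(contrib, "contributionType").text = c["role"]

ET.indent(root)                        # pretty-print (Python 3.9+)
print(ET.tostring(root, encoding="unicode"))
```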


2000 ◽  
Vol 5 (6) ◽  
pp. 1-7
Author(s):  
Christopher R. Brigham ◽  
James B. Talmage ◽  
Leon H. Ensalada

The AMA Guides to the Evaluation of Permanent Impairment (AMA Guides), Fifth Edition, is available and includes numerous changes that will affect both the evaluators and the systems that use the AMA Guides. The Fifth Edition is nearly twice the size of its predecessor (613 pages vs 339 pages) and contains three additional chapters (the musculoskeletal system is now split into three chapters and the cardiovascular system into two). Table 1 shows how chapters in the Fifth Edition were reorganized from the Fourth Edition. In addition, each of the chapters is presented in a consistent format, as shown in Table 2. This article and subsequent issues of The Guides Newsletter will examine these changes; the present discussion focuses on major revisions, particularly those in the first two chapters. (See Table 3 for a summary of the revisions to the musculoskeletal and pain chapters.) Chapter 1, Philosophy, Purpose, and Appropriate Use of the AMA Guides, emphasizes objective assessment necessitating a medical evaluation. Most impairment percentages in the Fifth Edition are unchanged from the Fourth because the majority of ratings are currently accepted, there are limited scientific data to support changes, and ratings should not be changed arbitrarily. Chapter 2, Practical Application of the AMA Guides, describes how to use the AMA Guides for consistent and reliable acquisition, analysis, communication, and utilization of medical information through a single set of standards.

