Data archiving and meta-data – saving the data for future use

Abstract This chapter covers the main aspects of data archiving, as the last phase of data handling in the process of urban drainage and stormwater management metrology. Data archiving is the process of preparing and storing the data for future use, usually not executed by the personnel who acquired the data. A data archive (also known as a data repository) can be defined as storage of a selected subset of raw, processed, validated and resampled data, with descriptions and other meta-data, linked to simulation results, if there are any. A data archive should be equipped with tools for search and data extraction along with procedures for data management, in order to maintain the database quality for an extended period of time. It is recommended, mostly for security reasons, to separate (both in a physical and in a digital sense) the archive database from the working database. This chapter provides the reader with relevant information about the most important issues related to data archive design, the archiving process and data characteristics regarding archiving. Also, the importance of good and comprehensive meta-data is underlined throughout the chapter. The management of a data archive is evaluated with a special focus on predicting future resources needed to keep the archive updated, secure, available, and in compliance with legal demands and limitations. At the end, a set of recommendations for creating and maintaining a data archive in the scope of urban drainage is given.

Download Full-text

GMrepo: a database of curated and consistently annotated human gut metagenomes

Nucleic Acids Research ◽

10.1093/nar/gkz764 ◽

2019 ◽

Vol 48 (D1) ◽

pp. D545-D553 ◽

Cited By ~ 12

Author(s):

Sicheng Wu ◽

Chuqing Sun ◽

Yanze Li ◽

Teng Wang ◽

Longhao Jia ◽

...

Keyword(s):

Relevant Information ◽

Metagenomic Data ◽

Data Repository ◽

Healthy Controls ◽

Healthy Individuals ◽

Human Gut ◽

Meta Data ◽

Biologically Relevant ◽

Manual Curation ◽

State Of Art

Abstract GMrepo (data repository for Gut Microbiota) is a database of curated and consistently annotated human gut metagenomes. Its main purpose is to facilitate the reusability and accessibility of the rapidly growing human metagenomic data. This is achieved by consistently annotating the microbial contents of collected samples using state-of-art toolsets and by manual curation of the meta-data of the corresponding human hosts. GMrepo organizes the collected samples according to their associated phenotypes and includes all possible related meta-data such as age, sex, country, body-mass-index (BMI) and recent antibiotics usage. To make relevant information easier to access, GMrepo is equipped with a graphical query builder, enabling users to make customized, complex and biologically relevant queries. For example, to find (1) samples from healthy individuals of 18 to 25 years old with BMIs between 18.5 and 24.9, or (2) projects that are related to colorectal neoplasms, with each containing >100 samples and both patients and healthy controls. Precomputed species/genus relative abundances, prevalence within and across phenotypes, and pairwise co-occurrence information are all available at the website and accessible through programmable interfaces. So far, GMrepo contains 58 903 human gut samples/runs (including 17 618 metagenomes and 41 285 amplicons) from 253 projects concerning 92 phenotypes. GMrepo is freely available at: https://gmrepo.humangut.info.

Download Full-text

NoRCE: non-coding RNA sets cis enrichment tool

BMC Bioinformatics ◽

10.1186/s12859-021-04112-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gulden Olgun ◽

Afshan Nabi ◽

Oznur Tastan

Keyword(s):

Expression Patterns ◽

Target Prediction ◽

Enrichment Analysis ◽

Fruit Fly ◽

Relevant Information ◽

R Package ◽

Data Repository ◽

Biologically Relevant ◽

Gene Sets ◽

Data Files

Abstract Background While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint at a functional association. Results We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and expression data for coding & ncRNAs specific to cancer. Additionally, the users can utilize custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast. Conclusions NoRCE is a platform-independent, user-friendly, comprehensive R package that can be used to gain insight into the functional importance of a list of ncRNAs of any type. The tool offers flexibility to conduct the users’ preferred set of analyses by designing their own pipeline of analysis. NoRCE is available in Bioconductor and https://github.com/guldenolgun/NoRCE.

Download Full-text

Pharmacogenetics of Type 2 diabetes mellitus: A Systematic Review Protocol (Preprint)

10.2196/preprints.13605 ◽

2019 ◽

Author(s):

Hamid Reza Aghaei Meybodi ◽

Negar Sarhangi ◽

Anoosh Naghavi ◽

Marzieh Rahbaran ◽

Maryam Hassani Doabsari ◽

...

Keyword(s):

Systematic Review ◽

Type 2 Diabetes ◽

Data Extraction ◽

Relevant Information ◽

Risk Of Bias ◽

Cochrane Database ◽

Registration Number ◽

Newcastle Ottawa Scale ◽

Review Protocol

UNSTRUCTURED The objective of this systematic review is to determine the effect of genetic variants that associate with antidiabetic medications and their efficacy and toxicity in T2DM patients. The understanding may allow interventions for improving management of T2DM and later systematically evaluated in more in-depth studies. We will have performed a comprehensive search using PubMed, Scopus, EMBASE, Web of Sciences and Cochrane database from 1990 to 2018. Relevant journals and references of all included studies will be hand searched to find the additional studied. Eligible studies such as pharmacogenetics studies in terms of drug response and toxicity in the type 2 diabetes patients and performed just on human will be included. Data extraction and quality assessment will be carried out by two independent reviewers and disagreements will be resolved through third expert reviewer. Risk of bias will be assessed with the Cochrane Risk of Bias tool for randomized studies and Newcastle-Ottawa Scale (NOS) for observational Studies. Narrative synthesis will be conducted by the combination of key findings. The results of this study will be submitted to a peer-reviewed journal for publication and also presented at PROSPERO. We expect this review will provide highly relevant information for clinicians, pharmaceutical industry that will benefit from the summary of the best available data regarding the efficacy of antidiabetic medication in the aspect of pharmacogenetics. PROSPERO Registration number (CRD42018104843)

Download Full-text

Extracting Data from Legacy Taxonomic Literature: Applications for planning field work

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37082 ◽

2019 ◽

Vol 3 ◽

Author(s):

Francisco Andres Rivera-Quiroz ◽

Jeremy Miller

Keyword(s):

Data Extraction ◽

Field Work ◽

Abundant Species ◽

Extraction Process ◽

Biological Data ◽

Knowledge Generation ◽

Data Repository ◽

Full Potential ◽

Biodiversity Data ◽

Open Access Journals

Traditional taxonomic publications have served as a biological data repository accumulating vast amounts of data on species diversity, geographical and temporal distributions, ecological interactions, taxonomic relations, among many other types of information. However, the fragmented nature of taxonomic literature has made this data difficult to access and use to its full potential. Current anthropogenic impact on biodiversity demands faster knowledge generation, but also making better use of what we already have. This could help us make better-informed decisions about conservation and resources management. In past years, several efforts have been made to make taxonomic literature more mobilized and accessible. These include online publications, open access journals, the digitization of old paper literature and improved availability through online specialized repositories such as the Biodiversity Heritage Library (BHL) and the World Spider Catalog (WSC), among others. Although easy to share, PDF publications still have most of their biodiversity data embedded in strings of text making them less dynamic and more difficult or impossible to read and analyze without a human interpreter. Recently developed tools as GoldenGATE-Imagine (GGI) allow transforming PDFs in XML files that extract and categorize taxonomically relevant data. These data can then be aggregated in databases such as Plazi TreatmentBank, where it can be re-explored, queried and analyzed. Here we combined several of these cybertaxonomic tools to test the data extraction process for one potential application: the design and planning of an expedition to collect fresh material in the field. We targeted the ground spider Teutamus politus and other related species from the Teutamus group (TG) (Araneae; Liocranidae). These spiders are known from South East Asia and have been cataloged in the family Liocranidae; however, their relations, biology and evolution are still poorly understood. We marked-up 56 publications that contained taxonomic treatments with specimen records for the Liocranidae. Of these publications, 20 contained information on members of the TG. Geographical distributions and occurrences of 90 TG species were analyzed based on 1,309 specimen records. These data were used to design our field collection in a way that allowed us to optimize the collection of adult specimens of our target taxa. The TG genera were most common in Indonesia, Thailand and Malaysia. From these, Thailand was the second richest but had the most records of T. politus. Seasonal distribution of TG specimens in Thailand suggested June and July as the best time for collecting adults. Based on these analyses, we decided to sample from mid-July to mid-August 2018 in the three Thai provinces that combined most records of TG species and T. politus. Relying on the results of our literature analyses and using standard collection methods for ground spiders, we captured at least one specimen of every TG genus reported for Thailand. Our one-month expedition captured 231 TG spiders; from these, T. politus was the most abundant species with 188 specimens (95 adults). By comparison, a total of 196 specimens of the TG and 66 of T. politus had been reported for the same provinces in the last 40 years. Our sampling greatly increased the number of available specimens, especially for the genera Teutamus and Oedignatha. Also, we extended the known distribution of Oedignatha and Sesieutes within Thailand. These results illustrate the relevance of making biodiversity data contained within taxonomic treatments accessible and reusable. It also exemplifies one potential use of taxonomic legacy data: to more efficiently use existing biodiversity data to fill knowledge gaps. A similar approach can be used to study neglected or interesting taxa and geographic areas, generating a better biodiversity documentation that could aid in decision making, management and conservation.

Download Full-text

Current Applications of Flow Cytometry in the Diagnosis of Primary Immunodeficiency Diseases

Archives of Pathology & Laboratory Medicine ◽

10.5858/2004-128-23-caofci ◽

2004 ◽

Vol 128 (1) ◽

pp. 23-31

Author(s):

Orieji C. Illoh

Keyword(s):

Flow Cytometry ◽

Primary Immunodeficiency ◽

Laboratory Tests ◽

Data Extraction ◽

Relevant Information ◽

Primary Immunodeficiency Disease ◽

Data Sources ◽

Primary Immunodeficiency Diseases ◽

Immunodeficiency Diseases ◽

Immunodeficiency Disease

Abstract Context.—To review the applications of flow cytometry in the diagnosis and management of primary immunodeficiency disease. Data Sources.—Articles describing the use of flow cytometry in the diagnosis of several primary immunodeficiency diseases were obtained through the National Library of Medicine database. Study Selection.—Publications that described novel and known applications of flow cytometry in primary immunodeficiency disease were selected. Review articles were included. Articles describing the different immunodeficiency diseases and methods of diagnosis were also selected. Data Extraction.—Approximately 100 data sources were analyzed, and those with the most relevant information were selected. Data Synthesis.—The diagnosis of many primary immunodeficiency diseases requires the use of several laboratory tests. Flow cytometry has become an important part of the workup of individuals suspected to have such a disorder. Knowledge of the pathogenesis of many of these diseases continues to increase, hence we acquire a better understanding of the laboratory tests that may be helpful in diagnosis. Conclusions.—Flow cytometry is applicable in the initial workup and subsequent management of several primary immunodeficiency diseases. As our understanding of the pathogenesis and management of these diseases increases, the use of many of these assays may become routine in hospitals.

Download Full-text

Analyze Physical Design Process Using Big Data Tool

International Journal of Software Science and Computational Intelligence ◽

10.4018/ijssci.2015040102 ◽

2015 ◽

Vol 7 (2) ◽

pp. 31-49 ◽

Cited By ~ 3

Author(s):

Waseem Ahmed ◽

Lisa Fan

Keyword(s):

Design Process ◽

Data Extraction ◽

Daily Basis ◽

Physical Design ◽

Data Repository ◽

Chip Design ◽

Asic Design ◽

Design Engineers ◽

Statistical Representation ◽

And Performance

Physical Design (PD) Data tool is designed mainly to help ASIC design engineers in achieving chip design process quality, optimization and performance measures. The tool uses data mining techniques to handle the existing unstructured data repository. It extracts the relevant data and loads it into a well-structured database. Data archive mechanism is enabled that initially creates and then keeps updating an archive repository on a daily basis. The logs information provide to PD tool is a completely unstructured format which parse by regular expression (regex) based data extraction methodology. It converts the input data into the structured tables. This undergoes the data cleansing process before being fed into the operational DB. PD tool also ensures data integrity and data validity. It helps the design engineers to compare, correlate and inter-relate the results of their existing work with the ones done in the past which gives them a clear picture of the progress made and deviations that occurred. Data analysis can be done using various features offered by the tool such as graphical and statistical representation.

Download Full-text

Meta Data Repository

10.1007/springerreference_63929 ◽

2011 ◽

Keyword(s):

Data Repository ◽

Meta Data

Download Full-text

Data2paper: Giving Researchers Credit for Their Data

Publications ◽

10.3390/publications7020036 ◽

2019 ◽

Vol 7 (2) ◽

pp. 36 ◽

Cited By ~ 1

Author(s):

Neil Jefferies ◽

Fiona Murphy ◽

Anusha Ranganathan ◽

Hollydawn Murray

Keyword(s):

Data Management ◽

Management Practice ◽

Data Repository ◽

Data Archiving ◽

Good Data ◽

Submission Process ◽

The Individual ◽

Paper Submission ◽

Do So

Initially funded as part of the Jisc Data Spring Initiative, a team of stakeholders (publishers, data repository managers, coders) has developed a simple workflow to streamline data paper submission. Metadata about a dataset in a data repository is combined with ORCID metadata about the author to automate and thus greatly reduce the friction of the submission process. Funders are becoming more interested in good data management practice, and institutions are developing repositories to hold the data outputs of their researchers, reducing the individual burden of data archiving. However, to date only a subset of the data produced is associated with publications and thus reliably archived, shared and re-used. This represents a loss of knowledge, leading to the repetition of research (especially in the case of negative observations) and wastes resources. It is laborious for time-poor researchers to fully describe their data via an associated article to maximise its utility to others, and there is little incentive for them to do so. Filling out diverse submission forms, for the repository and journal(s), makes things even lengthier. The app makes the process of associating and publishing data with a detailed description easier, with corresponding citation potential and credit benefits.

Download Full-text

Development of an Automated, Real Time Surveillance Tool for Predicting Readmissions at a Community Hospital

Applied Clinical Informatics ◽

10.4338/aci-2012-12-ra-0058 ◽

2013 ◽

Vol 04 (02) ◽

pp. 153-169 ◽

Cited By ~ 6

Author(s):

R. Gildersleeve ◽

P. Cooper

Keyword(s):

Real Time ◽

Community Hospital ◽

Validation Cohort ◽

Data Extraction ◽

Characteristic Curve ◽

Hospital Setting ◽

Data Repository ◽

Derivation Cohort ◽

Effective Interventions ◽

Patients At Risk

SummaryBackground: The Centers for Medicare and Medicaid Services’ Readmissions Reduction Program adjusts payments to hospitals based on 30-day readmission rates for patients with acute myocardial infarction, heart failure, and pneumonia. This holds hospitals accountable for a complex phenomenon about which there is little evidence regarding effective interventions. Further study may benefit from a method for efficiently and inexpensively identifying patients at risk of readmission. Several models have been developed to assess this risk, many of which may not translate to a U.S. community hospital setting.Objective: To develop a real-time, automated tool to stratify risk of 30-day readmission at a semi-rural community hospital.Methods: A derivation cohort was created by extracting demographic and clinical variables from the data repository for adult discharges from calendar year 2010. Multivariate logistic regression identified variables that were significantly associated with 30-day hospital readmission. Those variables were incorporated into a formula to produce a Risk of Readmission Score (RRS). A validation cohort from 2011 assessed the predictive value of the RRS. A SQL stored procedure was created to calculate the RRS for any patient and publish its value, along with an estimate of readmission risk and other factors, to a secure intranet site.Results: Eleven variables were significantly associated with readmission in the multivariate analysis of each cohort. The RRS had an area under the receiver operating characteristic curve (c-statistic) of 0.74 (95% CI 0.73-0.75) in the derivation cohort and 0.70 (95% CI 0.69-0.71) in the validation cohort.Conclusion: Clinical and administrative data available in a typical community hospital database can be used to create a validated, predictive scoring system that automatically assigns a probability of 30-day readmission to hospitalized patients. This does not require manual data extraction or manipulation and uses commonly available systems. Additional study is needed to refine and confirm the findings.Citation: Gildersleeve R, Cooper P. Development of an automated, real time surveillance tool for predicting readmissions at a community hospital. Appl Clin Inf 2013; 4: 153–169http://dx.doi.org/10.4338/ACI-2012-12-RA-0058

Download Full-text

Exposing scholarly information as Linked Open Data: RDFizing DSpace contents

The Electronic Library ◽

10.1108/el-12-2012-0156 ◽

2014 ◽

Vol 32 (6) ◽

pp. 834-851 ◽

Cited By ~ 13

Author(s):

Nikolaos Konstantinou ◽

Dimitrios-Emmanuel Spanos ◽

Nikos Houssos ◽

Nikolaos Mitrou

Keyword(s):

Semantic Web ◽

Design Methodology ◽

State Of The Art ◽

Open Data ◽

Linked Open Data ◽

Data Repository ◽

Institutional Repository ◽

Meta Data ◽

Content Type ◽

Compliant Approach

Purpose – This paper aims to introduce a transformation engine which can be used to convert an existing institutional repository installation into a Linked Open Data repository. Design/methodology/approach – The authors describe how the data that exist in a DSpace repository can be semantically annotated to serve as a Semantic Web (meta)data repository. Findings – The authors present a non-intrusive, standards-compliant approach that can run alongside with current practices, while incorporating state-of-the art methodologies. Originality/value – Also, they propose a set of mappings between domain vocabularies that can be (re)used towards this goal, thus offering an approach that covers both the technical and semantic aspects of the procedure.

Download Full-text