metadata annotation

2022 ◽  
Vol 12 (2) ◽  
pp. 796
Author(s):  
Julia Sasse ◽  
Johannes Darms ◽  
Juliane Fluck

For all research data collected, data descriptions and information about the corresponding variables are essential for data analysis and reuse. To enable cross-study comparisons and analyses, semantic interoperability of metadata is one of the most important requirements. In the area of clinical and epidemiological studies, data collection instruments such as case report forms (CRFs), data dictionaries and questionnaires are critical for metadata collection. Even though data collection instruments are often created in a digital form, they are mostly not machine readable; i.e., they are not semantically coded. As a result, comparing data collection instruments is complex. The German project NFDI4Health is dedicated to the development of a national research data infrastructure for personal health data, and as such searches for ways to enhance semantic interoperability. Retrospective integration of semantic codes into study metadata is important, as ongoing or completed studies contain valuable information. However, this is labor-intensive and should be eased by software. To understand the market and find out which techniques and technologies support retrospective semantic annotation/enrichment of metadata, we conducted a literature review. In NFDI4Health, we identified basic requirements for semantic metadata annotation software in the biomedical field and in the context of the FAIR principles. Ten relevant software systems were summarized and aligned with those requirements. We concluded that despite active research on semantic annotation systems, no system meets all requirements. Consequently, further research and software development in this area are needed, as interoperability of data dictionaries, questionnaires and data collection tools is key to reusing and combining results from independent research studies.
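The idea of retrospectively attaching semantic codes to existing study metadata can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EX:` codes, labels and variable names are invented and belong to no real terminology, and the naive whole-word string matching stands in for whatever techniques an actual annotation system would use.

```python
import re

# Hypothetical terminology: code -> labels and synonyms (all invented for illustration).
TERMINOLOGY = {
    "EX:0001": ["body weight", "weight"],
    "EX:0002": ["body height", "height"],
    "EX:0003": ["systolic blood pressure", "sbp"],
}

def normalize(text):
    """Lowercase the text and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def annotate(variable_label):
    """Return codes whose label or synonym occurs as whole words in the variable label."""
    padded = f" {normalize(variable_label)} "
    hits = []
    for code, labels in TERMINOLOGY.items():
        if any(f" {normalize(label)} " in padded for label in labels):
            hits.append(code)
    return hits

# A toy data dictionary as it might appear in a completed study (hypothetical labels).
data_dictionary = ["Weight (kg)", "Height_cm", "SBP at baseline", "Favourite colour"]
annotations = {label: annotate(label) for label in data_dictionary}

for label, codes in annotations.items():
    print(label, "->", codes or "no match, needs manual curation")
```

Variables that receive no code are exactly the cases where manual curation remains necessary, which is why the abstract describes the task as labor-intensive.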


2021 ◽  
Author(s):  
Hyeoneui Kim ◽  
Jinsun Jung ◽  
Jisung Choi

BACKGROUND Dietary habits offer crucial information on one's health and form a considerable part of the Patient-Generated Health Data (PGHD). Dietary data are collected through various channels and formats; thus, interoperability is a significant challenge to reusing the data. The vast scope of dietary concepts and colloquial style of expression add difficulty to the standardization task. Common Data Elements (CDE) with metadata annotation and ontological structuring of dietary concepts address the interoperability issues of dietary data to some extent. However, making culture-specific dietary habits and questionnaire-based dietary assessment data interoperable remains a challenge that requires additional effort. OBJECTIVE The main goal of this study was to address the interoperability challenge in dietary concepts by combining ontological curation of dietary concepts and metadata annotation of questionnaire-based dietary data. Specifically, this study aimed to develop a Dietary Lifestyle Ontology (DILON) and to demonstrate the improved interoperability of questionnaire-based dietary data by annotating its main semantics with DILON. METHODS By analyzing 1158 dietary assessment data elements (367 in Korean and 791 in English), 515 dietary concepts were extracted and used to construct DILON. To demonstrate the utility of DILON in improving the interoperability of multi-cultural questionnaire-based dietary data, ten Competency Questions (CQs) were developed that identified data elements that share the same dietary topics and measurement qualities. As test cases, 68 dietary habit data elements from Korean and English questionnaires were instantiated and annotated with the dietary concepts in DILON. The competency questions were translated into Semantic Query-enhanced Web Rule Language (SQWRL), and the query results were reviewed for accuracy. RESULTS DILON was built with 260 concept classes and 486 instances and successfully validated with ontology validation tools. A small overlap (72 concepts) in the concepts extracted from the questionnaires in the two languages indicates the need to pay closer attention to representing culture-specific dietary concepts. The SQWRL queries reflecting the 10 CQs yielded the correct results. CONCLUSIONS Ensuring the interoperability of dietary lifestyle data is a demanding task due to its vast scope and variations in expression. This study demonstrated that, when combined with common data elements and semantic metadata annotation, ontology can effectively mediate the interoperability of dietary data generated in different cultural contexts and expressed in various styles.
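The kind of competency question described above, finding data elements that share the same dietary topic and measurement quality, can be illustrated with a small Python sketch. It is a plain-Python stand-in rather than SQWRL, and every identifier, concept name and annotation in it is hypothetical, not taken from DILON.

```python
from collections import defaultdict

# Hypothetical data elements, each annotated with a topic and a measurement quality.
data_elements = [
    {"id": "KR_Q12", "language": "ko", "topic": "VegetableIntake", "quality": "Frequency"},
    {"id": "EN_Q07", "language": "en", "topic": "VegetableIntake", "quality": "Frequency"},
    {"id": "EN_Q08", "language": "en", "topic": "VegetableIntake", "quality": "Amount"},
    {"id": "KR_Q30", "language": "ko", "topic": "KimchiIntake", "quality": "Frequency"},
]

def comparable_groups(elements):
    """Group element ids by (topic, quality); keep groups with more than one member."""
    groups = defaultdict(list)
    for element in elements:
        groups[(element["topic"], element["quality"])].append(element["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

groups = comparable_groups(data_elements)
for (topic, quality), ids in groups.items():
    print(f"{topic} / {quality}: {ids}")
```

In this toy data the Korean and English vegetable-frequency items end up in one group even though their source questionnaires differ in language, while the culture-specific item stays unmatched, mirroring the small cross-language overlap the abstract reports.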


2021 ◽  
Author(s):  
Susanne Kunis ◽  
Sebastian Hänsch ◽  
Christian Schmidt ◽  
Frances Wong ◽  
Caterina Strambio-De-Castillia ◽  
...  

GigaScience ◽  
2021 ◽  
Vol 10 (9) ◽  
Author(s):  
Hannes Wartmann ◽  
Sven Heins ◽  
Karin Kloiber ◽  
Stefan Bonn

Abstract Background Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs. Findings Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning–based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression–based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. Conclusion Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets through automated annotation, making them more searchable.
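For readers unfamiliar with Siamese-style architectures, the core idea is that one shared embedding is applied to both members of a sample pair and the distance between the two embeddings is penalized or rewarded. The sketch below shows this with a contrastive loss in plain Python. The loss choice, the toy weights and the sample vectors are all assumptions made for illustration; the abstract does not state which loss or layers the authors used.

```python
import math

def embed(x, weights):
    """Shared embedding: the same weight matrix is applied to every input."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def contrastive_loss(x1, x2, same_label, weights, margin=1.0):
    """Pull same-label pairs together; push different-label pairs at least `margin` apart."""
    d = distance(embed(x1, weights), embed(x2, weights))
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy 3 -> 2 projection that ignores the third input dimension, which here
# stands in for a dataset-specific bias (hypothetical values throughout).
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
liver_a = [1.0, 0.0, 5.0]   # "liver" sample from one dataset
liver_b = [1.0, 0.0, -3.0]  # "liver" sample from another dataset
brain = [0.0, 2.0, 5.0]     # "brain" sample

loss_same = contrastive_loss(liver_a, liver_b, True, weights)
loss_diff = contrastive_loss(liver_a, brain, False, weights)
print(loss_same, loss_diff)
```

Because the projection discards the bias dimension, the two liver samples coincide in the embedding space despite coming from different datasets, which is the intuition behind learning bias-invariant representations from sample pairs.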


2021 ◽  
Author(s):  
Mathias Walzer ◽  
David Garcia-Seisdedos ◽  
Ananth Prakash ◽  
Paul Brack ◽  
Peter Crowther ◽  
...  

The growing number of mass spectrometry proteomics datasets available in the public domain increasingly includes volumes generated from Data Independent Acquisition approaches, SWATH-MS in particular. Unlike Data Dependent Acquisition datasets, their re-use is limited, partly because of challenges in combining and using free software for analysis in the non-specialist laboratory. We introduce a (re-)analysis pipeline for SWATH-MS data available in the PRIDE database, which includes a harmonised combination of metadata annotation protocols, automated workflows for MS data, statistical analysis and results integration into the resource Expression Atlas. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available, reproducible and easy to update. To demonstrate its utility, we reanalysed 10 public DIA datasets, comprising 1,278 individual SWATH-MS runs, stored in PRIDE. The robustness of the analysis was evaluated and compared to the results obtained in the original publications. The final results were exported into Expression Atlas, making quantitative results from SWATH-MS experiments more widely available and integrated with results from other reanalysed proteomics and transcriptomics datasets.


2021 ◽  
Author(s):  
Klaas Jan van Wijk ◽  
Eric W Deutsch ◽  
Qi Sun ◽  
Zhi Sun ◽  
Tami Leppert ◽  
...  

We developed a new resource, the Arabidopsis PeptideAtlas (www.peptideatlas.org/builds/arabidopsis/), to address central questions about the Arabidopsis proteome, such as the significance of protein splice forms and post-translational modifications (PTMs), or simply to obtain reliable information about specific proteins. PeptideAtlas is based on published mass spectrometry (MS) analyses collected through ProteomeXchange and reanalyzed through a uniform processing and metadata annotation pipeline. All matched MS-derived peptide data are linked to spectral, technical and biological metadata. Nearly 40 million out of ~143 million MS/MS spectra were matched to the reference genome Araport11, identifying ~0.5 million unique peptides and 17,858 uniquely identified proteins (one isoform per gene) at the highest confidence level (FDR 0.0004; 2 non-nested peptides ≥ 9 aa each), assigned as canonical proteins, and 3,543 lower confidence proteins. Physicochemical protein properties were evaluated for targeted identification of unobserved proteins. Additional proteins and isoforms currently not in Araport11 were identified, generated from pseudogenes, alternative starts, stops and/or splice variants and sORFs; these features should be considered for updates to the Arabidopsis genome. Phosphorylation can be inspected through a sophisticated PTM viewer. This new PeptideAtlas is integrated with community resources including TAIR, tracks in JBrowse, PPDB and UniProtKB. Subsequent PeptideAtlas builds will incorporate millions more MS data.


2021 ◽  
Vol 2 (2) ◽  
pp. 113-136
Author(s):  
F. Batista ◽  
H. Moniz ◽  
I. Trancoso ◽  
N. Mamede ◽  
A. I. Mata

This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, such as lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis, training, and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages: the first integrates the relevant manual annotations into the speech recognition data, and the second further enriches that output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially put to use for automatic detection of punctuation marks and for capitalization recovery from speech data, it has also recently been used for studying the characterization of disfluencies in speech. It has already been applied to several domains of Portuguese corpora, and also to English and Spanish Broadcast News corpora.


2020 ◽  
Author(s):  
Hannes Wartmann ◽  
Sven Heins ◽  
Karin Kloiber ◽  
Stefan Bonn

Abstract Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Here we investigate RNA-seq metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-seq metadata. We show how our algorithm outperforms existing approaches as well as traditional deep learning methods for the prediction of tissue, sample source, and patient sex information across several large data repositories. By using a model architecture similar to Siamese networks, the algorithm is able to learn biases from datasets with few samples. Our domain adaptation approach achieves metadata annotation accuracies up to 12.3% better than a previously published method. Lastly, we provide a list of more than 10,000 novel tissue and sex label annotations for 8,495 unique SRA samples.

