Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data

Author(s): Björn Schembera

2021
Author(s): Klaas Jan van Wijk, Eric W Deutsch, Qi Sun, Zhi Sun, Tami Leppert, ...

We developed a new resource, the Arabidopsis PeptideAtlas (www.peptideatlas.org/builds/arabidopsis/), to address central questions about the Arabidopsis proteome, such as the significance of protein splice forms and post-translational modifications (PTMs), or simply to obtain reliable information about specific proteins. PeptideAtlas is based on published mass spectrometry (MS) analyses collected through ProteomeXchange and reanalyzed through a uniform processing and metadata annotation pipeline. All matched MS-derived peptide data are linked to spectral, technical and biological metadata. Nearly 40 million out of ~143 million MS/MS spectra were matched to the reference genome Araport11, identifying ~0.5 million unique peptides and 17,858 uniquely identified proteins (one isoform per gene) at the highest confidence level (FDR 0.0004; two non-nested peptides of ≥ 9 aa each), assigned as canonical proteins, along with 3,543 lower-confidence proteins. Physicochemical protein properties were evaluated for targeted identification of unobserved proteins. Additional proteins and isoforms currently not in Araport11 were identified, generated from pseudogenes, alternative starts, stops and/or splice variants, and sORFs; these features should be considered for updates to the Arabidopsis genome. Phosphorylation can be inspected through a sophisticated PTM viewer. This new PeptideAtlas is integrated with community resources including TAIR, tracks in JBrowse, PPDB and UniProtKB. Subsequent PeptideAtlas builds will incorporate millions more MS data.
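The stated confidence criteria (FDR ≤ 0.0004 and at least two non-nested peptides of ≥ 9 aa each) can be made concrete with a small illustrative filter. The sketch below is not the PeptideAtlas implementation; the PeptideHit fields, helper names and nesting check are assumptions chosen only to mirror those thresholds.

```python
# Illustrative sketch only: a simplified filter mimicking the kind of
# confidence criteria described above (protein-level FDR and at least two
# non-nested peptides of >= 9 aa). Field names are assumptions for
# illustration, not the PeptideAtlas implementation.
from dataclasses import dataclass

@dataclass
class PeptideHit:
    protein_id: str
    sequence: str
    start: int          # start position within the protein
    fdr: float          # peptide-level false discovery rate

def is_nested(a: PeptideHit, b: PeptideHit) -> bool:
    """True if one peptide's span lies entirely within the other's."""
    a_end, b_end = a.start + len(a.sequence), b.start + len(b.sequence)
    return (a.start >= b.start and a_end <= b_end) or \
           (b.start >= a.start and b_end <= a_end)

def canonical_proteins(hits, fdr_cutoff=0.0004, min_len=9):
    """Return protein IDs supported by >= 2 non-nested peptides of min_len aa."""
    by_protein = {}
    for h in hits:
        if h.fdr <= fdr_cutoff and len(h.sequence) >= min_len:
            by_protein.setdefault(h.protein_id, []).append(h)
    selected = set()
    for prot, peps in by_protein.items():
        # look for any pair of peptides that are not nested in each other
        for i in range(len(peps)):
            for j in range(i + 1, len(peps)):
                if not is_nested(peps[i], peps[j]):
                    selected.add(prot)
                    break
            if prot in selected:
                break
    return selected
```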


Author(s): April Ng, Marek Hatala

Competency-based learning has been used in training employees to acquire the necessary skills for an organization to be successful in a dynamic and ever-changing environment. One of the core activities in competency-based training is learning material acquisition. Standardization efforts have made the retrieval of educational materials, also called learning objects, easier by describing them in pre-defined metadata schemas. However, the existing standardized metadata schemas and learning object metadata annotation practices do not support automatic selection of resources according to the specific competency requirements of competency-based learning. We propose an ontology-based competency formalization approach as a way of representing competency-related information together with other metadata in an ontology, in order to enhance machine automation in resource retrieval. The approach represents a competency with the properties of definition, knowledge reference, evidence of proficiency, and level of proficiency. The effectiveness of resource selection based on each of these properties is evaluated.
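A minimal sketch of how such a formalization might look, using rdflib and an invented namespace; the property URIs simply mirror the four properties named in the abstract and are not the authors' actual ontology terms.

```python
# Minimal sketch, not the authors' ontology: encodes a competency with the
# four properties named in the abstract (definition, knowledge reference,
# evidence of proficiency, level of proficiency) as RDF triples, so that a
# learning object can be selected by matching competency requirements.
# The namespace and property URIs are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/competency#")

g = Graph()
comp = EX.RelationalDatabaseDesign

g.add((comp, RDF.type, EX.Competency))
g.add((comp, EX.definition, Literal("Can design a normalized relational schema")))
g.add((comp, EX.knowledgeReference, EX.Topic_Normalization))
g.add((comp, EX.evidenceOfProficiency, EX.Assessment_SchemaDesignTask))
g.add((comp, EX.levelOfProficiency, Literal("intermediate")))

# A learning object annotated with the competency it develops; a retrieval
# component can then match required competencies against this annotation.
lo = EX.LearningObject_42
g.add((lo, RDF.type, EX.LearningObject))
g.add((lo, EX.developsCompetency, comp))

print(g.serialize(format="turtle"))
```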


2016, Vol. 3 (2), pp. 136-151
Author(s): Boyi Xu, Ke Xu, LiuLiu Fu, Ling Li, Weiwei Xin, ...

2020
Author(s): Hannes Wartmann, Sven Heins, Karin Kloiber, Stefan Bonn

Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Here we investigate RNA-seq metadata prediction based on gene expression values. We present a deep-learning based domain adaptation algorithm for the automatic annotation of RNA-seq metadata. We show how our algorithm outperforms existing approaches as well as traditional deep learning methods for the prediction of tissue, sample source, and patient sex information across several large data repositories. By using a model architecture similar to Siamese networks, the algorithm is able to learn biases from datasets with few samples. Our domain adaptation approach achieves metadata annotation accuracies up to 12.3% better than a previously published method. Lastly, we provide a list of more than 10,000 novel tissue and sex label annotations for 8,495 unique SRA samples.
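The general idea can be sketched as follows. This is not the published model: the layer sizes, the contrastive margin and the exact loss combination are assumptions, used only to illustrate a Siamese-style encoder shared across two data repositories (domains) with a small head predicting a metadata label such as sex.

```python
# Minimal sketch (not the published model): a Siamese-style encoder maps
# gene-expression vectors from two different repositories (domains) into a
# shared embedding, with a small classification head for a metadata label.
# Layer sizes and the loss combination are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionEncoder(nn.Module):
    def __init__(self, n_genes: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

class MetadataClassifier(nn.Module):
    def __init__(self, n_genes: int, n_classes: int = 2, emb_dim: int = 64):
        super().__init__()
        self.encoder = ExpressionEncoder(n_genes, emb_dim)  # shared weights
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, x_a, x_b):
        # the same encoder processes samples from both domains (Siamese setup)
        z_a, z_b = self.encoder(x_a), self.encoder(x_b)
        return self.head(z_a), self.head(z_b), z_a, z_b

def training_loss(logits_a, logits_b, z_a, z_b, y_a, y_b, same_label):
    """Cross-entropy on both domains plus a contrastive-style term that pulls
    embeddings of same-label pairs together and pushes different-label pairs apart."""
    ce = F.cross_entropy(logits_a, y_a) + F.cross_entropy(logits_b, y_b)
    dist = F.pairwise_distance(z_a, z_b)
    margin = 1.0
    contrastive = torch.where(same_label, dist.pow(2),
                              F.relu(margin - dist).pow(2)).mean()
    return ce + contrastive
```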


2021
Author(s): Mathias Walzer, David Garcia-Seisdedos, Ananth Prakash, Paul Brack, Peter Crowther, ...

The rising number of mass spectrometry proteomics datasets available in the public domain increasingly includes data generated by Data Independent Acquisition (DIA) approaches, SWATH-MS in particular. Unlike Data Dependent Acquisition datasets, their re-use is limited, partly due to the challenges of combining them and of using free analysis software in the non-specialist laboratory. We introduce a (re-)analysis pipeline for SWATH-MS data available in the PRIDE database, which includes a harmonised combination of metadata annotation protocols, automated workflows for MS data, statistical analysis and results integration into the resource Expression Atlas. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available, reproducible and easy to update. To demonstrate its utility, we reanalysed 10 public DIA datasets, comprising 1,278 individual SWATH-MS runs, stored in PRIDE. The robustness of the analysis was evaluated and compared to the results obtained in the original publications. The final results were exported into Expression Atlas, making quantitative results from SWATH-MS experiments more widely available and integrated with results from other reanalysed proteomics and transcriptomics datasets.
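The published automation uses Nextflow with containerised tools; as a rough analogue only, the sketch below chains containerised steps over a set of runs from plain Python. The image names and tool commands are placeholders, not the actual containers or tools used in the pipeline.

```python
# Illustrative sketch only: the published pipeline is orchestrated with
# Nextflow, not Python. This shows the same general pattern of chaining
# containerised analysis steps over a set of MS runs via `docker run`.
# Image names and tool commands are placeholders.
import subprocess
from pathlib import Path

def run_step(image: str, command: list[str], data_dir: Path) -> None:
    """Run one containerised step with the data directory mounted at /data."""
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{data_dir.resolve()}:/data",
         image, *command],
        check=True,
    )

def reanalyse_run(raw_file: Path, data_dir: Path) -> None:
    # Placeholder steps for conversion, search and quantification; each step
    # reads and writes files under the mounted /data directory.
    run_step("example/msconvert:latest",
             ["convert", f"/data/{raw_file.name}"], data_dir)
    run_step("example/dia-search:latest",
             ["search", f"/data/{raw_file.stem}.mzML"], data_dir)
    run_step("example/quant:latest",
             ["quantify", f"/data/{raw_file.stem}.search"], data_dir)

if __name__ == "__main__":
    data = Path("./swath_runs")
    for raw in sorted(data.glob("*.raw")):
        reanalyse_run(raw, data)
```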


Author(s): Ichiro Kobayashi

At the annual conference of the Japan Society for Artificial Intelligence (JSAI), a special survival session called "Challenge for Realizing Early Profits (CREP)" is organized to support and promote excellent ideas in new AI technologies that are expected to be realized and to contribute to society within five years. Every year at the session, researchers propose their ideas and compete in evaluations by conference participants. The Everyday Language Computing (ELC) project, started in 2000 at the Brain Science Institute, RIKEN, and ended in 2005, participated in the CREP program in 2001 to have the project evaluated by third parties, and held an organized session every year in which those interested in language-based intelligence and personalization participated. The project competed with other candidates, survived the session, and achieved the session's final goal of surviving for five years.

The papers in this special issue, selected for presentation at the session, include the following. The first article, "Everyday-Language Computing Project Overview," by Ichiro Kobayashi et al., gives an overview and the basic technologies of the ELC Project. The second to sixth papers are related to the ELC Project. The second article, "Computational Models of Language Within Context and Context-Sensitive Language Understanding," by Noriko Ito et al., proposes a new database, called the "semiotic base," that compiles linguistic resources with contextual information, and an algorithm for achieving natural language understanding with the semiotic base. The third article, "Systemic-Functional Context-Sensitive Text Generation in the Framework of Everyday Language Computing," by Yusuke Takahashi et al., proposes an algorithm to generate texts with the semiotic base. The fourth article, "Natural Language-Mediated Software Agentification," by Michiaki Iwazume et al., proposes a method for agentifying and verbalizing existing software applications, together with a scheme for operating/running them. The fifth article, "Smart Help for Novice Users Based on Application Software Manuals," by Shino Iwashita et al., proposes a new framework for reusing the electronic software manuals that accompany application software to provide tailor-made operation instructions to users. The sixth article, "Programming in Everyday Language: A Case for Email Management," by Toru Sugimoto et al., describes making a computer program written in natural language; rhetorical structure analysis is used to translate the natural language command structure into the program structure. The seventh article, "Application of Paraphrasing to Programming with Linguistic Expressions," by Nozomu Kaneko et al., proposes a method for translating natural language commands into a computer program through a natural language paraphrasing mechanism. The eighth article, "A Human Interface Based on Linguistic Metaphor and Intention Reasoning," by Koichi Yamada et al., proposes a new human interface paradigm called Push Like Talking (PLT), which enables people to operate machines as they talk. The ninth article, "Automatic Metadata Annotation Based on User Preference Evaluation Patterns," by Mari Saito, proposes effective automatic metadata annotation for content recommendations matched to user preferences. The tenth article, "Dynamic Sense Representation Using Conceptual Fuzzy Sets," by Hiroshi Sekiya et al., proposes a method to represent word senses, which vary dynamically depending on context, using conceptual fuzzy sets. The eleventh article, "Common Sense from the Web? Naturalness of Everyday Knowledge Retrieved from WWW," by Rafal Rzepka et al., is a challenging work on acquiring common-sense knowledge from information on the Web. The twelfth article, "Semantic Representation for Understanding Meaning Based on Correspondence Between Meanings," by Akira Takagi et al., proposes a new semantic representation for handling the Japanese language in natural language processing.

I thank the reviewers and contributors for their time and effort in making this special issue possible, and I wish to thank the JACIII editorial board, especially Professors Kaoru Hirota and Toshio Fukuda, the Editors-in-Chief, for inviting me to serve as Guest Editor of this Journal. Thanks also go to Kazuki Ohmori and Kenta Uchino of Fuji Technology Press for their sincere support.

