annotation data
Recently Published Documents


TOTAL DOCUMENTS

111
(FIVE YEARS 46)

H-INDEX

13
(FIVE YEARS 4)

2022 ◽  
Vol 40 (4) ◽  
pp. 1-32
Author(s):  
Rui Li ◽  
Cheng Yang ◽  
Tingwei Li ◽  
Sen Su

Relation extraction (RE), an important information extraction task, has long been constrained by limited annotation data. Distant supervision was proposed to label RE data automatically, which greatly increased the number of annotated instances. Unfortunately, the many noisy relation annotations introduced by automatic labeling have become a new obstacle. Some recent studies have shown that the teacher-student framework of knowledge distillation can alleviate the interference of noisy relation annotations via label softening. Nevertheless, we find that they still suffer from two problems: propagation of inaccurate dark knowledge and the constraint of a unified distillation temperature. In this article, we propose a simple and effective Multi-instance Dynamic Temperature Distillation (MiDTD) framework, which is model-agnostic and mainly involves two modules: multi-instance target fusion (MiTF) and dynamic temperature regulation (DTR). MiTF combines the teacher’s predictions for multiple sentences with the same entity pair to amend the inaccurate dark knowledge in each student’s target. DTR allocates adjustable distillation temperatures to different training instances so that the softness of most student targets is regulated to a moderate range. In experiments, we construct three concrete MiDTD instantiations with BERT, PCNN, and BiLSTM-based RE models, and the distilled students significantly outperform their teachers and the state-of-the-art (SOTA) methods.
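The two ideas in this abstract, fusing teacher predictions across sentences that share an entity pair and softening targets with a distillation temperature, can be illustrated with a minimal sketch. The temperature-scaled softmax is the standard knowledge-distillation form. The fusion rule below (a plain average of teacher logits) and the names `soften` and `fuse_bag` are assumptions for illustration only, not the MiTF or DTR procedures from the article.

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax: a higher temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_bag(teacher_logits_per_sentence):
    """Hypothetical fusion rule: average teacher logits over one entity-pair bag."""
    n = len(teacher_logits_per_sentence)
    return [sum(column) / n for column in zip(*teacher_logits_per_sentence)]

# Toy bag: three sentences mentioning the same entity pair, three relation classes.
bag = [[4.0, 1.0, 0.5], [0.2, 3.0, 0.1], [3.5, 0.8, 0.4]]
fused = fuse_bag(bag)
sharp_target = soften(fused, temperature=1.0)  # close to a hard label
soft_target = soften(fused, temperature=4.0)   # softer target for the student
```

Choosing a different temperature per training instance, rather than the single fixed value used here, is the part the abstract attributes to DTR.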


2022 ◽  
Vol 4 (1) ◽  
Author(s):  
Pavel P Kuksa ◽  
Yuk Yee Leung ◽  
Prabhakaran Gangadharan ◽  
Zivadin Katanic ◽  
Lauren Kleidermacher ◽  
...  

Querying massive functional genomic and annotation data collections, linking and summarizing the query results across data sources/data types are important steps in high-throughput genomic and genetic analytical workflows. However, these steps are made difficult by the heterogeneity and breadth of data sources, experimental assays, biological conditions/tissues/cell types and file formats. FILER (FunctIonaL gEnomics Repository) is a framework for querying large-scale genomics knowledge with a large, curated integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface. FILER uniquely provides: (i) streamlined access to >50 000 harmonized, annotated genomic datasets across >20 integrated data sources, >1100 tissues/cell types and >20 experimental assays; (ii) a scalable genomic querying interface; and (iii) ability to analyze and annotate user’s experimental data. This rich resource spans >17 billion GRCh37/hg19 and GRCh38/hg38 genomic records. Our benchmark querying 7 × 10^9 hg19 FILER records shows FILER is highly scalable, with a sub-linear 32-fold increase in querying time when increasing the number of queries 1000-fold from 1000 to 1 000 000 intervals. Together, these features facilitate reproducible research and streamline integrating/querying large-scale genomic data within analyses/workflows. FILER can be deployed on cloud or local servers (https://bitbucket.org/wanglab-upenn/FILER) for integration with custom pipelines and is freely available (https://lisanwanglab.org/FILER).
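The sub-linear scaling claim can be checked with a short calculation. The sketch below assumes a power-law relationship between query count and querying time fitted to the two reported endpoints; that model is an assumption made here for illustration, since the abstract gives only the 1000-fold and 32-fold figures.

```python
import math

query_growth = 1_000_000 / 1_000  # 1000-fold more query intervals (from the abstract)
time_growth = 32                  # reported increase in querying time

# Assumed model: time grows as queries**k. An exponent k below 1 means sub-linear scaling.
scaling_exponent = math.log(time_growth) / math.log(query_growth)

# Under the reported figures, each query costs this many times less at the larger batch size.
per_query_speedup = query_growth / time_growth
```

Under these assumptions the exponent comes out at roughly 0.5, so querying time grows approximately with the square root of the number of query intervals.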


2022 ◽  
Vol 16 ◽  
pp. 117793222110627
Author(s):  
Angelica Lindlöf

The hippocampus has been shown to have a major role in learning and memory, but also to participate in the regulation of emotions. However, its specific role(s) in memory are still unclear. Hippocampal damage or dysfunction mainly results in memory issues, especially in declarative memory, but in animal studies it has also been shown to lead to hyperactivity and difficulty in inhibiting previously taught responses. The brain structure is affected in neuropathological disorders, such as Alzheimer’s, epilepsy, and schizophrenia, and also by depression and stress. The hippocampus is far from mature at birth and undergoes substantial development throughout infant and juvenile life. The aim of this study was to survey genes that are highly expressed throughout the postnatal period in mouse hippocampus and that have also been linked to an abnormal phenotype through mutational studies, in order to achieve a greater understanding of hippocampal functions during postnatal development. Publicly available gene expression data from C57BL/6 mouse hippocampus were analyzed; from a total of 5 time points (postnatal days 1, 10, 15, 21, and 30), 547 genes highly expressed at all of these time points were selected for analysis. Highly expressed genes are considered to be of potential biological importance and appear to be multifunctional, and hence any dysfunction in such a gene will most likely have a large impact on the development of abilities during the postnatal and juvenile period. Phenotypic annotation data downloaded from the Mouse Genome Informatics database were analyzed for these genes, and the results showed that many of them are important for proper embryo development and infant survival, proper growth, and increase in body size, as well as for voluntary movement functions, motor coordination, and balance. The results also indicated an association with seizures, primarily characterized by uncontrolled motor activity, and with the development of proper grooming abilities. The complete list of genes and their phenotypic annotation data have been compiled in a file for easy access.
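The selection step described here, keeping only genes that are highly expressed at every time point, amounts to a set intersection. The sketch below is a minimal illustration: the abstract does not state how "highly expressed" was defined, so the fixed `threshold`, the function name, and the toy gene names and values are all assumptions.

```python
def highly_expressed_at_all(expression, threshold):
    """Return the genes whose expression meets the threshold at every time point."""
    per_timepoint = [
        {gene for gene, value in genes.items() if value >= threshold}
        for genes in expression.values()
    ]
    return set.intersection(*per_timepoint) if per_timepoint else set()

# Toy values keyed by postnatal day; gene names and numbers are illustrative only.
expression = {
    "P1":  {"GeneA": 9.1, "GeneB": 8.7, "GeneC": 3.2},
    "P10": {"GeneA": 9.4, "GeneB": 4.0, "GeneC": 3.0},
    "P15": {"GeneA": 8.9, "GeneB": 8.8, "GeneC": 2.9},
    "P21": {"GeneA": 9.0, "GeneB": 8.5, "GeneC": 3.1},
    "P30": {"GeneA": 9.2, "GeneB": 8.6, "GeneC": 3.3},
}
selected = highly_expressed_at_all(expression, threshold=8.0)
```

In this toy data only `GeneA` survives, because `GeneB` drops below the assumed threshold at one time point.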


Metabolites ◽  
2021 ◽  
Vol 12 (1) ◽  
pp. 14
Author(s):  
Anurag Passi ◽  
Juan D. Tibocha-Bonilla ◽  
Manish Kumar ◽  
Diego Tec-Campos ◽  
Karsten Zengler ◽  
...  

Genome-scale metabolic models (GEMs) enable the mathematical simulation of the metabolism of archaea, bacteria, and eukaryotic organisms. GEMs quantitatively define a relationship between genotype and phenotype by contextualizing different types of Big Data (e.g., genomics, metabolomics, and transcriptomics). In this review, we analyze the available Big Data useful for metabolic modeling and compile the available GEM reconstruction tools that integrate Big Data. We also discuss recent applications in industry and research that include predicting phenotypes, elucidating metabolic pathways, producing industry-relevant chemicals, identifying drug targets, and generating knowledge to better understand host-associated diseases. In addition to the up-to-date review of GEMs currently available, we assessed a plethora of tools for developing new GEMs that include macromolecular expression and dynamic resolution. Finally, we provide a perspective on emerging areas, such as annotation, data management, and machine learning, in which GEMs will play a key role in the further utilization of Big Data.


2021 ◽  
Vol 19 (3) ◽  
pp. e23
Author(s):  
Sizhuo Ouyang ◽  
Yuxing Wang ◽  
Kaiyin Zhou ◽  
Jingbo Xia

The coronavirus disease 2019 (COVID-19) literature has been growing dramatically, and the increased volume of text makes it possible to perform large-scale text mining and knowledge discovery. Curation of these texts has therefore become a crucial issue for the Biomedical Natural Language Processing (BioNLP) community in retrieving important information about the mechanism of COVID-19. PubAnnotation is an aligned annotation system that provides an efficient platform for biological curators to upload their annotations or merge other external annotations. Inspired by the integration among multiple useful COVID-19 annotations, we merged three annotation resources into the LitCovid data set and constructed a cross-annotated corpus, LitCovid-AGAC. This corpus consists of 12 labels (Mutation, Species, Gene, and Disease from PubTator; GO and CHEBI from OGER; and Var, MPA, CPA, NegReg, PosReg, and Reg from AGAC) over 50,018 COVID-19 abstracts in LitCovid. The corpus contains sufficiently abundant information to potentially unveil hidden knowledge about the pathological mechanism of COVID-19.


2021 ◽  
Author(s):  
Zhenshan Bao ◽  
Yuezhang Wang ◽  
Wenbo Zhang

Most existing approaches to named entity recognition (NER) rely on a large amount of high-quality annotations or on a more complete, task-specific entity list. However, in practice, it is very expensive to obtain manually annotated data, and the list of entities that can be used is often not comprehensive. Using an entity list to automatically annotate data is a common annotation method, but under low-resource conditions the automatically annotated data is usually imperfect and includes incompletely annotated or unannotated data. In this paper, we propose a NER system for complex data processing, which can use an entity list containing only a few entities to obtain incomplete annotation data and train the NER model without human annotation. Our system extracts semantic features from a small number of samples by introducing a pre-trained language model. Based on the incomplete-annotation model, we relabel the data using a cross-iteration approach. We use a data filtering method to filter the training data used in the iteration process, and re-annotate the incomplete data through multiple iterations to obtain high-quality data. Each iteration groups and processes the data according to annotation type, which improves model performance faster and reduces the number of iterations. The experimental results demonstrate that our proposed system can effectively perform low-resource NER tasks without human annotation.
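The starting point of this pipeline, automatic annotation from a small entity list, can be sketched as a longest-match dictionary lookup that emits BIO labels. This is a generic illustration under stated assumptions, not the authors' system: the function name, the BIO scheme, and the toy entity list are hypothetical. It shows why such annotation is incomplete, since any entity missing from the list is silently labeled `O`.

```python
def annotate_with_entity_list(tokens, entity_list):
    """Tag tokens with BIO labels by longest-match lookup against an entity list."""
    labels = ["O"] * len(tokens)
    # Try longer entity names first so "New York" wins over a shorter overlapping entry.
    entries = sorted(
        ((name.split(), etype) for name, etype in entity_list.items() if name.split()),
        key=lambda item: len(item[0]),
        reverse=True,
    )
    i = 0
    while i < len(tokens):
        for words, etype in entries:
            if tokens[i:i + len(words)] == words:
                labels[i] = "B-" + etype
                for j in range(i + 1, i + len(words)):
                    labels[j] = "I-" + etype
                i += len(words)
                break
        else:
            i += 1  # no entry matched at this position
    return labels

# Hypothetical two-entry list; "Paris" is absent, so it stays unlabeled.
entity_list = {"New York": "LOC", "Alice": "PER"}
tokens = "Alice flew from New York to Paris".split()
labels = annotate_with_entity_list(tokens, entity_list)
```

The missed `Paris` token is the kind of gap that the iterative relabeling described in the abstract is meant to fill.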


2021 ◽  
Vol 7 (9) ◽  
pp. 699
Author(s):  
Zhigang Hao ◽  
Yuanyuan Li ◽  
Yunyun Jiang ◽  
Jiaqing Xu ◽  
Jianqiang Li ◽  
...  

Fusarium graminearum is a plant pathogen of global importance which causes not only significant yield loss but also crop spoilage due to mycotoxins that render grain unsafe for human or livestock consumption. Although the full genomes of several F. graminearum isolates from different parts of the world have been sequenced, there are no similar studies of isolates originating from China. The current study sought to address this by using Oxford Nanopore Technology (ONT) to sequence the F. graminearum isolate FG-12, which was isolated from the roots of maize seedlings exhibiting typical symptoms of blight growing in Gansu province, China. The FG-12 isolate was found to have a 35.9 Mb genome comprising five scaffolds corresponding to the four chromosomes and mitochondrial DNA of the F. graminearum type strain, PH-1. The genome was found to contain approximately 2.23% repetitive sequence and to encode 12,470 predicted genes. Additional bioinformatic analysis identified 437 genes that were predicted to be secreted effectors, one of which was confirmed to trigger a hypersensitive response (HR) in the leaves of Nicotiana benthamiana during transient expression experiments utilizing agro-infiltration. The F. graminearum FG-12 genome sequence and annotation data produced in the current study provide an extremely useful resource for both intra- and inter-species comparative analyses as well as for gene functional studies, and could greatly advance our understanding of this important plant pathogen.


Data in Brief ◽  
2021 ◽  
Vol 35 ◽  
pp. 106770
Author(s):  
Melvin Chan ◽  
Emmanuel K. Tse ◽  
Seraph Bao ◽  
Mai Berger ◽  
Nadia Beyzaei ◽  
...  

Viruses ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 437
Author(s):  
David F. Nieuwenhuijse ◽  
Bas B. Oude Munnink ◽  
Marion P. G. Koopmans

Experiments in which complex virome sequencing data is generated remain difficult to explore and unpack for scientists without a background in data science. The processing of raw sequencing data by high-throughput sequencing workflows usually results in contigs in FASTA format coupled to an annotation file linking the contigs to a reference sequence or taxonomic identifier. The next step is to compare the virome of different samples based on the metadata of the experimental setup and extract sequences of interest that can be used in subsequent analyses. The viromeBrowser is an application written in the open-source R shiny framework that was developed in collaboration with end-users and is focused on three common data analysis steps. First, the application allows interactive filtering of annotations by default or custom quality thresholds. Next, multiple samples can be visualized to facilitate comparison of contig annotations based on sample-specific metadata values. Last, the application makes it easy for users to extract sequences of interest in FASTA format. With the interactive features in the viromeBrowser we aim to enable scientists without a data science background to compare and extract annotation data and sequences from virome sequencing analysis results.
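The first and last steps, filtering annotations by a quality threshold and extracting the matching contigs as FASTA, can be sketched in a few lines. This is not viromeBrowser code (the application itself is written in R shiny). It is a Python illustration in which the function names, the annotation fields `contig`, `taxon`, and `identity`, and the use of percent identity as the quality measure are all assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping contig id to sequence."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""  # id is the first word of the header
            records[current] = ""
        elif current is not None:
            records[current] += line  # join wrapped sequence lines
    return records

def extract_contigs(fasta_text, annotations, min_identity):
    """Return FASTA text for contigs whose annotation meets the identity threshold."""
    sequences = parse_fasta(fasta_text)
    keep = [
        row["contig"] for row in annotations
        if row["identity"] >= min_identity and row["contig"] in sequences
    ]
    return "".join(f">{cid}\n{sequences[cid]}\n" for cid in keep)

# Toy inputs; contig names, taxa, and identity values are illustrative only.
fasta_text = ">contig1 len=8\nACGTACGT\n>contig2\nGGGCCC\nTTT\n"
annotations = [
    {"contig": "contig1", "taxon": "virus A", "identity": 97.5},
    {"contig": "contig2", "taxon": "virus B", "identity": 62.0},
]
selected = extract_contigs(fasta_text, annotations, min_identity=90.0)
```

With the assumed threshold of 90, only `contig1` is written out; lowering the threshold would include `contig2` as well.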

