distant supervision
Recently Published Documents


TOTAL DOCUMENTS

260
(FIVE YEARS 149)

H-INDEX

14
(FIVE YEARS 5)

2022 ◽  
Vol 40 (4) ◽  
pp. 1-32
Author(s):  
Rui Li ◽  
Cheng Yang ◽  
Tingwei Li ◽  
Sen Su

Relation extraction (RE), an important information extraction task, faced the great challenge brought by limited annotation data. To this end, distant supervision was proposed to automatically label RE data, and thus largely increased the number of annotated instances. Unfortunately, lots of noise relation annotations brought by automatic labeling become a new obstacle. Some recent studies have shown that the teacher-student framework of knowledge distillation can alleviate the interference of noise relation annotations via label softening. Nevertheless, we find that they still suffer from two problems: propagation of inaccurate dark knowledge and constraint of a unified distillation temperature . In this article, we propose a simple and effective Multi-instance Dynamic Temperature Distillation (MiDTD) framework, which is model-agnostic and mainly involves two modules: multi-instance target fusion (MiTF) and dynamic temperature regulation (DTR). MiTF combines the teacher’s predictions for multiple sentences with the same entity pair to amend the inaccurate dark knowledge in each student’s target. DTR allocates alterable distillation temperatures to different training instances to enable the softness of most student’s targets to be regulated to a moderate range. In experiments, we construct three concrete MiDTD instantiations with BERT, PCNN, and BiLSTM-based RE models, and the distilled students significantly outperform their teachers and the state-of-the-art (SOTA) methods.


Author(s):  
Li-Ming Chen ◽  
Bao-Xin Xiu ◽  
Zhao-Yun Ding

AbstractFor short text classification, insufficient labeled data, data sparsity, and imbalanced classification have become three major challenges. For this, we proposed multiple weak supervision, which can label unlabeled data automatically. Different from prior work, the proposed method can generate probabilistic labels through conditional independent model. What’s more, experiments were conducted to verify the effectiveness of multiple weak supervision. According to experimental results on public dadasets, real datasets and synthetic datasets, unlabeled imbalanced short text classification problem can be solved effectively by multiple weak supervision. Notably, without reducing precision, recall, and F1-score can be improved by adding distant supervision clustering, which can be used to meet different application needs.


Author(s):  
Juan-Luis García-Mendoza ◽  
Luis Villaseñor-Pineda ◽  
Felipe Orihuela-Espina ◽  
Lázaro Bustio-Martínez

Distant Supervision is an approach that allows automatic labeling of instances. This approach has been used in Relation Extraction. Still, the main challenge of this task is handling instances with noisy labels (e.g., when two entities in a sentence are automatically labeled with an invalid relation). The approaches reported in the literature addressed this problem by employing noise-tolerant classifiers. However, if a noise reduction stage is introduced before the classification step, this increases the macro precision values. This paper proposes an Adversarial Autoencoders-based approach for obtaining a new representation that allows noise reduction in Distant Supervision. The representation obtained using Adversarial Autoencoders minimize the intra-cluster distance concerning pre-trained embeddings and classic Autoencoders. Experiments demonstrated that in the noise-reduced datasets, the macro precision values obtained over the original dataset are similar using fewer instances considering the same classifier. For example, in one of the noise-reduced datasets, the macro precision was improved approximately 2.32% using 77% of the original instances. This suggests the validity of using Adversarial Autoencoders to obtain well-suited representations for noise reduction. Also, the proposed approach maintains the macro precision values concerning the original dataset and reduces the total instances needed for classification.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Xindong You ◽  
Meijing Yang ◽  
Junmei Han ◽  
Jiangwei Ma ◽  
Gang Xiao ◽  
...  

The effective organization and utilization of military equipment data is an important cornerstone for constructing knowledge system. Building a knowledge graph in the field of military equipment can effectively describe the relationship between entity and entity attribute information. Therefore, relevant personnel can obtain information quickly and accurately. Attribute extraction is an important part of building the knowledge graph. Given the lack of annotated data in the field of military equipment, we propose a new data annotation method, which adopts the idea of distant supervision to automatically build the attribute extraction dataset. We convert the attribute extraction task into a sequence annotation task. At the same time, we propose a RoBERTa-BiLSTM-CRF-SEL-based attribute extraction method. Firstly, a list of attribute name synonyms is constructed, then a corpus of military equipment attributes is obtained through automatic annotation of semistructured data in Baidu Encyclopedia. RoBERTa is used to obtain the vector encoding of the text. Then, input it into the entity boundary prediction layer to label the entity head and tail, and input the BiLSTM-CRF layer to predict the attribute label. The experimental results show that the proposed method can effectively perform attribute extraction in the military equipment domain. The F 1 value of the model reaches 77% on the constructed attribute extraction dataset, which outperforms the current state-of-art model.


2021 ◽  
Vol 21 (S7) ◽  
Author(s):  
Bhavani Singh Agnikula Kshatriya ◽  
Elham Sagheb ◽  
Chung-Il Wi ◽  
Jungwon Yoon ◽  
Hee Yun Seol ◽  
...  

Abstract Background There are significant variabilities in guideline-concordant documentation in asthma care. However, assessing clinician’s documentation is not feasible using only structured data but requires labor-intensive chart review of electronic health records (EHRs). A certain guideline element in asthma control factors, such as review inhaler techniques, requires context understanding to correctly capture from EHR free text. Methods The study data consist of two sets: (1) manual chart reviewed data—1039 clinical notes of 300 patients with asthma diagnosis, and (2) weakly labeled data (distant supervision)—27,363 clinical notes from 800 patients with asthma diagnosis. A context-aware language model, Bidirectional Encoder Representations from Transformers (BERT) was developed to identify inhaler techniques in EHR free text. Both original BERT and clinical BioBERT (cBERT) were applied with a cost-sensitivity to deal with imbalanced data. The distant supervision using weak labels by rules was also incorporated to augment the training set and alleviate a costly manual labeling process in the development of a deep learning algorithm. A hybrid approach using post-hoc rules was also explored to fix BERT model errors. The performance of BERT with/without distant supervision, hybrid, and rule-based models were compared in precision, recall, F-score, and accuracy. Results The BERT models on the original data performed similar to a rule-based model in F1-score (0.837, 0.845, and 0.838 for rules, BERT, and cBERT, respectively). The BERT models with distant supervision produced higher performance (0.853 and 0.880 for BERT and cBERT, respectively) than without distant supervision and a rule-based model. The hybrid models performed best in F1-score of 0.877 and 0.904 over the distant supervision on BERT and cBERT. Conclusions The proposed BERT models with distant supervision demonstrated its capability to identify inhaler techniques in EHR free text, and outperformed both the rule-based model and BERT models trained on the original data. With a distant supervision approach, we may alleviate costly manual chart review to generate the large training data required in most deep learning-based models. A hybrid model was able to fix BERT model errors and further improve the performance.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Qian Yi ◽  
Guixuan Zhang ◽  
Shuwu Zhang

Distant supervision is an effective method to automatically collect large-scale datasets for relation extraction (RE). Automatically constructed datasets usually comprise two types of noise: the intrasentence noise and the wrongly labeled noisy sentence. To address issues caused by the above two types of noise and improve distantly supervised relation extraction, this paper proposes a novel distantly supervised relation extraction model, which consists of an entity-based gated convolution sentence encoder and a multilevel sentence selective attention (Matt) module. Specifically, we first apply an entity-based gated convolution operation to force the sentence encoder to extract entity-pair-related features and filter out useless intrasentence noise information. Furthermore, the multilevel attention schema fuses the bag information to obtain a fine-grained bag-specific query vector, which can better identify valid sentences and reduce the influence of wrongly labeled sentences. Experimental results on a large-scale benchmark dataset show that our model can effectively reduce the influence of the above two types of noise and achieves state-of-the-art performance in relation extraction.


2021 ◽  
Author(s):  
Qingwen Tian ◽  
Shixing Zhou ◽  
Yu Cheng ◽  
Jianxia Chen ◽  
Yi Gao ◽  
...  

Knowledge Graph is a semantic network that reveals the relationship between entities, which construction is to describe various entities, concepts and their relationships in the real world. Since knowledge graph can effectively reveal the relationship between the different knowledge items, it has been widely utilized in the intelligent education. In particular, relation extraction is the critical part of knowledge graph and plays a very important role in the construction of knowledge graph. According to the different magnitude of data labeling, entity relationship extraction tasks of deep learning can be divided into two categories: supervised and distant supervised. Supervised learning approaches can extract effective entity relationships. However, these approaches rely on labeled data heavily resulting in the time-consuming and laborconsuming. The distant supervision approach is widely concerned by researchers because it can generate the entity relation extraction automatically. However, the development and application of the distant supervised approach has been seriously hindered due to the noises, lack of information and disequilibrium in the relation extraction tasks. Inspired by the above analysis, the paper proposes a novel curriculum points relationship extraction model based on the distant supervision. In particular, firstly the research of the distant supervised relationship extraction model based on the sentence bag attention mechanism to extract the relationship of curriculum points. Secondly, the research of knowledge graph construction based on the knowledge ontology. Thirdly, the development of curriculum semantic retrieval platform based on Web. Compared with the existing advanced models, the AUC of this system is increased by 14.2%; At the same time, taking "big data processing" course in computer field as an example, the relationship extraction result with F1 value of 88.1% is realized. The experimental results show that the proposed model provides an effective solution for the development and application of knowledge graph in the field of intelligent education.


Sign in / Sign up

Export Citation Format

Share Document