Modeling Missing Data in Distant Supervision for Information Extraction

Author(s):  
Alan Ritter ◽  
Luke Zettlemoyer ◽  
Mausam ◽  
Oren Etzioni

Distant supervision algorithms learn information extraction models given only large, readily available databases and text collections. Most previous work has used heuristics for generating labeled data, for example assuming that facts not contained in the database are not mentioned in the text, and that facts in the database must be mentioned at least once. In this paper, we propose a new latent-variable approach that models missing data. This provides a natural way to incorporate side information, for instance modeling the intuition that text will often mention rare entities which are likely to be missing in the database. Despite the added complexity introduced by reasoning about missing data, we demonstrate that a carefully designed local search approach to inference is very accurate and scales to large datasets. Experiments demonstrate improved performance for binary and unary relation extraction when compared to learning with heuristic labels, including on average a 27% increase in area under the precision-recall curve in the binary case.
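The heuristic labeling that the abstract contrasts itself with can be illustrated with a minimal sketch. This is not the authors' code: the function name `heuristic_labels`, the toy knowledge base, and the example sentences are all hypothetical, and the sketch assumes entity mentions have already been identified.

```python
from itertools import permutations

def heuristic_labels(sentences, kb):
    """Label every ordered entity pair in every sentence from a database.

    sentences: list of (text, entities) tuples; entities is a list of names.
    kb: dict mapping (subject, object) pairs to a relation name.
    Both structures are hypothetical stand-ins for a real corpus and database.
    """
    labeled = []
    for text, entities in sentences:
        for subj, obj in permutations(entities, 2):
            # Pairs absent from the database are labeled "NA" (no relation).
            # This is the assumption that fails when the database is incomplete.
            relation = kb.get((subj, obj), "NA")
            labeled.append((text, subj, obj, relation))
    return labeled

# Toy data: the second fact is stated in text but missing from the database.
kb = {("Barack Obama", "Honolulu"): "born_in"}
sentences = [
    ("Barack Obama was born in Honolulu.", ["Barack Obama", "Honolulu"]),
    ("Jane Roe was born in Springfield.", ["Jane Roe", "Springfield"]),
]
labels = heuristic_labels(sentences, kb)
for row in labels:
    print(row)
```

In this toy run the second sentence expresses a birthplace relation, yet it is labeled "NA" because the fact is absent from the database. That is the kind of missing-data error the abstract's latent-variable approach is meant to model.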

2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Nada Boudjellal ◽  
Huaping Zhang ◽  
Asif Khan ◽  
Arshad Ahmad

With the accelerating growth of big data, especially in healthcare, information extraction is needed more than ever, for it can convert unstructured information into easily interpretable structured data. Relation extraction is the second of the two important tasks of information extraction. This study presents an overview of relation extraction using distant supervision, providing a generalized architecture of this task based on the state-of-the-art work that proposed this method. It also surveys the methods used in the literature on this topic and describes the knowledge bases and corpora used in the process, which can be helpful for beginner practitioners seeking knowledge on this subject. Finally, the limitations of the proposed approaches and future challenges are highlighted, and possible solutions are proposed.


2022 ◽  
Vol 40 (4) ◽  
pp. 1-32
Author(s):  
Rui Li ◽  
Cheng Yang ◽  
Tingwei Li ◽  
Sen Su

Relation extraction (RE), an important information extraction task, faces the great challenge of limited annotated data. To this end, distant supervision was proposed to automatically label RE data, which greatly increased the number of annotated instances. Unfortunately, the many noisy relation annotations produced by automatic labeling have become a new obstacle. Some recent studies have shown that the teacher-student framework of knowledge distillation can alleviate the interference of noisy relation annotations via label softening. Nevertheless, we find that they still suffer from two problems: propagation of inaccurate dark knowledge and the constraint of a unified distillation temperature. In this article, we propose a simple and effective Multi-instance Dynamic Temperature Distillation (MiDTD) framework, which is model-agnostic and mainly involves two modules: multi-instance target fusion (MiTF) and dynamic temperature regulation (DTR). MiTF combines the teacher’s predictions for multiple sentences with the same entity pair to amend the inaccurate dark knowledge in each student’s target. DTR allocates alterable distillation temperatures to different training instances to enable the softness of most student’s targets to be regulated to a moderate range. In experiments, we construct three concrete MiDTD instantiations with BERT, PCNN, and BiLSTM-based RE models, and the distilled students significantly outperform their teachers and the state-of-the-art (SOTA) methods.
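The label softening mentioned above rests on a temperature-scaled softmax, which can be sketched as follows. This shows only the generic operation used in knowledge distillation, not the MiDTD rule for choosing a temperature per instance; the function name `soften` and the logit values are hypothetical.

```python
import math

def soften(logits, temperature):
    """Return a softmax over logits divided by the distillation temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over three relation classes.
teacher_logits = [4.0, 1.0, 0.5]
sharp = soften(teacher_logits, 1.0)  # ordinary softmax
soft = soften(teacher_logits, 4.0)   # higher temperature flattens the target
print(sharp)
print(soft)
```

A higher temperature moves probability mass from the top class to the others, so a single shared temperature can leave some targets too sharp and others too flat. That is the motivation the abstract gives for assigning different temperatures to different training instances.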


2021 ◽  
Author(s):  
Roopali Singh ◽  
Feipeng Zhang ◽  
Qunhua Li

High-throughput experiments are an essential part of modern biological and biomedical research. The outcomes of high-throughput biological experiments often have many missing observations due to signals below detection levels. For example, most single-cell RNA-seq (scRNA-seq) protocols experience high levels of dropout due to the small amount of starting material, leading to a majority of reported expression levels being zero. Though missing data contain information about reproducibility, they are often excluded in the reproducibility assessment, potentially generating misleading assessments. In this paper, we develop a regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors (e.g., platform or sequencing depth) when a large number of measurements are missing. Using a latent variable approach, we extend correspondence curve regression (CCR), a recently proposed method for assessing the effects of operational factors on reproducibility, to incorporate missing values. Using simulations, we show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method using a single-cell RNA-seq dataset collected on HCT116 cells. We compare the reproducibility of different library preparation platforms and study the effect of sequencing depth on reproducibility, thereby determining the cost-effective sequencing depth that is required to achieve sufficient reproducibility.


2013 ◽  
Author(s):  
Levent Dumenci ◽  
Robin Matsuyama ◽  
Robert Perera ◽  
Laura Kuhn ◽  
Laura Siminoff

2014 ◽  
Author(s):  
Miao Fan ◽  
Deli Zhao ◽  
Qiang Zhou ◽  
Zhiyuan Liu ◽  
Thomas Fang Zheng ◽  
...  

2019 ◽  
Author(s):  
Rina PY Lai ◽  
Michelle Renee Ellefson ◽  
Claire Hughes

Executive functions and metacognition are two cognitive predictors with well-established connections to academic performance. Despite sharing several theoretical characteristics, their overlap or independence concerning multiple academic outcomes remains under-researched. To address this gap, the present study applies a latent-variable approach to test a novel theoretical model that delineates the structural link between executive functions, metacognition, and academic outcomes. In whole-class sessions, 469 children aged 9 to 14 years (M = 11.93; SD = 0.92) completed four computerized executive function tasks (inhibition, working memory, cognitive flexibility, and planning), a self-reported metacognitive monitoring questionnaire, and three standardized tests of academic ability. The results suggest that executive functions and metacognitive monitoring are not interchangeable in the educational context and that they have both shared and unique contributions to diverse academic outcomes. The findings are important for elucidating the relationship between two domain-general cognitive skills (executive functions and metacognition) and domain-specific academic skills.


2021 ◽  
Vol 45 (3) ◽  
pp. 159-177
Author(s):  
Chen-Wei Liu

Missing not at random (MNAR) modeling for non-ignorable missing responses usually assumes that the latent variable distribution is a bivariate normal distribution. Such an assumption is rarely verified and often employed as a standard in practice. Recent studies for “complete” item responses (i.e., no missing data) have shown that ignoring the nonnormal distribution of a unidimensional latent variable, especially a skewed or bimodal one, can yield biased estimates and misleading conclusions. However, dealing with a bivariate nonnormal latent variable distribution in the presence of MNAR data has not been investigated. This article proposes to extend unidimensional empirical histogram and Davidian curve methods to simultaneously deal with nonnormal latent variable distribution and MNAR data. A simulation study is carried out to demonstrate the consequence of ignoring bivariate nonnormal distribution on parameter estimates, followed by an empirical analysis of “don’t know” item responses. The results presented in this article show that examining the assumption of bivariate nonnormal latent variable distribution should be considered routine for MNAR data to minimize the impact of nonnormality on parameter estimates.


2013 ◽  
Vol 85 (4) ◽  
pp. 1346-1356
Author(s):  
Marc H. Bornstein ◽  
Chun-Shin Hahn ◽  
Diane L. Putnick ◽  
Joan T. D. Suwalsky
