Span Model for Open Information Extraction on Accurate Corpus

2020 ◽  
Vol 34 (05) ◽  
pp. 9523-9530
Author(s):  
Junlang Zhan ◽  
Hai Zhao

Open Information Extraction (Open IE) is a challenging task, especially due to its brittle data basis. Most Open IE systems have to be trained on automatically built corpora and evaluated on inaccurate test sets. In this work, we first alleviate this difficulty on both the training and the test side. For the former, we propose an improved model design that exploits the training dataset more fully. For the latter, we present our accurately re-annotated benchmark test set (Re-OIE2016), based on a series of linguistic observations and analyses. Then, we introduce a span model in place of the previously adopted sequence labeling formulation for n-ary Open IE. Our newly introduced model achieves new state-of-the-art performance on both benchmark evaluation datasets.
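To make the contrast between a span formulation and a sequence labeling formulation concrete, the following is a minimal sketch and not the authors' model. It only illustrates the two output representations: a span approach scores candidate (start, end) token spans, while sequence labeling assigns one BIO tag per token. The function names, the `max_width` limit, and the labels `ARG0`, `REL`, and `ARG1` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def enumerate_spans(tokens: List[str], max_width: int) -> List[Span]:
    """List every candidate span containing at most max_width tokens.

    A span-based extractor would score each candidate as a predicate or
    an argument; the scoring model itself is out of scope here.
    """
    spans: List[Span] = []
    for start in range(len(tokens)):
        last = min(start + max_width, len(tokens))
        for end in range(start, last):
            spans.append((start, end))
    return spans


def spans_to_bio(n_tokens: int, labeled_spans: Dict[Span, str]) -> List[str]:
    """Convert labeled, non-overlapping spans into per-token BIO tags."""
    tags = ["O"] * n_tokens
    for (start, end), label in labeled_spans.items():
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


if __name__ == "__main__":
    tokens = "Barack Obama was born in Hawaii".split()
    print(enumerate_spans(tokens, max_width=2))
    # Hypothetical gold annotation for one extraction.
    gold = {(0, 1): "ARG0", (2, 3): "REL", (4, 5): "ARG1"}
    print(spans_to_bio(len(tokens), gold))
```

The number of candidate spans grows with sentence length times `max_width`, which is why span-based systems usually cap the width or prune candidates.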

Information ◽  
2019 ◽  
Vol 10 (7) ◽  
pp. 228 ◽  
Author(s):  
Daniela Barreiro Claro ◽  
Marlo Souza ◽  
Clarissa Castellã Xavier ◽  
Leandro Oliveira

The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages increases, highlighting the importance of research into Open Information Extraction (OIE) techniques. Different OIE methods have dealt with features from a unique language; however, few approaches tackle multilingual aspects. In those approaches, multilingualism is restricted to processing text in different languages, rather than exploring cross-linguistic resources, which results in low precision due to the use of general rules. Multilingual methods have been applied to numerous problems in Natural Language Processing, achieving satisfactory results and demonstrating that knowledge acquisition for a language can be transferred to other languages to improve the quality of the facts extracted. We argue that a multilingual approach can enhance OIE methods as it is ideal to evaluate and compare OIE systems, and therefore can be applied to the collected facts. In this work, we discuss how the transfer of knowledge between languages can increase acquisition from multilingual approaches. We provide a roadmap of the Multilingual Open IE area concerning state-of-the-art studies. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus to evaluate and compare multilingual systems.


2020 ◽  
Vol 10 (16) ◽  
pp. 5630
Author(s):  
Dimitris Papadopoulos ◽  
Nikolaos Papadakis ◽  
Antonis Litke

The usefulness of automated information extraction tools in generating structured knowledge from unstructured and semi-structured machine-readable documents is limited by challenges related to the variety and intricacy of the targeted entities, the complex linguistic features of heterogeneous corpora, and the computational availability for readily scaling to large amounts of text. In this paper, we argue that addressing the redundancy and ambiguity of subject–predicate–object (SPO) triples in open information extraction systems has to be treated as an equally important step in order to ensure the quality and preciseness of generated triples. To this end, we propose a pipeline approach for information extraction from large corpora, encompassing a series of natural language processing tasks. Our methodology consists of four steps: i. in-place coreference resolution, ii. extractive text summarization, iii. parallel triple extraction, and iv. entity enrichment and graph representation. We demonstrate our methodology on a large medical dataset (CORD-19), relying on state-of-the-art tools to fulfil the aforementioned steps and extract triples that are subsequently mapped to a comprehensive ontology of biomedical concepts. We evaluate the effectiveness of our information extraction method by comparing it in terms of precision, recall, and F1-score with state-of-the-art OIE engines and demonstrate its capabilities on a set of data exploration tasks.


Author(s):  
Daniela Barreiro Claro ◽  
Marlo Souza ◽  
Clarissa Castellã Xavier ◽  
Leandro Oliveira

The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages increases, which points to the importance of researching Open Information Extraction (OIE) techniques. Different OIE methods have dealt with features from a unique language. On the other hand, few approaches tackle multilingual aspects. In such approaches, multilingualism is only treated as an extraction method, which results in low precision due to the use of general rules. Multilingual methods have been applied to a vast number of problems in Natural Language Processing, achieving satisfactory results and demonstrating that knowledge acquisition for a language can be transferred to other languages to improve the quality of the facts extracted. We state that a multilingual approach can enhance OIE methods, being ideal to evaluate and compare OIE systems, and as a consequence, can be applied to the collected facts. In this work, we discuss how the transfer of knowledge between languages can increase the acquisition from multilingual approaches. We provide a roadmap of the Multilingual Open IE area concerning state-of-the-art studies. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus to evaluate and compare multilingual systems.


2021 ◽  
pp. 1-12
Author(s):  
Rafael Gallardo García ◽  
Beatriz Beltrán Martínez ◽  
Carlos Hernández Gracidas ◽  
Darnes Vilariño Ayala

Current state-of-the-art image captioning systems that can read text and integrate it into the generated descriptions need high processing power and memory usage, which limits the sustainability and usability of the models (as they require expensive and very specialized hardware). The present work introduces two alternative versions (L-M4C and L-CNMT) of top architectures on the TextCaps challenge, which were mainly adapted to achieve near-state-of-the-art performance while being lighter in memory than the original architectures. This is mainly achieved by using distilled or smaller pre-trained models in the text- and OCR-embedding modules. On the one hand, a distilled version of BERT was used to reduce the size of the text-embedding module (the distilled model has 59% fewer parameters). On the other hand, the OCR context processor in both architectures was replaced by Global Vectors (GloVe) instead of FastText pre-trained vectors, which can reduce the memory used by the OCR-embedding module by up to 94%. Two of the three models presented in this work surpassed the challenge baseline (M4C-Captioner) on the evaluation and test sets. Our best lighter architecture reached a CIDEr score of 88.24 on the test set, which is 7.25 points above the baseline model.


2020 ◽  
Vol 6 (6) ◽  
pp. 41 ◽  
Author(s):  
Björn Barz ◽  
Joachim Denzler

The CIFAR-10 and CIFAR-100 datasets are two of the most heavily benchmarked datasets in computer vision and are often used to evaluate novel methods and model architectures in the field of deep learning. However, we find that 3.3% and 10% of the images from the test sets of these datasets have duplicates in the training set. These duplicates are easily recognizable by memorization and may, hence, bias the comparison of image recognition techniques regarding their generalization capability. To eliminate this bias, we provide the “fair CIFAR” (ciFAIR) dataset, where we replaced all duplicates in the test sets with new images sampled from the same domain. The training set remains unchanged, in order not to invalidate pre-trained models. We then re-evaluate the classification performance of various popular state-of-the-art CNN architectures on these new test sets to investigate whether recent research has overfitted to memorizing data instead of learning abstract concepts. We find a significant drop in classification accuracy of between 9% and 14% relative to the original performance on the duplicate-free test set. We make both the ciFAIR dataset and pre-trained models publicly available and furthermore maintain a leaderboard for tracking the state of the art.
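The abstract does not say how the duplicates were found, so the following is only a sketch of one plausible way to flag duplicate candidates between a training set and a test set: compare feature vectors by cosine similarity and report test items whose nearest training item exceeds a threshold. The feature vectors, the function names, and the `threshold` value of 0.99 are assumptions for illustration; in practice the vectors might come from a pre-trained network, and flagged pairs would still need manual inspection.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_candidates(
    train: Sequence[Sequence[float]],
    test: Sequence[Sequence[float]],
    threshold: float = 0.99,  # assumed cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Return (test_index, nearest_train_index, similarity) for every
    test vector whose nearest training vector reaches the threshold."""
    candidates = []
    for test_idx, test_vec in enumerate(test):
        best_idx, best_sim = -1, -1.0
        for train_idx, train_vec in enumerate(train):
            sim = cosine(test_vec, train_vec)
            if sim > best_sim:
                best_idx, best_sim = train_idx, sim
        if best_sim >= threshold:
            candidates.append((test_idx, best_idx, best_sim))
    return candidates


if __name__ == "__main__":
    # Toy feature vectors standing in for image embeddings.
    train_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
    test_feats = [[0.0, 2.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.5, 0.71]]
    for test_idx, train_idx, sim in find_duplicate_candidates(train_feats, test_feats):
        print(f"test {test_idx} ~ train {train_idx} (similarity {sim:.4f})")
```

The brute-force comparison is quadratic in dataset size; for datasets the size of CIFAR, an approximate nearest-neighbor index would be the usual substitute.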


2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution–difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more proper training and test sets than previous sampling methods, including random and non-random sampling.
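The selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: candidate splits are generated by plain random shuffling instead of R-value-based sampling, a single numeric feature is compared instead of several, and the distribution difference is measured as the L1 distance between normalized histograms (feature importance is not used). All function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid zero width when lo == hi
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values) or 1
    return [c / total for c in counts]


def hist_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """L1 distance between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


def pick_split(
    values: Sequence[float],
    test_fraction: float = 0.25,
    n_candidates: int = 50,
    bins: int = 5,
    seed: int = 0,
) -> Tuple[float, List[int], List[int]]:
    """Generate candidate train/test splits and keep the one whose train
    and test histograms are jointly closest to the whole dataset's."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    full = histogram(values, lo, hi, bins)
    n_test = max(1, int(len(values) * test_fraction))
    best = None
    for _ in range(n_candidates):
        idx = list(range(len(values)))
        rng.shuffle(idx)  # stand-in for R-value-based sampling
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        score = hist_distance(
            full, histogram([values[i] for i in train_idx], lo, hi, bins)
        ) + hist_distance(
            full, histogram([values[i] for i in test_idx], lo, hi, bins)
        )
        if best is None or score < best[0]:
            best = (score, sorted(train_idx), sorted(test_idx))
    return best


if __name__ == "__main__":
    data = [float(i % 10) for i in range(40)]
    score, train_idx, test_idx = pick_split(data)
    print(f"distribution difference: {score:.3f}")
    print(f"train size: {len(train_idx)}, test size: {len(test_idx)}")
```

Because the first candidate is the same for a given seed, increasing `n_candidates` can only lower or keep the selected distribution difference.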


2020 ◽  
pp. 1-21 ◽  
Author(s):  
Clément Dalloux ◽  
Vincent Claveau ◽  
Natalia Grabar ◽  
Lucas Emanuel Silva Oliveira ◽  
Claudia Maria Cabral Moro ◽  
...  

Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. In the biomedical domain especially, this task is important because negation plays an important role. In this work, two main contributions are proposed. First, we work with languages that have been poorly addressed up to now: Brazilian Portuguese and French. We therefore developed new corpora for these two languages, manually annotated to mark up the negation cues and their scope. Second, we propose automatic methods based on supervised machine learning approaches for the automatic detection of negation cues and their scopes. The methods prove to be robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical languages) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. In addition, the application is accessible and usable online. We assume that, through these contributions (new annotated corpora, application accessible online, and cross-domain robustness), the reproducibility of the results and the robustness of the NLP applications will be improved.

