scholarly journals Datasets for generic relation extraction

2011 ◽  
Vol 18 (1) ◽  
pp. 21-59 ◽  
Author(s):  
B. HACHEY ◽  
C. GROVER ◽  
R. TOBIN

AbstractA vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database or RDF triplestore that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and notions of what constitutes a relation. We describe the preparation of corpora for comparative evaluation of relation extraction across domains based on the publicly available ACE 2004, ACE 2005 and BioInfer data sets. We present a common document type using token standoff and including detailed linguistic markup, while maintaining all information in the original annotation. The subsequent reannotation process normalises the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. For the ACE data, we describe an automatic process that automatically converts many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, the first entity is mapped from ‘one’ to ‘Amidu Berry’ in the membership relation described in ‘Amidu Berry, one half of PBS’. Moreover, we describe a comparably reannotated version of the BioInfer corpus that flattens nested relations, maps part-whole to part-part relations and maps n-ary to binary relations. Finally, we summarise experiments that compare approaches to generic relation extraction, a knowledge discovery task that uses minimally supervised techniques to achieve maximally portable extractors. These experiments illustrate the utility of the corpora.1

2013 ◽  
Vol 10 (1) ◽  
pp. 79-104
Author(s):  
Guillem Rull ◽  
Carles Farré ◽  
Ernest Teniente ◽  
Toni Urpí

With the emergence of the Web and the wide use of XML for representing data, the ability to map not only flat relational but also nested data has become crucial. The design of schema mappings is a semi-automatic process. A human designer is needed to guide the process, choose among mapping candidates, and successively refine the mapping. The designer needs a way to figure out whether the mapping is what was intended. Our approach to mapping validation allows the designer to check whether the mapping satisfies certain desirable properties. In this paper, we focus on the validation of mappings between nested relational schemas, in which the mapping assertions are either inclusions or equalities of nested queries. We focus on the nested relational setting since most XML?s Document Type Definitions (DTDs) can be represented in this model. We perform the validation by reasoning on the schemas and mapping definition. We take into account the integrity constraints defined on both the source and target schema. We consider constraints and mapping?s queries which may contain arithmetic comparisons and negations. This class of mapping scenarios is significantly more expressive than the ones addressed by previous work on nested relational mapping validation. We encode the given mapping scenario into a single flat database schema, so we can take advantage of our previous work on validating flat relational mappings, and reformulate each desirable property check as a query satisfiability problem.


Author(s):  
Till F. Sonnemann ◽  
Douglas C. Comer ◽  
Jesse L. Patsolic ◽  
William P. Megarry ◽  
Eduardo Herrera Malatesta ◽  
...  

Satellite imagery has had limited application in the analysis of pre-colonial settlement archaeology in the Caribbean; visible evidence of wooden structures perishes quickly in tropical climates. Only slight topographic modifications remain, typically associated with middens. Nonetheless, surface scatters, as well as the soil characteristics they produce, can serve as quantifiable indicators of an archaeological site, which can be detected by analysis of remote sensing imagery. A variety of data sets were investigated, with the intention to combine multispectral bands to feed a direct detection algorithm, providing a semi-automatic process to cross-correlate the datasets. Sampling was done using locations of known sites, as well as areas with no archaeological evidence. The pre-processed very diverse remote sensing data sets have gone through a process of image registration. The algorithm was applied in the northwestern Dominican Republic on areas that included different types of environments, chosen for having sufficient imagery coverage, and a representative number of known locations of indigenous sites. The resulting maps present quantifiable statistical results of locations with similar pixel value combinations as the identified sites, indicating higher probability of archaeological evidence. The results show the variable potential of this method in diverse environments.


2020 ◽  
Vol 17 (6) ◽  
pp. 916-925
Author(s):  
Niyati Behera ◽  
Guruvayur Mahalakshmi

Attributes, whether qualitative or non-qualitative are the formal description of any real-world entity and are crucial in modern knowledge representation models like ontology. Though ample evidence for the amount of research done for mining non-qualitative attributes (like part-of relation) extraction from text as well as the Web is available in the wealth of literature, on the other side limited research can be found relating to qualitative attribute (i.e., size, color, taste etc.,) mining. Herein this research article an analytical framework has been proposed to retrieve qualitative attribute values from unstructured domain text. The research objective covers two aspects of information retrieval (1) acquiring quality values from unstructured text and (2) then assigning attribute to them by comparing the Google derived meaning or context of attributes as well as quality value (adjectives). The goal has been accomplished by using a framework which integrates Vector Space Modelling (VSM) with a probabilistic Multinomial Naive Bayes (MNB) classifier. Performance Evaluation has been carried out on two data sets (1) HeiPLAS Development Data set (106 adjective-noun exemplary phrases) and (2) a text data set in Medicinal Plant Domain (MPD). System is found to perform better with probabilistic approach compared to the existing pattern-based framework in the state of art


2020 ◽  
Vol 21 (4) ◽  
pp. 375-405
Author(s):  
Michael Färber ◽  
Adam Jatowt

Abstract Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction to automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities using various dimensions. Last but not least, we shed light on the evaluation methods and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles.


Author(s):  
Y. SARATH KUMAR ◽  
ESWAR KODALI ◽  
P. HARINI

In this paper we proposed a lexical-pattern-based approach to extract aliases of a given name. We use a set of names and their aliases as training data to extract lexical patterns that describe numerous ways in which information related to aliases of a name is presented on the web. An individual is typically referred by numerous name aliases on the web. Accurate identification of aliases of a given person name is useful in various web related tasks such as information retrieval, sentiment analysis, personal name disambiguation, and relation extraction. We propose a method to extract aliases of a given personal name from the web. Given a personal name, the proposed method first extracts a set of candidate aliases. Second, we rank the extracted candidates according to the likelihood of a candidate being a correct alias of the given name. We evaluate the proposed method on three data sets: an English personal names data set, an English place names data set, and a Japanese personal names data set. The proposed method outperforms numerous baselines and previously proposed name alias extraction methods, achieving a statistically significant mean reciprocal rank (MRR) of 0.67.


2012 ◽  
pp. 191-215
Author(s):  
Mona Sleem-Amer ◽  
Ivan Bigorgne ◽  
Stéphanie Brizard ◽  
Leeley Daio Pires Dos Santos ◽  
Yacine El Bouhairi ◽  
...  

Over the last years, research and industry players have become increasingly interested in analyzing opinions and sentiments expressed on the social media web for product marketing and business intelligence. In order to adapt to this need search engines not only have to be able to retrieve lists of documents but to directly access, analyze, and interpret topics and opinions. This article covers an intermediate phase of the ongoing industrial research project ’DoXa’ aiming at developing a semantic opinion and sentiment mining search engine for the French language. The DoXa search engine enables topic related opinion and sentiment extraction beyond positive and negative polarity using rich linguistic resources. Centering the work on two distinct business use cases, the authors analyze both unstructured Web 2.0 contents (e.g., blogs and forums) and structured questionnaire data sets. The focus is on discovering hidden patterns in the data. To this end, the authors present work in progress on opinion topic relation extraction and visual analytics, linguistic resource construction as well as the combination of OLAP technology with semantic search.


Sign in / Sign up

Export Citation Format

Share Document