Mining Free Text for Structure

Data Mining ◽  
2011 ◽  
pp. 278-300
Author(s):  
Vladimir A. Kulyukin ◽  
Robin Burke

Knowledge of the structural organization of information in documents can be of significant assistance to information systems that use documents as their knowledge bases. In particular, such knowledge is of use to information retrieval systems that retrieve documents in response to user queries. This chapter presents an approach to mining free-text documents for structure that is qualitative in nature. It complements the statistical and machine-learning approaches, insomuch as the structural organization of information in documents is discovered through mining free text for content markers left behind by document writers. The ultimate objective is to find scalable data mining (DM) solutions for free-text documents in exchange for modest knowledge-engineering requirements. The problem of mining free text for structure is addressed in the context of finding structural components of files of frequently asked questions (FAQs) associated with many USENET newsgroups. The chapter describes a system that mines FAQs for structural components. The chapter concludes with an outline of possible future trends in the structural mining of free text.
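The idea of discovering structure through content markers left by document writers can be illustrated with a minimal sketch. It assumes, purely for illustration, that questions and answers are introduced by line-initial `Q:` and `A:` markers; the function name `mine_faq` and both patterns are hypothetical and are not taken from the system the chapter describes, which would need to handle many more marker conventions.

```python
import re

# Hypothetical marker patterns (an assumption for this sketch):
# a question starts with "Q:" and an answer starts with "A:".
QUESTION_MARKER = re.compile(r"^\s*Q\s*:\s*(?P<text>.+)$")
ANSWER_MARKER = re.compile(r"^\s*A\s*:\s*(?P<text>.*)$")


def mine_faq(text):
    """Split FAQ text into (question, answer) pairs using line-initial markers."""
    pairs = []
    question, answer_lines = None, []
    for line in text.splitlines():
        q = QUESTION_MARKER.match(line)
        a = ANSWER_MARKER.match(line)
        if q:
            # A new question closes the previous question/answer pair.
            if question is not None:
                pairs.append((question, " ".join(answer_lines).strip()))
            question, answer_lines = q.group("text").strip(), []
        elif question is not None:
            # Answer text: strip the "A:" marker if present, keep continuation lines.
            body = a.group("text") if a else line
            if body.strip():
                answer_lines.append(body.strip())
    if question is not None:
        pairs.append((question, " ".join(answer_lines).strip()))
    return pairs


sample = (
    "Q: What is a FAQ?\n"
    "A: A list of frequently asked questions.\n"
    "It is posted regularly.\n"
    "\n"
    "Q: Who maintains it?\n"
    "A: Volunteers.\n"
)
for question, answer in mine_faq(sample):
    print(question, "->", answer)
```

Text that precedes the first question marker is ignored in this sketch, which is one simple way to skip FAQ headers.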

1983 ◽  
Vol 6 (5) ◽  
pp. 165-172 ◽  
Author(s):  
F.N. Teskey

In this paper the existing functions of, and a number of future requirements for, information retrieval systems are discussed. Two basic requirements for free text information retrieval systems have been identified; one for a more general information modelling language and the other for a simple user interface for complex ad-hoc queries. The paper describes some existing and proposed hardware and software methods for implementing free text information retrieval systems. Emphasis is placed on methods of improving the functionality of the system rather than on methods of increasing the performance. It is suggested that considerable improvements can be achieved by a more imaginative use of existing hardware, though it is realised that special purpose architectures will play an increasingly important role in information systems. The paper concludes with a design for a new information retrieval system based on the use of the Binary Relationship Model for information storage and retrieval, and an interactive graphical display for the user interface.


2018 ◽  
Vol 45 (6) ◽  
pp. 756-766 ◽  
Author(s):  
Gustavo Candela ◽  
Pilar Escobar ◽  
Rafael C Carrasco ◽  
Manuel Marco-Such

Cultural heritage institutions have recently begun to consider the benefits of sharing their collections using linked open data to disseminate and enrich their metadata. As datasets become very large, challenges appear, such as ingestion, management, querying and enrichment. Furthermore, each institution has particular features related to important aspects such as vocabularies and interoperability, which make it difficult to generalise this process and provide one-for-all solutions. In order to improve the user experience as regards information retrieval systems, researchers have identified that further refinements are required for the recognition and extraction of implicit relationships expressed in natural language. We introduce a framework for the enrichment and disambiguation of locations in text using open knowledge bases such as Wikidata and GeoNames. The framework has been successfully used to publish a dataset based on information from the Biblioteca Virtual Miguel de Cervantes, thus illustrating how semantic enrichment can help information retrieval. The methods applied in order to automate the enrichment process, which build upon open source software components, are described herein.


2018 ◽  
Vol 18 (1) ◽  
pp. 95-108 ◽  
Author(s):  
Klesti Hoxha ◽  
Artur Baxhaku

Named Entity Recognition (NER) is an important task in many NLP pipelines. It has become especially important for the knowledge bases that power many of today's information retrieval systems. In order to cope with the high demand for annotated training corpora for supervised NER systems, automatic generation approaches have been proposed. In this paper we report on the first automatically generated NE-annotated corpus for Albanian. News articles from Albanian news media were used as a document source. They were automatically tagged using a custom gazetteer generated from the Albanian Wikipedia. Our evaluation results show that this corpus can be used as a baseline corpus for human-annotated ones or as a training corpus where no other is available.
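Gazetteer-based tagging of the kind mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the gazetteer here is a small hand-written dictionary rather than one generated from Wikipedia, matching is exact and case-sensitive, and the function name `tag_with_gazetteer` and the labels are hypothetical, not the authors' pipeline.

```python
import re


def tag_with_gazetteer(text, gazetteer):
    """Return sorted (start, end, surface, label) spans for gazetteer entries in text.

    Longer entries are tried first so that a full name wins over a
    shorter entry contained in it.
    """
    spans = []
    taken = [False] * len(text)  # characters already covered by a match
    for name in sorted(gazetteer, key=len, reverse=True):
        # Require non-word characters (or text boundaries) around the entry.
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        for m in pattern.finditer(text):
            if not any(taken[m.start():m.end()]):
                spans.append((m.start(), m.end(), m.group(), gazetteer[name]))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(spans)


# Hypothetical gazetteer entries, chosen only for this example.
gazetteer = {"Tirana": "LOC", "Ismail Kadare": "PER", "Kadare": "PER"}
for span in tag_with_gazetteer("Ismail Kadare visited Tirana.", gazetteer):
    print(span)
```

Preferring the longest match is a design choice of this sketch: it avoids tagging "Kadare" separately inside "Ismail Kadare", at the cost of missing inflected or misspelled forms that a real corpus would contain.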


Author(s):  
Vladimir A. Kulyukin ◽  
John A. Nicholson

The advent of the World Wide Web has resulted in the creation of millions of documents containing unstructured, structured and semi-structured data. Consequently, research on structural text mining has come to the forefront of both information retrieval and natural language processing (Cardie, 1997; Freitag, 1998; Hammer, Garcia-Molina, Cho, Aranha, & Crespo, 1997; Hearst, 1992; Hsu & Chang, 1999; Jacquemin & Bush, 2000; Kushmerick, Weld, & Doorenbos, 1997). Knowledge of how information is organized and structured in texts can be of significant assistance to information systems that use documents as their knowledge bases (Appelt, 1999). In particular, such knowledge is of use to information retrieval systems (Salton & McGill, 1983) that retrieve documents in response to user queries and to systems that use texts to construct domain-specific ontologies or thesauri (Ruge, 1997).


1994 ◽  
Vol 33 (05) ◽  
pp. 454-463 ◽  
Author(s):  
A. M. van Ginneken ◽  
J. van der Lei ◽  
J. H. van Bemmel ◽  
P. W. Moorman

Clinical narratives in patient records are usually recorded in free text, limiting the use of this information for research, quality assessment, and decision support. This study focuses on the capture of clinical narratives in a structured format by supporting physicians with structured data entry (SDE). We analyzed and made explicit which requirements SDE should meet to be acceptable for the physician on the one hand, and generate unambiguous patient data on the other. Starting from these requirements, we found that in order to support SDE, the knowledge on which it is based needs to be made explicit: we refer to this knowledge as descriptional knowledge. We articulate the nature of this knowledge, and propose a model in which it can be formally represented. The model allows the construction of specific knowledge bases, each representing the knowledge needed to support SDE within a circumscribed domain. Data entry is made possible through a general entry program, of which the behavior is determined by a combination of user input and the content of the applicable domain knowledge base. We clarify how descriptional knowledge is represented, modeled, and used for data entry to achieve SDE, which meets the proposed requirements.


1967 ◽  
Vol 06 (02) ◽  
pp. 45-51 ◽  
Author(s):  
A. Kent ◽  
J. Belzer ◽  
M. Kuhfeerst ◽  
E. D. Dym ◽  
D. L. Shirey ◽  
...  

An experiment is described which attempts to derive quantitative indicators regarding the potential relevance predictability of the intermediate stimuli used to represent documents in information retrieval systems. In effect, since the decision to peruse an entire document is often predicated upon the examination of one »level of processing« of the document (e.g., the citation and/or abstract), it became interesting to analyze the properties of what constitutes »relevance«. However, prior to such an analysis, an even more elementary step had to be made, namely, to determine what portions of a document should be examined. An evaluation of the ability of intermediate response products (IRPs), functioning as cues to the information content of full documents, to predict the relevance determination that would be subsequently made on these documents by motivated users of information retrieval systems, was made under controlled experimental conditions. The hypothesis that there might be other intermediate response products (selected extracts from the document, i.e., first paragraph, last paragraph, and the combination of first and last paragraph), that would be as representative of the full document as the traditional IRPs (citation and abstract) was tested systematically. The results showed that: 1. there is no significant difference among the several IRP treatment groups on the number of cue evaluations of relevancy which match the subsequent user relevancy decision on the document; 2. first and last paragraph combinations have consistently predicted relevancy to a higher degree than the other IRPs; 3. abstracts were undistinguished as predictors; and 4. the apparent high predictability rating for citations was not substantive. Some of these results are quite different than would be expected from previous work with unmotivated subjects.

