Mining Free Text for Structure

Ultimate Objective

Knowledge of the structural organization of information in documents can be of significant assistance to information systems that use documents as their knowledge bases. In particular, such knowledge is of use to information retrieval systems that retrieve documents in response to user queries. This chapter presents a qualitative approach to mining free-text documents for structure. It complements the statistical and machine-learning approaches in that the structural organization of information in documents is discovered by mining free text for content markers left behind by document writers. The ultimate objective is to find scalable data mining (DM) solutions for free-text documents in exchange for modest knowledge-engineering requirements. The problem of mining free text for structure is addressed in the context of finding structural components of files of frequently asked questions (FAQs) associated with many USENET newsgroups. The chapter describes a system that mines FAQs for structural components and concludes with an outline of possible future trends in the structural mining of free text.
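The idea of mining for content markers can be illustrated with a minimal Python sketch. It is not the system described in the chapter: the marker patterns (`Q:`, `A:`, and numbered headings such as `2.` or `1.3)`), the names `QUESTION_MARKER`, `ANSWER_MARKER` and `split_faq`, and the sample text are all assumptions made for illustration only.

```python
import re
from typing import List, Tuple

# Hypothetical content markers; real FAQs use many more conventions.
# A question line starts with "Q:" (or "Q.", "Q)") or a number like "2." / "1.3)".
QUESTION_MARKER = re.compile(
    r"^\s*(?:Q\s*[:.)]|\d+(?:\.\d+)*[.)])\s+(?P<text>.+)$", re.IGNORECASE
)
# An answer line may optionally start with "A:" (or "A.", "A)").
ANSWER_MARKER = re.compile(r"^\s*A\s*[:.)]\s*(?P<text>.*)$", re.IGNORECASE)


def split_faq(text: str) -> List[Tuple[str, str]]:
    """Split FAQ text into (question, answer) pairs using the markers above."""
    pairs: List[Tuple[str, str]] = []
    question = None
    answer_lines: List[str] = []
    for line in text.splitlines():
        q = QUESTION_MARKER.match(line)
        if q:
            # A new question marker closes the previous question/answer pair.
            if question is not None:
                pairs.append((question, " ".join(answer_lines)))
            question = q.group("text").strip()
            answer_lines = []
            continue
        if question is None:
            continue  # skip any preamble before the first question
        a = ANSWER_MARKER.match(line)
        part = (a.group("text") if a else line).strip()
        if part:  # ignore blank lines inside an answer
            answer_lines.append(part)
    if question is not None:
        pairs.append((question, " ".join(answer_lines)))
    return pairs


sample = (
    "Intro line\n"
    "Q: What is a FAQ?\n"
    "A: A list of questions.\n"
    "It has answers.\n"
    "\n"
    "2. Where are FAQs posted?\n"
    "In newsgroups.\n"
)
for question, answer in split_faq(sample):
    print(question, "->", answer)
```

The sketch relies only on hand-written patterns, which reflects the trade-off named above: modest knowledge engineering in place of training data. Its coverage is limited to the marker conventions that have been anticipated.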

Proceedings of the 11th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '88 ◽

Conceptual representation for knowledge bases and information retrieval systems

10.1145/62437.62497 ◽

1988 ◽

Cited By ~ 1

Author(s):

G. P. Zarri

Keyword(s):

Information Retrieval ◽

Knowledge Bases ◽

Conceptual Representation ◽

Retrieval Systems ◽

Intelligent Information Retrieval ◽

Intelligent Information

Bayesian approach to incorporating different types of biomedical knowledge bases into information retrieval systems for clinical decision support in precision medicine

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2019.103238 ◽

2019 ◽

Vol 98 ◽

pp. 103238 ◽

Cited By ~ 1

Author(s):

Saeid Balaneshinkordan ◽

Alexander Kotov

Keyword(s):

Information Retrieval ◽

Decision Support ◽

Clinical Decision Support ◽

Bayesian Approach ◽

Clinical Decision ◽

Knowledge Bases ◽

Biomedical Knowledge ◽

Retrieval Systems ◽

Different Types ◽

Implementing the basic functions of free text information retrieval using binary relationships

Journal of Information Science ◽

10.1177/016555158300600504 ◽

1983 ◽

Vol 6 (5) ◽

pp. 165-172 ◽

Cited By ~ 2

Author(s):

F.N. Teskey

Keyword(s):

Information Retrieval ◽

User Interface ◽

Ad Hoc ◽

General Information ◽

Free Text ◽

Retrieval Systems ◽

Relationship Model ◽

Text Information ◽

Text Information Retrieval

In this paper the existing functions of, and a number of future requirements for, information retrieval systems are discussed. Two basic requirements for free text information retrieval systems have been identified; one for a more general information modelling language and the other for a simple user interface for complex ad-hoc queries. The paper describes some existing and proposed hardware and software methods for implementing free text information retrieval systems. Emphasis is placed on methods of improving the functionality of the system rather than on methods of increasing the performance. It is suggested that considerable improvements can be achieved by a more imaginative use of existing hardware, though it is realised that special purpose architectures will play an increasingly important role in information systems. The paper concludes with a design for a new information retrieval system based on the use of the Binary Relationship Model for information storage and retrieval, and an interactive graphical display for the user interface.

A linked open data framework to enhance the discoverability and impact of culture heritage

Journal of Information Science ◽

10.1177/0165551518812658 ◽

2018 ◽

Vol 45 (6) ◽

pp. 756-766 ◽

Cited By ~ 4

Author(s):

Gustavo Candela ◽

Pilar Escobar ◽

Rafael C Carrasco ◽

Manuel Marco-Such

Keyword(s):

Information Retrieval ◽

Open Data ◽

Knowledge Bases ◽

Linked Open Data ◽

Miguel De Cervantes ◽

Data Framework ◽

Retrieval Systems ◽

All Solutions ◽

Culture Heritage

Cultural heritage institutions have recently begun to consider the benefits of sharing their collections using linked open data to disseminate and enrich their metadata. As datasets become very large, challenges appear, such as ingestion, management, querying and enrichment. Furthermore, each institution has particular features related to important aspects such as vocabularies and interoperability, which make it difficult to generalise this process and provide one-for-all solutions. In order to improve the user experience as regards information retrieval systems, researchers have identified that further refinements are required for the recognition and extraction of implicit relationships expressed in natural language. We introduce a framework for the enrichment and disambiguation of locations in text using open knowledge bases such as Wikidata and GeoNames. The framework has been successfully used to publish a dataset based on information from the Biblioteca Virtual Miguel de Cervantes, thus illustrating how semantic enrichment can help information retrieval. The methods applied in order to automate the enrichment process, which build upon open source software components, are described herein.

A knowledge representation language for large knowledge bases and “intelligent” information retrieval systems

Information Processing & Management ◽

10.1016/0306-4573(90)90096-k ◽

1990 ◽

Vol 26 (3) ◽

pp. 349-370 ◽

Cited By ~ 2

Author(s):

Gian Piero Zarri

Keyword(s):

Information Retrieval ◽

Knowledge Representation ◽

Knowledge Bases ◽

Retrieval Systems ◽

Intelligent Information Retrieval ◽

Knowledge Representation Language ◽

Representation Language ◽

Intelligent Information

An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition

Cybernetics and Information Technologies ◽

10.2478/cait-2018-0009 ◽

2018 ◽

Vol 18 (1) ◽

pp. 95-108 ◽

Cited By ~ 1

Author(s):

Klesti Hoxha ◽

Artur Baxhaku

Keyword(s):

News Media ◽

Named Entity Recognition ◽

Automatic Generation ◽

Knowledge Bases ◽

Entity Recognition ◽

High Demand ◽

Training Corpus ◽

Named Entity ◽

Retrieval Systems ◽

Named Entity Recognition (NER) is an important task in many NLP pipelines. It has become especially important for the knowledge bases that power many of today's information retrieval systems. In order to cope with the high demand for annotated training corpora for supervised NER systems, automatic generation approaches have been proposed. In this paper we report on the first automatically generated NE-annotated corpus for Albanian. News articles from Albanian news media were used as a document source. They were automatically tagged using a custom gazetteer generated from the Albanian Wikipedia. Our evaluation results show that this corpus can be used as a baseline corpus for human-annotated ones or as a training corpus where no other is available.

Structural Text Mining

Encyclopedia of Information Science and Technology, First Edition ◽

10.4018/978-1-59140-553-5.ch472 ◽

2005 ◽

pp. 2658-2661

Author(s):

Vladimir A. Kulyukin ◽

John A. Nicholson

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Text Mining ◽

Language Processing ◽

World Wide ◽

Knowledge Bases ◽

Domain Specific ◽

Retrieval Systems ◽

The World ◽

The advent of the World Wide Web has resulted in the creation of millions of documents containing unstructured, structured and semi-structured data. Consequently, research on structural text mining has come to the forefront of both information retrieval and natural language processing (Cardie, 1997; Freitag, 1998; Hammer, Garcia-Molina, Cho, Aranha, & Crespo, 1997; Hearst, 1992; Hsu & Chang, 1999; Jacquemin & Bush, 2000; Kushmerick, Weld, & Doorenbos, 1997). Knowledge of how information is organized and structured in texts can be of significant assistance to information systems that use documents as their knowledge bases (Appelt, 1999). In particular, such knowledge is of use to information retrieval systems (Salton & McGill, 1983) that retrieve documents in response to user queries and to systems that use texts to construct domain-specific ontologies or thesauri (Ruge, 1997).

A Model for Structured Data Entry Based on Explicit Descriptional Knowledge

Methods of Information in Medicine ◽

10.1055/s-0038-1635050 ◽

1994 ◽

Vol 33 (05) ◽

pp. 454-463 ◽

Cited By ~ 22

Author(s):

A. M. van Ginneken ◽

J. van der Lei ◽

J. H. van Bemmel ◽

P. W. Moorman

Keyword(s):

Domain Knowledge ◽

Data Entry ◽

Knowledge Bases ◽

Structured Data ◽

Research Quality ◽

Free Text ◽

Specific Knowledge ◽

User Input ◽

The One ◽

Research Quality Assessment

Clinical narratives in patient records are usually recorded in free text, limiting the use of this information for research, quality assessment, and decision support. This study focuses on the capture of clinical narratives in a structured format by supporting physicians with structured data entry (SDE). We analyzed and made explicit which requirements SDE should meet to be acceptable for the physician on the one hand, and generate unambiguous patient data on the other. Starting from these requirements, we found that in order to support SDE, the knowledge on which it is based needs to be made explicit: we refer to this knowledge as descriptional knowledge. We articulate the nature of this knowledge, and propose a model in which it can be formally represented. The model allows the construction of specific knowledge bases, each representing the knowledge needed to support SDE within a circumscribed domain. Data entry is made possible through a general entry program, of which the behavior is determined by a combination of user input and the content of the applicable domain knowledge base. We clarify how descriptional knowledge is represented, modeled, and used for data entry to achieve SDE, which meets the proposed requirements.

Relevance Predictability in Information Retrieval Systems

Methods of Information in Medicine ◽

10.1055/s-0038-1636254 ◽

1967 ◽

Vol 06 (02) ◽

pp. 45-51 ◽

Cited By ~ 6

Author(s):

A. Kent ◽

J. Belzer ◽

M. Kuhfeerst ◽

E. D. Dym ◽

D. L. Shirey ◽

...

Keyword(s):

Information Retrieval ◽

Experimental Conditions ◽

Treatment Groups ◽

Retrieval Systems ◽

Significant Difference ◽

High Predictability ◽

Intermediate Response ◽

Quantitative Indicators ◽

Level Of Processing

An experiment is described which attempts to derive quantitative indicators regarding the potential relevance predictability of the intermediate stimuli used to represent documents in information retrieval systems. In effect, since the decision to peruse an entire document is often predicated upon the examination of one »level of processing« of the document (e.g., the citation and/or abstract), it became interesting to analyze the properties of what constitutes »relevance«. However, prior to such an analysis, an even more elementary step had to be made, namely, to determine what portions of a document should be examined. An evaluation of the ability of intermediate response products (IRPs), functioning as cues to the information content of full documents, to predict the relevance determination that would be subsequently made on these documents by motivated users of information retrieval systems, was made under controlled experimental conditions. The hypothesis that there might be other intermediate response products (selected extracts from the document, i.e., first paragraph, last paragraph, and the combination of first and last paragraph), that would be as representative of the full document as the traditional IRPs (citation and abstract) was tested systematically. The results showed that: 1. there is no significant difference among the several IRP treatment groups on the number of cue evaluations of relevancy which match the subsequent user relevancy decision on the document; 2. first and last paragraph combinations have consistently predicted relevancy to a higher degree than the other IRPs; 3. abstracts were undistinguished as predictors; and 4. the apparent high predictability rating for citations was not substantive. Some of these results are quite different than would be expected from previous work with unmotivated subjects.

Methods for Evaluating Interactive Information Retrieval Systems with Users

10.1561/9781601982254 ◽

2007 ◽

Author(s):

Diane Kelly

Keyword(s):

Information Retrieval ◽

Interactive Information Retrieval ◽

Retrieval Systems ◽