Information retrieval for unstructured text documents in Serbian into the crime domain

Author(s):  
Vojkan Nikolic ◽  
Branko Markoski ◽  
Miodrag Ivkovic ◽  
Kristijan Kuk ◽  
Predrag Djikanovic
Author(s):  
Byung-Kwon Park ◽  
Il-Yeol Song

As the amount of data grows very fast inside and outside of an enterprise, it is getting important to seamlessly analyze both data types for total business intelligence. The data can be classified into two categories: structured and unstructured. For getting total business intelligence, it is important to seamlessly analyze both of them. Especially, as most of business data are unstructured text documents, including the Web pages in Internet, we need a Text OLAP solution to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey the representative works selected for demonstrating how the technologies of text mining and information retrieval can be applied for multidimensional analysis of text documents, because they are major technologies handling text data. And then, we survey the representative works selected for demonstrating how we can associate and consolidate both unstructured text documents and structured relation data for obtaining total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect the proposed total heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies all together, including relational OLAP technologies, would make a better platform toward total business intelligence.


2011 ◽  
Vol 20 (06) ◽  
pp. 1157-1170 ◽  
Author(s):  
BOUCHRA FRIKH ◽  
AHMED SAID DJAANFAR ◽  
BRAHIM OUHBI

Resources like ontologies are used in a number of applications, including natural language processing, information retrieval(especially from the Internet). Different methods have been proposed to build such resources. This paper proposes a new method to extract information from the Web to build a taxonomy of terms and Web resources for a given domain. Firstly, a (CHIR) method is used to identify candidat terms. Then a similarity (SIM) measure is introduced to select relevant concepts to build the ontology. Our new algorithm, called (CHIRSIM), is easy to implement and can be efficiently integrated into an information retrieval system to help improve the retrieval performance. Experimental results show that the proposed approach can effectively and efficiently construct a cancer domain ontology from unstructured text documents.


Author(s):  
Sijia Liu ◽  
Yanshan Wang ◽  
Andrew Wen ◽  
Liwei Wang ◽  
Na Hong ◽  
...  

BACKGROUND Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. OBJECTIVE In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text—Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). METHODS CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support the common data model concept search utilizing information retrieval techniques and frameworks. RESULTS Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient-level and document-level, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text with mean precision at 5 values of 0.54 and 0.74, respectively. CONCLUSIONS The implementation and evaluation of Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that only use one of either structured data or unstructured text in complex textual cohort queries.


Author(s):  
Furkan Goz ◽  
Alev Mutlu

Keyword indexing is the problem of assigning keywords to text documents. It is an important task as keywords play crucial roles in several information retrieval tasks. The problem is also challenging as the number of text documents is increasing, and such documents come in different forms (i.e., scientific papers, online news articles, and microblog posts). This chapter provides an overview of keyword indexing and elaborates on keyword extraction techniques. The authors provide the general motivations behind the supervised and the unsupervised keyword extraction and enumerate several pioneering and state-of-the-art techniques. Feature engineering, evaluation metrics, and benchmark datasets used to evaluate the performance of keyword extraction systems are also discussed.


2010 ◽  
Vol 37 (12) ◽  
pp. 8348-8358 ◽  
Author(s):  
Francisco Bueno ◽  
Ana García-Serrano ◽  
José L. Martínez-Fernández

2016 ◽  
Vol 9 (2) ◽  
pp. 88
Author(s):  
Saiful Bahri Musa ◽  
Andi Baso Kaswar ◽  
Supria Supria ◽  
Susiana Sari

One of ways to facilitate process of information retrieval is by performing clustering toward collection of the existing documents. The existing text documents are often unstructured. The forms are varied and their groupings are ambiguous. This cases cause difficulty on information retrieval process. Moreover, every second new documents emerge and need to be clustered. Generally, static document clustering method performs clustering of document after whole documents are collected. However, performing re-clustering toward whole documents when new document arrives causes inefficient clustering process. In this paper, we proposed a new method for document clustering with dynamic hierarchy algorithm based on fuzzy set type - II from frequent itemset. To achieve the goals, there are three main phases, namely: determination of key-term, the extraction of candidates clusters and cluster hierarchical construction. Based on the experiment, it resulted the value of F-measure 0.40 for Newsgroup, 0.62 for Classic and 0.38 for Reuters. Meanwhile, time of computation when addition of new document is lower than to the previous static method. The result shows that this method is suitable to produce solution of clustering with hierarchy in dynamical environment effectively and efficiently. This method also gives accurate clustering result.


Sign in / Sign up

Export Citation Format

Share Document