FINDING STRUCTURED AND UNSTRUCTURED FEATURES TO IMPROVE THE SEARCH RESULT OF COMPLEX QUESTION

2012 ◽  
Vol 5 (2) ◽  
pp. 63
Author(s):  
Dewi Wisnu Wardani

Current research on question answering usually derives answers only from unstructured text resources such as collections of news articles or web pages. According to our observation of Yahoo!Answer, users sometimes ask complex natural language questions that contain both structured and unstructured features. In general, answering such complex questions requires considering not only unstructured but also structured resources. In this work we propose a new approach to improve the accuracy of answers to complex questions by recognizing the structured and unstructured features of a question and integrating both kinds of resources on the web. Our framework consists of three parts: Question Analysis, Resource Discovery, and Analysis of the Relevant Answer. In Question Analysis we apply a few assumptions and try to identify the structured and unstructured features of the question. In Resource Discovery we integrate structured data (a relational database) and unstructured data (web pages) to take advantage of both kinds of data in arriving at correct answers. In the Relevant Answer part we find the best top fragments from the context of the relevant web pages, compute a matching score between the results from the structured and unstructured data, and finally use QA templates to reformulate the question.
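The score-matching step between structured and unstructured results can be sketched as follows. This is a minimal stdlib-only illustration, not the authors' implementation; the candidate answers, their scores, and the blending weight `alpha` are invented for the example.

```python
# Sketch of blending per-candidate answer scores from a structured
# resource (relational database) and an unstructured one (web pages),
# as in the framework's Relevant Answer step. All names, scores, and
# the weight `alpha` are illustrative assumptions.

def combine_scores(structured, unstructured, alpha=0.5):
    """Blend normalized scores from the two resource types.

    `structured` / `unstructured`: dicts mapping candidate answers to
    relevance scores in [0, 1]. A candidate found in only one source
    contributes 0.0 from the other.
    """
    candidates = set(structured) | set(unstructured)
    combined = {
        c: alpha * structured.get(c, 0.0) + (1 - alpha) * unstructured.get(c, 0.0)
        for c in candidates
    }
    # Rank candidates by combined score, best first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

structured_scores = {"Jakarta": 0.9, "Bandung": 0.4}
unstructured_scores = {"Jakarta": 0.7, "Surabaya": 0.6}
ranking = combine_scores(structured_scores, unstructured_scores)
print(ranking[0][0])  # top-ranked answer agreed on by both sources
```

A candidate supported by both sources outranks one that scores well in only a single source, which is the intuition behind combining the two result sets.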

Author(s):  
Sijia Liu ◽  
Yanshan Wang ◽  
Andrew Wen ◽  
Liwei Wang ◽  
Na Hong ◽  
...  

BACKGROUND Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. OBJECTIVE In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text—Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). METHODS CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance, using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support common data model concept search utilizing information retrieval techniques and frameworks. RESULTS Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient level and the document level, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text, with mean precision at 5 values of 0.54 and 0.74, respectively. CONCLUSIONS The implementation and evaluation on Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that use only one of structured data or unstructured text in complex textual cohort queries.
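The precision-at-5 metric used in the CREATE evaluation is straightforward to compute; the sketch below uses invented patient IDs and is not code from the system itself.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Hypothetical ranking for one cohort query: 4 of the top 5 retrieved
# patients are truly in the cohort, giving P@5 = 0.8. The paper's
# reported 0.90 is the mean of this metric over 5 such queries.
ranked = ["p12", "p03", "p41", "p77", "p09", "p55"]
relevant = {"p12", "p03", "p77", "p09"}
print(precision_at_k(ranked, relevant))  # 0.8
```

The same function applies at the document level by ranking note IDs instead of patient IDs.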


2020 ◽  
Vol 59 (S 02) ◽  
pp. e64-e78
Author(s):  
Antje Wulff ◽  
Marcel Mast ◽  
Marcus Hassler ◽  
Sara Montag ◽  
Michael Marschollek ◽  
...  

Abstract Background Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format to enable multiple uses of the data also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, known as natural language processing (NLP), has been researched extensively at least for the English language, it is not enough to get a structured output in just any format. NLP techniques need to be used together with clinical information standards such as openEHR to be able to sensibly reuse and exchange still-unstructured data. Objectives The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories. Methods We constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as the expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School. Results We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes.
A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall. Conclusion The use of NLP and openEHR archetypes was demonstrated to be a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with the potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.
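The dictionary-plus-rules design of such a pipeline can be sketched in miniature as below. The marker patterns, the archetype IDs, and the sample German history are invented for illustration; the actual pipeline used 3,055 marker entries and 123 mapping rules.

```python
import re

# Minimal sketch of a dictionary/regex NLP step that maps German
# free-text markers onto openEHR archetype targets. The marker list,
# archetype IDs, concept labels, and sample text are all invented.

MARKERS = {
    r"\bFr[üu]hgeburt\b": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1",
                           "premature_birth"),
    r"\bBeatmung\b": ("openEHR-EHR-ACTION.procedure.v1",
                      "mechanical_ventilation"),
}

def extract(text):
    """Return (archetype, concept) pairs for every marker found in `text`."""
    found = []
    for pattern, target in MARKERS.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(target)
    return found

history = "Zustand nach Frühgeburt in der 29. SSW, zeitweise Beatmung nötig."
for archetype, concept in extract(history):
    print(archetype, concept)
```

In the real pipeline, the extracted snippets would then pass through the mapping rules to populate archetype elements in the openEHR repository rather than being printed.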


2019 ◽  
Vol 19 (1) ◽  
Author(s):  
Simon Geletta ◽  
Lendie Follett ◽  
Marcia Laugerman

Abstract Background This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns within research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings have reported that at least 10% of all studies funded by major research funding agencies terminate without yielding useful results. Since it is well known that scientific studies that receive funding from major funding agencies are carefully planned and rigorously vetted through the peer-review process, it was somewhat daunting to us that study terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for them are not well understood. We therefore aimed to address that knowledge gap by seeking to identify the factors that contribute to study failures. Method We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative descriptions of the studies). We applied natural language processing techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with trials that are terminated versus trials that are completed. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 “topics” with corresponding sets of probabilities, which we then used to predict study termination with random forest modeling. We fit two distinct models: one using only structured data as predictors, and another with both structured data and the 25 text topics derived from the unstructured data. Results In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure.
The results also demonstrate that the combined modeling approach yields robust predictive probabilities in terms of both sensitivity and specificity, relative to a model that utilizes the structured data alone. Conclusions Our study demonstrated that topic modeling using LDA significantly raises the utility of unstructured data in better predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
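The input to the combined model is, in essence, the structured study characteristics concatenated with the 25 LDA topic proportions for each trial's narrative. The stdlib-only sketch below shows that feature-assembly step with invented covariates and a uniform topic mix; fitting the LDA and the random forest themselves would be done with standard ML libraries and is not shown.

```python
# Sketch of building the feature vector for the combined model:
# structured study characteristics concatenated with the 25 LDA topic
# proportions derived from the trial narrative. Covariate names and
# values are invented; the real study fit a random forest on these.

N_TOPICS = 25

def combined_features(structured, topic_probs):
    """Concatenate structured covariates with LDA topic proportions."""
    if len(topic_probs) != N_TOPICS:
        raise ValueError(f"expected {N_TOPICS} topic proportions")
    if abs(sum(topic_probs) - 1.0) > 1e-6:
        raise ValueError("topic proportions must sum to 1")
    return list(structured) + list(topic_probs)

# Hypothetical trial: [log enrollment, number of sites, industry-funded flag]
structured = [5.3, 12, 1]
topics = [1.0 / N_TOPICS] * N_TOPICS  # uniform topic mix, for illustration
features = combined_features(structured, topics)
print(len(features))  # 28
```

The structured-only baseline model would simply omit the 25 topic columns from this vector.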


Author(s):  
Louis Massey ◽  
Wilson Wong

This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will always have limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector-space and ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification that deviates radically from current techniques used for document classification, text clustering, and concept discovery is proposed. This approach is inspired by human cognition, which allows ‘meaning’ to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift allows for the exploitation rather than avoidance of dependence between terms to derive meaning without the complexity introduced by conventional natural language processing techniques. Using the unstructured texts in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies that are required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.


10.29007/fvc9 ◽  
2019 ◽  
Author(s):  
Gautam Kishore Shahi ◽  
Durgesh Nandini ◽  
Sushma Kumari

Schema.org creates, supports, and maintains schemas for structured data on web pages. For a non-technical author, it is difficult to publish content in a structured format. This work presents an automated way of inducing Schema.org markup from the natural language content of web pages by applying knowledge base creation techniques. Web Data Commons was used as the dataset, and the scope of the experimental part was limited to RDFa. The approach was implemented using the knowledge graph building techniques Knowledge Vault and KnowMore.
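Once entities have been induced from the text, serializing them as Schema.org RDFa is mechanical. The sketch below shows one way such markup could be emitted; the entity values and property mapping are invented, and this is not the paper's implementation.

```python
from html import escape

# Minimal sketch of emitting Schema.org RDFa markup for an entity
# extracted from free text. Entity type, properties, and values are
# invented for illustration.

def to_rdfa(entity_type, properties):
    """Render a dict of Schema.org properties as an RDFa-annotated <div>."""
    rows = "\n".join(
        f'  <span property="{escape(prop)}">{escape(value)}</span>'
        for prop, value in properties.items()
    )
    return (f'<div vocab="https://schema.org/" typeof="{escape(entity_type)}">\n'
            f"{rows}\n</div>")

markup = to_rdfa("Person", {"name": "Ada Lovelace", "jobTitle": "Mathematician"})
print(markup)
```

The `vocab`, `typeof`, and `property` attributes are the core RDFa hooks a crawler such as Web Data Commons extracts triples from.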


2021 ◽  
Vol 8 (1) ◽  
pp. 421-429
Author(s):  
Yan Puspitarani

Information extraction is a part of natural language processing that aims to find, retrieve, or process information, with text as its data source. Text cannot be separated from people's daily lives, and a great deal of important information can be obtained through it. To produce information, unstructured text is converted into structured data. Researchers have taken many approaches to this process, most of them for English. This paper therefore presents current research trends, challenges, and opportunities in information extraction for Indonesian.


The Semantic Web is not just a matter of translating HTML into the RDF/OWL languages. It is a matter of understanding the content of the web through knowledge graphs, in which entities are connected by relationships. This content is composed of resources (web pages) that contain, for example, text, images, and audio, so entities need to be extracted from these resources. Currently, most web content is in HTML5, a W3C recommendation that describes document structure only marginally with the help of annotations. The main challenge here is to transform unstructured data in plain HTML files into structured data (e.g., RDF or OWL). The current work provides first-hand information on dealing with unstructured heterogeneous data residing on the web using Twinkle, a Java tool for SPARQL query execution, on a FOAF (Friend of a Friend) document.
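At its core, a SPARQL SELECT over a FOAF document is triple-pattern matching against a set of (subject, predicate, object) statements. The stdlib-only sketch below illustrates that mechanism with invented triples; a real setup would load the RDF and run the query with Twinkle or another SPARQL engine.

```python
# Stdlib sketch of the triple-pattern matching underlying a SPARQL
# SELECT over a FOAF document. The triples are invented; this is not
# a SPARQL engine, just an illustration of the matching step.

FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    ("#alice", FOAF + "name", "Alice"),
    ("#alice", FOAF + "knows", "#bob"),
    ("#bob", FOAF + "name", "Bob"),
]

def match(pattern):
    """Return all triples matching an (s, p, o) pattern; None is a variable."""
    s, p, o = pattern
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Equivalent of: SELECT ?who WHERE { <#alice> foaf:knows ?who }
friends = [o for (_, _, o) in match(("#alice", FOAF + "knows", None))]
print(friends)
```

A full SPARQL engine additionally joins bindings across multiple patterns, but each basic graph pattern reduces to this kind of match.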

