Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries

The purpose of this study is to develop a predictive model that increases the accuracy of business operational planning using data from a small business. By using Machine Learning (ML) techniques feature expansion, resampling, and combination techniques, it was possible to address several existing limitations in the available research. Then, the use of the novel technique of feature engineering allowed us to increase the accuracy of the model by finding 10 new features derived from the original ones and constructed automatically through the nonlinear relationships found between them. Finally, we built a rule-based classifier to predict the store's revenue with high accuracy. The results show the proposed approach open new possibilities for ML research applied to small and medium businesses.

Download Full-text

The Impact of Pretrained Language Models on Negation and Speculation Detection in Cross-Lingual Medical Text: Comparative Study

JMIR Medical Informatics ◽

10.2196/18953 ◽

2020 ◽

Vol 8 (12) ◽

pp. e18953

Author(s):

Renzo Rivera Zavala ◽

Paloma Martinez

Keyword(s):

Machine Learning ◽

Information Extraction ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Special Focus ◽

Rule Based ◽

Clinical Narrative ◽

Cross Lingual ◽

The Impact

Background Negation and speculation are critical elements in natural language processing (NLP)-related tasks, such as information extraction, as these phenomena change the truth value of a proposition. In the clinical narrative that is informal, these linguistic facts are used extensively with the objective of indicating hypotheses, impressions, or negative findings. Previous state-of-the-art approaches addressed negation and speculation detection tasks using rule-based methods, but in the last few years, models based on machine learning and deep learning exploiting morphological, syntactic, and semantic features represented as spare and dense vectors have emerged. However, although such methods of named entity recognition (NER) employ a broad set of features, they are limited to existing pretrained models for a specific domain or language. Objective As a fundamental subsystem of any information extraction pipeline, a system for cross-lingual and domain-independent negation and speculation detection was introduced with special focus on the biomedical scientific literature and clinical narrative. In this work, detection of negation and speculation was considered as a sequence-labeling task where cues and the scopes of both phenomena are recognized as a sequence of nested labels recognized in a single step. Methods We proposed the following two approaches for negation and speculation detection: (1) bidirectional long short-term memory (Bi-LSTM) and conditional random field using character, word, and sense embeddings to deal with the extraction of semantic, syntactic, and contextual patterns and (2) bidirectional encoder representations for transformers (BERT) with fine tuning for NER. Results The approach was evaluated for English and Spanish languages on biomedical and review text, particularly with the BioScope corpus, IULA corpus, and SFU Spanish Review corpus, with F-measures of 86.6%, 85.0%, and 88.1%, respectively, for NeuroNER and 86.4%, 80.8%, and 91.7%, respectively, for BERT. Conclusions These results show that these architectures perform considerably better than the previous rule-based and conventional machine learning–based systems. Moreover, our analysis results show that pretrained word embedding and particularly contextualized embedding for biomedical corpora help to understand complexities inherent to biomedical text.

Download Full-text

The Impact of Pretrained Language Models on Negation and Speculation Detection in Cross-Lingual Medical Text: Comparative Study (Preprint)

10.2196/preprints.18953 ◽

2020 ◽

Author(s):

Renzo Rivera Zavala ◽

Paloma Martinez

Keyword(s):

Machine Learning ◽

Information Extraction ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Special Focus ◽

Rule Based ◽

Clinical Narrative ◽

Cross Lingual ◽

The Impact

BACKGROUND Negation and speculation are critical elements in natural language processing (NLP)-related tasks, such as information extraction, as these phenomena change the truth value of a proposition. In the clinical narrative that is informal, these linguistic facts are used extensively with the objective of indicating hypotheses, impressions, or negative findings. Previous state-of-the-art approaches addressed negation and speculation detection tasks using rule-based methods, but in the last few years, models based on machine learning and deep learning exploiting morphological, syntactic, and semantic features represented as spare and dense vectors have emerged. However, although such methods of named entity recognition (NER) employ a broad set of features, they are limited to existing pretrained models for a specific domain or language. OBJECTIVE As a fundamental subsystem of any information extraction pipeline, a system for cross-lingual and domain-independent negation and speculation detection was introduced with special focus on the biomedical scientific literature and clinical narrative. In this work, detection of negation and speculation was considered as a sequence-labeling task where cues and the scopes of both phenomena are recognized as a sequence of nested labels recognized in a single step. METHODS We proposed the following two approaches for negation and speculation detection: (1) bidirectional long short-term memory (Bi-LSTM) and conditional random field using character, word, and sense embeddings to deal with the extraction of semantic, syntactic, and contextual patterns and (2) bidirectional encoder representations for transformers (BERT) with fine tuning for NER. RESULTS The approach was evaluated for English and Spanish languages on biomedical and review text, particularly with the BioScope corpus, IULA corpus, and SFU Spanish Review corpus, with F-measures of 86.6%, 85.0%, and 88.1%, respectively, for NeuroNER and 86.4%, 80.8%, and 91.7%, respectively, for BERT. CONCLUSIONS These results show that these architectures perform considerably better than the previous rule-based and conventional machine learning–based systems. Moreover, our analysis results show that pretrained word embedding and particularly contextualized embedding for biomedical corpora help to understand complexities inherent to biomedical text.

Download Full-text

Toward Complete Structured Information Extraction from Radiology Reports Using Machine Learning

Journal of Digital Imaging ◽

10.1007/s10278-019-00234-y ◽

2019 ◽

Vol 32 (4) ◽

pp. 554-564 ◽

Cited By ~ 7

Author(s):

Jackson M. Steinkamp ◽

Charles Chambers ◽

Darco Lalevic ◽

Hanna M. Zafar ◽

Tessa S. Cook

Keyword(s):

Machine Learning ◽

Information Extraction ◽

Radiology Reports ◽

Structured Information

Download Full-text

Information extraction on novel text using machine learning and rule-based system

2017 International Conference on Innovative and Creative Information Technology (ICITech) ◽

10.1109/innocit.2017.8319148 ◽

2017 ◽

Cited By ~ 2

Author(s):

Ria Chaniago ◽

Masayu Leylia Khodra

Keyword(s):

Machine Learning ◽

Information Extraction ◽

Rule Based ◽

Rule Based System

Download Full-text

A Brief Survey on Text Classification Using Various Machine Learning Techniques

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v8i1.521 ◽

2018 ◽

Vol 8 (1) ◽

pp. 14

Author(s):

Padmavathi .S ◽

M. Chidambaram

Keyword(s):

Machine Learning ◽

Text Classification ◽

Fixed Number ◽

Machine Learning Techniques ◽

Online Information ◽

Rule Based ◽

Learning Techniques ◽

Machine Learning Approach ◽

Rule Based Approach

Text classification has grown into more significant in managing and organizing the text data due to tremendous growth of online information. It does classification of documents in to fixed number of predefined categories. Rule based approach and Machine learning approach are the two ways of text classification. In rule based approach, classification of documents is done based on manually defined rules. In Machine learning based approach, classification rules or classifier are defined automatically using example documents. It has higher recall and quick process. This paper shows an investigation on text classification utilizing different machine learning techniques.

Download Full-text

A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v7i2.3459 ◽

2013 ◽

Vol 7 (2) ◽

pp. 574-579 ◽

Cited By ~ 3

Author(s):

Dr Sunitha Abburu ◽

G. Suresh Babu

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Research Work ◽

Web Pages ◽

Web Documents ◽

E Learning ◽

Structured Information ◽

Frame Work ◽

Effective Decision ◽

The Web

Day by day the volume of information availability in the web is growing significantly. There are several data structures for information available in the web such as structured, semi-structured and unstructured. Majority of information in the web is presented in web pages. The information presented in web pages is semi-structured.Â But the information required for a context are scattered in different web documents. It is difficult to analyze the large volumes of semi-structured information presented in the web pages and to make decisions based on the analysis. The current research work proposed a frame work for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies Â data extraction, data consolidation, data analysis and decision making based on the information presented in the web pages.The proposed frame work integrates web crawling, information extraction and data mining technologies for better information analysis that helps in effective decision making.Â Â It enables people and organizations to extract information from various sourses of web and to make an effective analysis on the extracted data for effective decision making.Â The proposed frame work is applicable for any application domain. Manufacturing,sales,tourisum,e-learning are various application to menction few.The frame work is implemetnted and tested for the effectiveness of the proposed system and the results are promising.

Download Full-text