Implementation of a Cohort Retrieval System for Clinical Data Repositories Using the Observational Medical Outcomes Partnership Common Data Model: Proof-of-Concept System Validation (Preprint)

Mapping Intimacies ◽

10.2196/preprints.17376 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sijia Liu ◽

Yanshan Wang ◽

Andrew Wen ◽

Liwei Wang ◽

Na Hong ◽

...

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Data Model ◽

Structured Data ◽

Common Data Model ◽

Concept System ◽

Unstructured Text ◽

Electronic Health

BACKGROUND Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. OBJECTIVE In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text—Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). METHODS CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support the common data model concept search utilizing information retrieval techniques and frameworks. RESULTS Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient-level and document-level, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text with mean precision at 5 values of 0.54 and 0.74, respectively. CONCLUSIONS The implementation and evaluation of Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that only use one of either structured data or unstructured text in complex textual cohort queries.

Download Full-text

Implementation of a Cohort Retrieval System for Clinical Data Repositories Using the Observational Medical Outcomes Partnership Common Data Model: Proof-of-Concept System Validation

JMIR Medical Informatics ◽

10.2196/17376 ◽

2020 ◽

Vol 8 (10) ◽

pp. e17376 ◽

Cited By ~ 1

Author(s):

Sijia Liu ◽

Yanshan Wang ◽

Andrew Wen ◽

Liwei Wang ◽

Na Hong ◽

...

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Data Model ◽

Structured Data ◽

Common Data Model ◽

Concept System ◽

Unstructured Text ◽

Electronic Health

Background Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. Objective In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text—Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). Methods CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support the common data model concept search utilizing information retrieval techniques and frameworks. Results Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient-level and document-level, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text with mean precision at 5 values of 0.54 and 0.74, respectively. Conclusions The implementation and evaluation of Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that only use one of either structured data or unstructured text in complex textual cohort queries.

Download Full-text

A comparison of structured data query methods versus natural language processing to identify metastatic melanoma cases from electronic health records

International Journal of Computational Medicine and Healthcare ◽

10.1504/ijcmh.2019.104364 ◽

2019 ◽

Vol 1 (1) ◽

pp. 101 ◽

Cited By ~ 1

Author(s):

Jinghua He ◽

Lawrence Mark ◽

Charity Hilton ◽

Joel Martin ◽

Jarod Baker ◽

...

Keyword(s):

Natural Language Processing ◽

Electronic Health Records ◽

Natural Language ◽

Metastatic Melanoma ◽

Language Processing ◽

Structured Data ◽

Health Records ◽

Data Query ◽

Electronic Health

Download Full-text

Indonesian Information Extraction : Challenges and Opportunities

JATISI (Jurnal Teknik Informatika dan Sistem Informasi) ◽

10.35957/jatisi.v8i1.710 ◽

2021 ◽

Vol 8 (1) ◽

pp. 421-429

Author(s):

Yan Puspitarani

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Research Trends ◽

Structured Data ◽

Daily Lives ◽

Unstructured Text ◽

Challenges And Opportunities ◽

Data Source

Information extraction is part of natural language processing, aiming to find, retrieve, or process information. The data source for information extraction is text. Text cannot be separated from people's daily lives. Through text, a lot of confidential information can be obtained. To produce information, the unstructured text will be converted into structured data. There are many approaches that researchers take to this process. Most of the studies are in English. Therefore, this paper will present current research trends, challenges, and information extraction opportunities using Indonesian.

Download Full-text

Natural language processing tool for automatic diseases and drugs recognition from electronic health records in polish- pilot study

European Heart Journal ◽

10.1093/eurheartj/ehab724.3169 ◽

2021 ◽

Vol 42 (Supplement_1) ◽

Author(s):

C M Maciejewski ◽

M K Krajsman ◽

K O Ozieranski ◽

M B Basza ◽

M G Gawalko ◽

...

Keyword(s):

Natural Language Processing ◽

Electronic Health Records ◽

Natural Language ◽

Language Processing ◽

Structured Data ◽

Health Records ◽

Funding Sources ◽

Natural Language Processing Tool ◽

Electronic Health ◽

Automatic Tool

Abstract Background An estimate of 80% of data gathered in electronic health records is unstructured, textual information that cannot be utilized for research purposes until it is manually coded into a database. Manual coding is a both cost and time- consuming process. Natural language processing (NLP) techniques may be utilized for extraction of structured data from text. However, little is known about the accuracy of data obtained through these methods. Purpose To evaluate the possibility of employing NLP techniques in order to obtain data regarding risk factors needed for CHA2DS2VASc scale calculation and detection of antithrombotic medication prescribed in the population of atrial fibrillation (AF) patients of a cardiology ward. Methods An automatic tool for diseases and drugs recognition based on regular expressions rules was designed through cooperation of physicians and IT specialists. Records of 194 AF patients discharged from a cardiology ward were manually reviewed by a physician- annotator as a comparator for the automatic approach. Results Median CHA2DS2VASc score calculated by the automatic was 3 (IQR 2–4) versus 3 points (IQR 2–4) for the manual method (p=0.66). High agreement between CHA2DS2VASc scores calculated by both methods was present (Kendall's W=0.979; p<0.001). In terms of anticoagulant recognition, the automatic tool misqualified the drug prescribed in 4 cases. Conclusion NLP-based techniques are a promising tools for obtaining structured data for research purposes from electronic health records in polish. Tight cooperation of physicians and IT specialists is crucial for establishing accurate recognition patterns. Funding Acknowledgement Type of funding sources: None.

Download Full-text

Cohort profile: St. Michael's Hospital Tuberculosis Database (SMH-TB), a retrospective cohort of electronic health record data and variables extracted using natural language processing

10.1101/2020.09.11.20192419 ◽

2020 ◽

Author(s):

David Landsman ◽

Ahmed Abdelbasit ◽

Christine Wang ◽

Michael Guerzhoy ◽

Ujash Joshi ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Epidemiology ◽

Relevant Information ◽

Structured Data ◽

Free Text ◽

Electronic Health Record Data ◽

Latent Tb ◽

Electronic Health

Background Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael's Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment. Methods We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael's Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free-text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors. Results SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N=3237 with at least 1 associated dictation). Performance of TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated 20% (95% CI: 18.4-21.2%) were diagnosed with active TB and 46% (95% CI: 43.8-47.2%) were diagnosed with latent TB. After adjusting for potential misclassification, the proportion of patients diagnosed with active and latent TB was 18% (95% CI: 16.8-19.7%) and 40% (95% CI: 37.8-41.6%) respectively Conclusion SMH-TB is a unique database that includes a breadth of structured data derived from structured and unstructured EHR data. The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modelling studies.

Download Full-text

Cohort profile: St. Michael’s Hospital Tuberculosis Database (SMH-TB), a retrospective cohort of electronic health record data and variables extracted using natural language processing

PLoS ONE ◽

10.1371/journal.pone.0247872 ◽

2021 ◽

Vol 16 (3) ◽

pp. e0247872

Author(s):

David Landsman ◽

Ahmed Abdelbasit ◽

Christine Wang ◽

Michael Guerzhoy ◽

Ujash Joshi ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Epidemiology ◽

Relevant Information ◽

Structured Data ◽

Free Text ◽

Electronic Health Record Data ◽

Latent Tb ◽

Electronic Health

Background Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael’s Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment. Methods We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael’s Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free-text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors. Results SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N = 3237 with at least 1 associated dictation). Performance of TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated 20% (95% CI: 18.4–21.2%) were diagnosed with active TB and 46% (95% CI: 43.8–47.2%) were diagnosed with latent TB. After adjusting for potential misclassification, the proportion of patients diagnosed with active and latent TB was 18% (95% CI: 16.8–19.7%) and 40% (95% CI: 37.8–41.6%) respectively Conclusion SMH-TB is a unique database that includes a breadth of structured data derived from structured and unstructured EHR data by using NLP rulesets. The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modeling studies.

Download Full-text

Common data model for natural language processing based on two existing standard information models: CDA+GrAF

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2011.11.018 ◽

2012 ◽

Vol 45 (4) ◽

pp. 703-710 ◽

Cited By ~ 2

Author(s):

Stéphane M. Meystre ◽

Sanghoon Lee ◽

Chai Young Jung ◽

Raphaël D. Chevrier

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Data Model ◽

Common Data Model ◽

Information Models ◽

Standard Information

Download Full-text

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval

10.1145/3342827 ◽

2019 ◽

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

International Conference

Download Full-text

Thai Fake News Detection Based on Information Retrieval, Natural Language Processing and Machine Learning

SN Computer Science ◽

10.1007/s42979-021-00775-6 ◽

2021 ◽

Vol 2 (6) ◽

Author(s):

Phayung Meesad

Keyword(s):

Machine Learning ◽

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Fake News

Download Full-text

Report on the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries at SIGIR 2019

ACM SIGIR Forum ◽

10.1145/3458553.3458554 ◽

2019 ◽

Vol 53 (2) ◽

pp. 3-10

Author(s):

Muthu Kumar Chandrasekaran ◽

Philipp Mayr

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Research And Development ◽

Language Processing ◽

Digital Libraries ◽

State Of The Art ◽

Shared Task ◽

Processing Information ◽

Joint Workshop

The 4 th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5 th edition of the CL-SciSumm Shared Task.

Download Full-text