Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach

Author(s):  
Wei-Hung Weng ◽  
Kavishwar B. Wagholikar ◽  
Alexa T. McCray ◽  
Peter Szolovits ◽  
Henry C. Chueh

Sentiment classification is one of the best-known and most popular domains of machine learning and natural language processing: algorithms are developed to understand the opinion expressed toward an entity, much as a human reader would. This research article addresses that task. Natural language processing concepts are used for text representation, and a novel word embedding model is then proposed for effective classification of the data. TF-IDF and the common bag-of-words (BoW) models are considered for representing the text data, and the importance of these models is discussed in the respective sections. The proposed model is evaluated on the IMDB dataset, using a 50% training / 50% testing split with three random shufflings of the data.
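A minimal sketch of the evaluation protocol stated above follows; the tiny stand-in reviews, the logistic regression classifier, and the use of scikit-learn are illustrative assumptions, since the abstract does not name an implementation:

```python
# A minimal sketch of the stated protocol: BoW and TF-IDF representations,
# a 50%/50% train/test split, three random shuffles. The toy reviews and
# the logistic regression classifier are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

texts = ["a gripping, well-acted film", "dull plot and wooden dialogue",
         "an absolute joy to watch", "tedious and badly paced"]  # stand-ins for IMDB reviews
labels = [1, 0, 1, 0]                                            # 1 = positive, 0 = negative

for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    scores = []
    for seed in range(3):                         # three random shuffles of the data
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.5, random_state=seed, stratify=labels)
        X_tr = vectorizer.fit_transform(X_train)  # fit vocabulary on the training half only
        X_te = vectorizer.transform(X_test)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_te)))
    print(type(vectorizer).__name__, "mean accuracy:", sum(scores) / len(scores))
```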


Circulation ◽  
2020 ◽  
Vol 141 (Suppl_1) ◽  
Author(s):  
Yiqing Zhao ◽  
Sunyang Fu ◽  
Suzette J Bielinski ◽  
Paul Decker ◽  
Alanna M Chamberlain ◽  
...  

Background: The focus of most existing phenotyping algorithms based on electronic health record (EHR) data has been to accurately identify cases and non-cases of specific diseases. A more challenging task, however, is to accurately identify disease incidence, since identifying the first occurrence of disease is more important for efficient and valid clinical and epidemiological research. Moreover, stroke is a challenging phenotype due to diagnostic difficulty and common miscoding. This task generally requires multiple types of EHR data (e.g., diagnosis and procedure codes, unstructured clinical notes) and a more robust algorithm integrating both natural language processing and machine learning. In this study, we developed and validated an EHR-based classifier to accurately identify stroke incidence among a cohort of atrial fibrillation (AF) patients.

Methods: We developed a stroke phenotyping algorithm using International Classification of Diseases, Ninth Revision (ICD-9) codes, Current Procedural Terminology (CPT) codes, and expert-provided keywords as model features. Structured data were extracted from the Rochester Epidemiology Project (REP) database. Natural language processing (NLP) was used to extract and validate keyword occurrences in clinical notes. A window of ±30 days was used when including or excluding keywords/codes in the input vector, and frequencies of keywords/codes served as the input feature sets for model training. Multiple competing models were trained using various combinations of feature sets and two machine learning algorithms: logistic regression and random forest. Training data were provided by two nurse abstractors and included validated stroke incidences from a previously established atrial fibrillation cohort. Precision, recall, and F-score were calculated to assess and compare model performance.

Results: Among 4,914 patients with atrial fibrillation, 1,773 patients were screened; the remaining 3,141 patients had no stroke-related codes or keywords and were presumed to be free of stroke during follow-up. Among the screened patients, 740 had validated strokes and 1,033 did not, based on review of the EHR by trained nurse abstractors. The best-performing stroke incidence phenotyping classifier used the Keywords+ICD-9+CPT feature set with a random forest classifier, achieving a precision of 0.942, recall of 0.943, and F-score of 0.943.

Conclusion: We developed and validated a stroke algorithm that performed well for identifying stroke incidence in an enriched population (an AF cohort), extending beyond the typical binary case/non-case stroke identification problem. Future work will test the generalizability of this algorithm in a general population.
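A minimal sketch of the modeling step follows; this is not the authors' code, and the feature names, the toy count matrix, and the cross-validated comparison are illustrative assumptions:

```python
# A minimal sketch of the comparison described above: per-patient frequencies
# of expert keywords and ICD-9/CPT codes as features, evaluated with logistic
# regression and random forest. Feature names and values are made up.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_predict

# Rows: patients. Columns: counts of each keyword/code observed within the
# +/-30-day window around the candidate event.
feature_names = ["kw_hemiparesis", "kw_aphasia", "icd9_434.91", "cpt_70450"]
X = np.array([[2, 0, 1, 1], [0, 0, 0, 1], [1, 2, 1, 0],
              [0, 1, 0, 0], [3, 1, 2, 1], [0, 0, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = nurse-validated stroke incidence

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    pred = cross_val_predict(model, X, y, cv=2)   # out-of-fold predictions
    p, r, f, _ = precision_recall_fscore_support(y, pred, average="binary")
    print(type(model).__name__, f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```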


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e18093-e18093
Author(s):  
Christi French ◽  
Maciek Makowski ◽  
Samantha Terker ◽  
Paul Alexander Clark

Background: Pulmonary nodule incidental findings challenge providers to balance resource efficiency with high clinical quality. Incidental findings tend to be undertreated, with studies reporting appropriate follow-up rates as low as 29%. Ensuring appropriate follow-up on all incidental findings is labor-intensive: it requires clinically reading and classifying radiology reports to identify high-risk lung nodules. We tested the feasibility of automating this process with natural language processing (NLP) and machine learning (ML).

Methods: In cooperation with Sarah Cannon Research Institute (SCRI), we conducted a series of data science experiments applying NLP and ML techniques to 8,879 free-text, narrative CT (computerized tomography) radiology reports. The reports were dated Dec 8, 2015 - April 23, 2017, came from SCRI-affiliated Emergency Department, Inpatient, and Outpatient facilities, and were a representative, random sample of the patient populations. Reports were divided into a development set for model training and validation and a test set for evaluating model performance. Two models were developed: a "Nodule Model" trained to detect the reported presence of a pulmonary nodule, and a rules-based "Sizing Model" developed to extract the size of the nodule in millimeters. Reports were bucketed into three prediction groups: ≥6 mm, <6 mm, and no size indicated. Nodules were considered positive and placed in a follow-up queue if the nodule was predicted ≥6 mm, or if no size was indicated and the radiology report contained the word "mass." The Fleischner Society Guidelines and clinical review informed these definitions.

Results: Precision and recall were calculated for multiple model thresholds. A threshold was selected based on the validation-set calculations, with a success criterion of 90% queue precision chosen to minimize false positives. On the test dataset, the F1 measure of the entire pipeline (lung nodule classification model plus size extraction model) was 72.9%, recall was 60.3%, and queue precision was 90.2%, exceeding the success criterion.

Conclusions: These experiments demonstrate the feasibility of NLP and ML technology for automating the detection and classification of pulmonary nodule incidental findings in radiology reports. This approach promises to improve healthcare quality by increasing the rate of appropriate lung nodule incidental finding follow-up and treatment without excessive labor or risk of overutilization.
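A minimal sketch of the triage logic follows; the regex-based sizing and the keyword stub for the "Nodule Model" are assumptions standing in for the trained components, and all function names are hypothetical:

```python
# A minimal sketch of the follow-up queue rule described above: queue a report
# if a nodule is present and predicted >= 6 mm, or if no size is stated and
# the report mentions "mass". The trained "Nodule Model" is stubbed out as a
# keyword check; the rules-based "Sizing Model" is approximated with a regex.
import re

SIZE_MM = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)", re.IGNORECASE)

def nodule_reported(report: str) -> bool:
    """Stand-in for the trained 'Nodule Model' (here: a simple keyword check)."""
    return "nodule" in report.lower()

def extract_size_mm(report: str):
    """Rules-based sizing: return the largest reported size, converted to mm."""
    sizes = [float(value) * (10 if unit.lower() == "cm" else 1)
             for value, unit in SIZE_MM.findall(report)]
    return max(sizes) if sizes else None

def needs_followup(report: str) -> bool:
    if not nodule_reported(report):
        return False
    size = extract_size_mm(report)
    if size is None:                       # no size indicated
        return "mass" in report.lower()    # queue only if "mass" appears
    return size >= 6.0                     # Fleischner-informed 6 mm threshold

print(needs_followup("8 mm pulmonary nodule in the right upper lobe"))  # True
print(needs_followup("tiny 3 mm nodule, likely benign"))                # False
print(needs_followup("nodule vs mass, size not characterized"))         # True
```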


2014 ◽  
Vol 8 (3) ◽  
pp. 227-235 ◽  
Author(s):  
Cíntia Matsuda Toledo ◽  
Andre Cunha ◽  
Carolina Scarton ◽  
Sandra Aluísio

Discourse production is an important aspect in the evaluation of brain-injured individuals. We believe that studies comparing the performance of brain-injured subjects with that of healthy controls must use groups with comparable education. A pioneering application of machine learning methods to Brazilian Portuguese for clinical purposes is described, highlighting education as an important variable in the Brazilian scenario.

OBJECTIVE: The aims were to describe how to: (i) develop machine learning classifiers using features generated by natural language processing tools to distinguish descriptions produced by healthy individuals into classes based on their years of education; and (ii) automatically identify the features that best distinguish the groups.

METHODS: The approach proposed here extracts linguistic features automatically from the written descriptions with the aid of two natural language processing tools, Coh-Metrix-Port and AIC. It also includes nine task-specific features (three new ones and two extracted manually, in addition to description time; type of scene described - simple or complex; presentation order - which type of picture was described first; and age). In this study, the descriptions produced by 144 of the subjects studied in Toledo18, a study that included 200 healthy Brazilians of both genders, were used.

RESULTS AND CONCLUSION: A Support Vector Machine (SVM) with a radial basis function (RBF) kernel is the most recommended approach for the binary classification of our data, classifying three of the four initial classes. CfsSubsetEval (CFS) is a strong candidate to replace manual feature selection methods.
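A minimal sketch of the recommended setup follows; CfsSubsetEval belongs to Weka and has no direct scikit-learn equivalent, so a univariate selector stands in for CFS here, and the toy feature matrix of linguistic metrics is an assumption:

```python
# A minimal sketch of the recommended classifier: an RBF-kernel SVM over
# linguistic features, with automatic feature selection. SelectKBest stands
# in for Weka's CfsSubsetEval; the feature values below are made up.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rows: written scene descriptions; columns: Coh-Metrix-Port-style metrics
# (e.g., word count, mean sentence length, type-token ratio) plus a
# task-specific feature such as description time in seconds.
X = np.array([[120, 14.2, 0.61, 38.0], [45, 9.1, 0.82, 95.0],
              [150, 16.8, 0.55, 41.5], [60, 10.3, 0.79, 88.0],
              [130, 15.0, 0.58, 36.0], [52, 8.7, 0.85, 99.0]])
y = np.array([1, 0, 1, 0, 1, 0])  # binary education-level classes

clf = make_pipeline(StandardScaler(),
                    SelectKBest(f_classif, k=2),  # stand-in for CFS
                    SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
print(clf.predict([[110, 13.5, 0.60, 40.0]]))     # expect class 1
```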


2014 ◽  
Vol 66 (11) ◽  
pp. 1740-1748 ◽  
Author(s):  
Chengyi Zheng ◽  
Nazia Rashid ◽  
Yi-Lin Wu ◽  
River Koblick ◽  
Antony T. Lin ◽  
...  
