Pathologic findings in reduction mammoplasty procedures identified by natural language processing of breast pathology reports: A surrogate for the population incidence of cancer and high risk lesions.

PURPOSE Robust institutional tumor banks depend on continuous sample curation; otherwise, subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. We aimed to develop a natural language processing method that dynamically identifies the pathology specimen elements needed to locate specimens for future use, in a manner that other institutions can re-implement. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models showed strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.
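
The 10-fold cross-validation mentioned in the abstract above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the function names `k_fold_indices` and `cross_validated_accuracy`, the `fit`/`predict` callables, and the seed are all assumptions introduced here, and the abstract does not say which classifier or splitting strategy was used.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Return a list of (train, test) index lists for k-fold cross-validation.

    Hypothetical helper: shuffles the indices once, then cuts them into k
    folds whose sizes differ by at most one.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    splits = []
    for i in range(k):
        test = folds[i]
        # Every fold except the held-out one contributes to training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validated_accuracy(items, labels, fit, predict, k=10, seed=0):
    """Pool held-out predictions over all folds and return overall accuracy.

    `fit(train_items, train_labels)` returns a model and
    `predict(model, item)` returns a label; both are supplied by the caller.
    """
    correct = 0
    for train, test in k_fold_indices(len(items), k, seed):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, items[i]) == labels[i] for i in test)
    return correct / len(items)
```

Each report is held out exactly once, so the pooled accuracy is computed over all reports rather than over a single test split.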

The Use of Natural Language Processing Tools and the Integrated Electronic Medical Record (EMR) to Identify Patients at High-Risk for Transthyretin Cardiac Amyloidosis

Journal of the American College of Cardiology, 2021, Vol 77 (18), pp. 839. DOI: 10.1016/s0735-1097(21)02198-7

Author(s): Trejeeve Martyn, Joshua Saef, Jerry Estep, Patrick Collier, Deborah Kwon, ...

Keyword(s): Natural Language Processing, High Risk, Natural Language, Medical Record, Electronic Medical Record, Language Processing, Cardiac Amyloidosis

Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study (Preprint)

2020. DOI: 10.2196/preprints.20826

Author(s): Carlos R Oliveira, Patrick Niccolai, Anette Michelle Ortiz, Sangini S Sheth, Eugene D Shapiro, ...

Keyword(s): Natural Language Processing, Human Papillomavirus, Natural Language, Language Processing, Medical Records, Processing Algorithm, Accurate Identification, Pathology Reports, Manual Review, Natural Language Processing Algorithm

BACKGROUND Accurate identification of new diagnoses of human papillomavirus–associated cancers and precancers is an important step toward the development of strategies that optimize the use of human papillomavirus vaccines. The diagnosis of human papillomavirus cancers hinges on a histopathologic report, which is typically stored in electronic medical records as free-form, or unstructured, narrative text. Previous efforts to perform surveillance for human papillomavirus cancers have relied on the manual review of pathology reports to extract diagnostic information, a process that is both labor- and resource-intensive. Natural language processing can be used to automate the structuring and extraction of clinical data from unstructured narrative text in medical records and may provide a practical and effective method for identifying patients with vaccine-preventable human papillomavirus disease for surveillance and research. OBJECTIVE This study's objective was to develop and assess the accuracy of a natural language processing algorithm for the identification of individuals with cancer or precancer of the cervix and anus. METHODS A pipeline-based natural language processing algorithm was developed, which incorporated machine learning and rule-based methods to extract diagnostic elements from the narrative pathology reports. To test the algorithm’s classification accuracy, we used a split-validation study design. Full-length cervical and anal pathology reports were randomly selected from 4 clinical pathology laboratories. Two study team members, blinded to the classifications produced by the natural language processing algorithm, manually and independently reviewed all reports and classified them at the document level according to 2 domains (diagnosis and human papillomavirus testing results). Using the manual review as the gold standard, the algorithm’s performance was evaluated using standard measurements of accuracy, recall, precision, and F-measure. 
RESULTS The natural language processing algorithm’s performance was validated on 949 pathology reports. The algorithm demonstrated accurate identification of abnormal cytology, histology, and positive human papillomavirus tests with accuracies greater than 0.91. Precision was lowest for anal histology reports (0.87, 95% CI 0.59-0.98) and highest for cervical cytology (0.98, 95% CI 0.95-0.99). The natural language processing algorithm missed 2 out of the 15 abnormal anal histology reports, which led to a relatively low recall (0.68, 95% CI 0.43-0.87). CONCLUSIONS This study outlines the development and validation of a freely available and easily implementable natural language processing algorithm that can automate the extraction and classification of clinical data from cervical and anal cytology and histology.
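
The document-level evaluation described above (manual review as the gold standard, scored by accuracy, recall, precision, and F-measure) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `document_level_metrics` and the boolean encoding (True for an abnormal or positive report) are hypothetical, F-measure is taken to be the balanced F1, and confidence intervals are not computed.

```python
def document_level_metrics(gold, predicted):
    """Compare algorithm labels against gold-standard labels, one per report.

    Both arguments are equal-length sequences of booleans, where True marks
    a report classified as abnormal/positive (an encoding assumed here).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)

    def ratio(numerator, denominator):
        # Report 0.0 instead of dividing by zero when a class is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)  # flagged reports that were truly abnormal
    recall = ratio(tp, tp + fn)     # truly abnormal reports that were flagged
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```

A report the algorithm misses counts as a false negative and lowers recall, while a report it flags incorrectly counts as a false positive and lowers precision.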
