medExtractR: A medication extraction algorithm for electronic health records using the R programming language
ABSTRACT

Objective
We developed medExtractR, a natural language processing system to extract medication dose and timing information from clinical notes. Our system facilitates creation of medication-specific research datasets from electronic health records.

Materials and Methods
Written using the R programming language, medExtractR combines lexicon dictionaries and regular expression patterns to identify relevant medication information (‘drug entities’). The system is designed to extract particular medications of interest, rather than all possible medications mentioned in a clinical note. MedExtractR was developed on notes from Vanderbilt University’s Synthetic Derivative, using two medications (tacrolimus and lamotrigine) whose prescribing patterns vary in complexity; a third drug (allopurinol) was used to test generalizability. We evaluated medExtractR and compared it to three existing systems: MedEx, MedXN, and CLAMP.

Results
On 50 test notes for each development drug and 110 test notes for the additional drug, medExtractR achieved high overall performance (F-measures > 0.95). This exceeded the performance of the three existing systems across all drugs, with the exception of a couple of specific entity-level evaluations, including dose amount for lamotrigine and allopurinol.

Discussion
MedExtractR successfully extracted medication entities for medications of interest. High performance in entity-level extraction tasks provides a strong foundation for developing robust research datasets for pharmacological research. However, its targeted approach provides a narrower scope compared with existing systems.

Conclusion
MedExtractR (available as an R package) achieved high performance values in extracting specific medications from clinical text, leading to higher-quality research datasets for drug-related studies than some existing general-purpose medication extraction tools.
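
To make the targeted approach described under Materials and Methods concrete, the following is a minimal sketch of how a drug-name lexicon can be combined with regular expressions to pull entities from a window of text after each mention of a drug of interest. It is written in Python purely for illustration and is not the medExtractR implementation or its API: the function name `extract_entities`, the tiny lexicon, the two patterns, and the 30-character window are all assumptions made for this example.

```python
import re

# Hypothetical mini-lexicon of drug names of interest (illustrative only;
# not the dictionaries shipped with medExtractR).
DRUG_NAMES = ["tacrolimus", "prograf", "lamotrigine", "lamictal"]

# Illustrative regular expressions for two entity types.
STRENGTH = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg)\b", re.IGNORECASE)
FREQUENCY = re.compile(
    r"\b(?:bid|tid|qd|daily|(?:once|twice) (?:a|per) day)\b", re.IGNORECASE
)


def extract_entities(note, drug_names=DRUG_NAMES, window=30):
    """Return one list of (label, text, start_offset) tuples per drug mention.

    Only drugs in `drug_names` are searched for; other medications in the
    note are ignored, which mirrors the targeted design in spirit.
    """
    drug_pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in drug_names) + r")\b",
        re.IGNORECASE,
    )
    results = []
    for mention in drug_pattern.finditer(note):
        # Look for related entities in a fixed-size window after the drug name.
        start = mention.end()
        snippet = note[start:start + window]
        entities = [("DrugName", mention.group(0), mention.start())]
        for label, pattern in (("Strength", STRENGTH), ("Frequency", FREQUENCY)):
            for hit in pattern.finditer(snippet):
                # Convert the offset within the window to an offset in the note.
                entities.append((label, hit.group(0), start + hit.start()))
        results.append(entities)
    return results


note = "Patient continues tacrolimus 1 mg bid. Also takes aspirin 81 mg daily."
for entities in extract_entities(note):
    for label, text, offset in entities:
        print(label, repr(text), offset)
```

In this sketch the aspirin dose is not returned because aspirin is not in the lexicon and falls outside the search window of the tacrolimus mention. A fixed character window is a simplification chosen here for brevity; it can both miss distant entities and pick up entities that belong to a neighbouring drug.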