Development of a Medical-text Parsing Algorithm Based on Character Adjacent Probability Distribution for Japanese Radiology Reports
Summary Objectives: The objectives of this study were to investigate the transitional probability distribution of medical term boundaries between characters and to develop a parsing algorithm specifically for medical texts. Methods: Medical terms in Japanese computed tomography (CT) reports were identified using the ChaSen morphological analysis system. MeSH-based medical terms (51,385 entries), obtained from the metathesaurus in the Unified Medical Language System (UMLS, 2005AA), were added as a medical dictionary for ChaSen. A radiographer corrected the set of results containing 300 parsed CT reports. In addition, two radiologists checked the medical term parsing of 200 CT sentences. Results: We obtained modified inter-annotator agreement scores for the text corrected by the radiologists. We retrieved the transitional probability as the conditional probability of a uni-gram, bi-gram, and tri-gram. The highest transitional probability P(Ci | Ci - 2*Ci - 1) was 1.00. For an example of anatomical location, the term “pulmonary hilum” was parsed as a tri-gram. Conclusions: Retrieval of transitional probability will improve the accuracy of parsing compound medical terms.