Professional language in Swedish clinical text: Linguistic characterization and comparative studies

2014 ◽  
Vol 37 (2) ◽  
pp. 297-323 ◽  
Author(s):  
Kelly Smith ◽  
Beata Megyesi ◽  
Sumithra Velupillai ◽  
Maria Kvist

This study investigates the linguistic characteristics of Swedish clinical text in radiology reports and doctors' daily notes from electronic health records (EHRs) in comparison to general Swedish and biomedical journal text. We quantify linguistic features through a comparative register analysis to determine how the free text of EHRs differs from general and biomedical Swedish text in terms of lexical complexity, word and sentence composition, and common sentence structures. The linguistic features are extracted using state-of-the-art computational tools: a tokenizer, a part-of-speech tagger, and scripts for statistical analysis. Results show that technical terms and abbreviations are more frequent in clinical text, and lexical variance is low. Moreover, clinical text frequently omits subjects, verbs, and function words, resulting in shorter sentences. Clinical text differs not only from general Swedish but also internally, across its sub-domains; for example, sentences lacking verbs are significantly more frequent in radiology reports. These results provide a foundation for future development of automatic methods for EHR simplification or clarification.
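
As a rough illustration of two of the measures this abstract mentions (lexical variance and sentence length), the sketch below computes a type-token ratio and a mean sentence length. It is not the authors' pipeline: the regex tokenizer, the punctuation-based sentence splitter, and the function names are assumptions made for the example, and the sample sentences are invented.

```python
import re

def tokenize(text):
    # Naive tokenizer (an assumption, not the tool used in the study):
    # lowercase runs of word characters, which includes Swedish letters.
    return re.findall(r"\w+", text.lower())

def type_token_ratio(tokens):
    # Distinct word forms divided by total tokens; lower values mean
    # more repetition, i.e. lower lexical variance.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text):
    # Split on sentence-final punctuation and average the token counts.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

note = "Ingen feber. Ingen hosta."  # invented sample text
print(type_token_ratio(tokenize(note)))  # 0.75
print(mean_sentence_length(note))        # 2.0
```

A splitter this simple would miscount sentences in text with many abbreviations ending in a period, which is one reason a dedicated tokenizer matters for clinical notes.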

2020 ◽  
Author(s):  
Shintaro Tsuji ◽  
Andrew Wen ◽  
Naoki Takahashi ◽  
Hongjian Zhang ◽  
Katsuhiko Ogasawara ◽  
...  

BACKGROUND Named entity recognition (NER) plays an important role in extracting the features of descriptions for mining free-text radiology reports. However, the performance of existing NER tools is limited because the number of entities they recognize depends on dictionary lookup. The recognition of compound terms is especially complicated because they follow a variety of patterns. OBJECTIVE The objective of the study is to develop and evaluate an NER tool concerned with compound terms using RadLex for mining free-text radiology reports. METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 radiology reports for compound terms (Cts) in noun phrases and used them as the gold standard for the performance evaluation (precision, recall, and F-measure). We also created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it to another 100 radiology reports for validation. In addition, we evaluated the stem terms of compound terms by defining two measures: an occurrence ratio (OR) and a matching ratio (MR). RESULTS The F-measure of cTAKES+RadLex+GPD was 32.2% (Precision 92.1%, Recall 19.6%) and that of the pipeline combined with the CtED was 67.1% (Precision 98.1%, Recall 51.0%). The OR indicated that the stem terms "effusion", "node", "tube", and "disease" were used frequently, but the pipeline still fell short in capturing Cts. The MR showed that 71.9% of stem terms matched those of the ontologies, and RadLex improved the MR by about 22% over the cTAKES default dictionary. The OR and MR suggested that the characteristics of stem terms could help generate synonymous phrases using ontologies.
CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that CtED and stem term analysis have the potential to improve dictionary-based NER performance toward expanding vocabularies.
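
The F-measures reported in this abstract can be checked against the stated precision and recall. The sketch below assumes the F-measure is the balanced harmonic mean (F1); the abstract does not say which weighting was used, and the function name is chosen for this example.

```python
def f_measure(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall.
    # beta=1.0 gives the balanced F1 score (assumed here).
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the abstract.
print(round(f_measure(0.921, 0.196), 3))  # 0.323 (cTAKES+RadLex+GPD)
print(round(f_measure(0.981, 0.510), 3))  # 0.671 (with the CtED)
```

The second value agrees with the reported 67.1%. The first comes out at about 32.3% against the reported 32.2%, a gap that is plausibly due to rounding of the published precision and recall. Both cases show how low recall pulls the harmonic mean down even when precision is above 90%.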


1983 ◽  
Vol 17 ◽  
pp. 58-81
Author(s):  
Jan Erik Grezel ◽  
Hans Buiter ◽  
Ton van der Geest

In this paper the effect of monitoring in the acquisition of Dutch as a second language has been investigated in a descriptive design. The starting point was an experimental investigation carried out by Hulstijn (1982). As his investigation was restricted to only two variables, to a number of experimental conditions (an unnatural situation), and to correct sentences only, it was decided to replicate this investigation with the following alterations: (1) only natural data from three different situations were used. These data ranged from informal to formal: dialogue (informal), monologue (formal) and written report (formal); (2) all kinds of linguistic variables that were relevant for the acquisition stage of the subjects were scored: syntactic, morphological, and lexical variables, both correct and incorrect usage; (3) subjects were subdivided with respect to L1 into English and less related languages (Slavic), and with respect to L2 achievement according to the teacher: good and not so good achievers. Some results: 1. English-speaking subjects and good achievers generally had better scores on the variables under investigation. This means that these variables are valid to describe the language acquisition process of Dutch as L2. 2. Those linguistic features that are well acquired are under the domain of monitoring in such a way that under formal circumstances (more reflection time) fewer errors occurred: word order, content words, and those morphological phenomena that are essential for the meaning of the message (tense, plural). 3. Those linguistic features that are not internalized completely are under the domain of monitoring in such a way that under formal circumstances more errors occurred: morphological phenomena that are less relevant with respect to meaning (e.g. incorrect plurals), and function words. 4. English subjects and good achievers demonstrated more correct monitoring. 5. Results 1 to 4 fit quite well into the L1=L2 hypothesis.
There seems to be a universal order for language acquisition that is influenced only in minor points by the L1 of the language learner. These findings have some interesting consequences for L2 education.


Author(s):  
Steven N. Dworkin

This book describes the linguistic structures that constitute Medieval or Old Spanish as preserved in texts written prior to the beginning of the sixteenth century. It emphasizes those structures that contrast with the modern standard language. Chapter 1 presents methodological issues raised by the study of a language preserved only in written sources. Chapter 2 examines questions involved in reconstructing the sound system of Old Spanish before discussing relevant phonetic and phonological details. The chapter ends with an overview of Old Spanish spelling practices. Chapter 3 presents in some detail the nominal, verbal, and pronominal morphology of the language, with attention to regional variants. Chapter 4 describes selected syntactic structures, with emphasis on the noun phrase, verb phrase, object pronoun placement, subject-verb-object word order, verb tense, aspect, and mood. Chapter 5 begins with an extensive list of Old Spanish nouns, adjectives, verbs, and function words that have not survived into the modern standard language. It then presents examples of coexisting variants (doublets) and changes of meaning, and finishes with an overview of the creation of neologisms in the medieval language through derivational morphology (prefixation, suffixation, compounding). The book concludes with an anthology composed of three extracts from Spanish prose texts, one each from the thirteenth, fourteenth, and fifteenth centuries. The extracts contain footnotes that highlight relevant morphological, syntactic, and lexical features, with cross references to the relevant sections in the body of the book.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Sabri Eyuboglu ◽  
Geoffrey Angus ◽  
Bhavik N. Patel ◽  
Anuj Pareek ◽  
Guido Davidzon ◽  
...  

Computational decision support systems could provide clinical value in whole-body FDG-PET/CT workflows. However, limited availability of labeled data combined with the large size of PET/CT imaging exams makes it challenging to apply existing supervised machine learning systems. Leveraging recent advancements in natural language processing, we describe a weak supervision framework that extracts imperfect, yet highly granular, regional abnormality labels from free-text radiology reports. Our framework automatically labels each region in a custom ontology of anatomical regions, providing a structured profile of the pathologies in each imaging exam. Using these generated labels, we then train an attention-based, multi-task CNN architecture to detect and estimate the location of abnormalities in whole-body scans. We demonstrate empirically that our multi-task representation is critical for strong performance on rare abnormalities with limited training data. The representation also contributes to more accurate mortality prediction from imaging data, suggesting the potential utility of our framework beyond abnormality detection and location estimation.


SAGE Open ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 215824402110299
Author(s):  
Cong Zhang ◽  
Rui Yuan

The differences in linguistic features between Chang Hen Ge (Ge) and Chang Hen Ge Zhuan (Zhuan) have rarely been mentioned in the relevant fields. Nevertheless, these differences can best highlight the specialness of poetry, for the two works were written contemporaneously by two friends on the same subject, in distinct styles. This article employs quantitative methods and indicators to provide empirical evidence for the specialness of Ge through comparisons between the two. The results show that, on the premise of expressing the same subject in different styles, Ge does have certain linguistic characteristics compared with Zhuan. Its particularity is reflected not only in fewer repeated characters and words but also in their richness, as well as in the use of more content words and fewer function words. Moreover, all of these characteristics have had a great influence on Ge's artistic level and dissemination. Through this study, we hope that our methods provide a new perspective and shed some light on this area.


2019 ◽  
Vol 4 (2) ◽  
pp. 84-95
Author(s):  
Nandang Rachmat

The basic meaning of the morphological aspect of Japanese is the opposition between the form -ru/-ta, which expresses the perfective, and -teiru/-teita, which expresses the imperfective. There are also perfect meanings that derive from the basic meaning of the -ta and -teiru/-teita forms. They refer to the fact that a certain result or effect of a previous activity remains at a certain point in time. In Indonesian, the function words sudah and telah, which are generally considered perfective markers, can often be the equivalent of perfect meanings in Japanese. Therefore, it is necessary to clarify the differences between perfect aspect meanings in both languages, mainly regarding the use of the words sudah and telah. This paper aims to explain perfect meanings in Japanese and Indonesian through the use of the -ta, -teiru, and -teita forms and the function words sudah and telah by contrastive analysis. The analysis showed that the perfect meanings cannot be fully matched with the use of sudah and telah. They are not interchangeable because of differences in aspectual, modal, and contextual meanings. Some of them are expressed without using sudah or telah at all. Sudah expresses ingressive aspect and refers to the result or effect of previous activities. As for modal meanings, sudah indicates two things: that the speaker holds predictions about a future event, and the speaker's attitude of providing information to the hearer. Telah expresses completive aspect. It does not refer to the effect of a previous activity; therefore it cannot function as taxis in the future perfect aspect.


2011 ◽  
pp. 2085-2095
Author(s):  
John P. Pestian ◽  
Lukasz Itert ◽  
Charlotte Andersen

Approximately 57 different types of clinical annotations make up a patient's medical record. These annotations include radiology reports, discharge summaries, and surgical and nursing notes. Hospitals typically produce millions of text-based medical records over the course of a year. These records are essential for the delivery of care, but many are underutilized or not utilized at all for clinical research. The textual data found in these annotations is a rich source of insights into aspects of clinical care and the clinical delivery system. Recent regulatory actions, however, require that, in many cases, data not obtained through informed consent or data not related to the delivery of care must be made anonymous (referred to by regulators as harmless) before they can be used. This article describes a practical approach with which Cincinnati Children's Hospital Medical Center (CCHMC), a large pediatric academic medical center with more than 761,000 annual patient encounters, developed open source software for making pediatric clinical text harmless without losing its rich meaning. Development of the software dealt with many of the issues that often arise in natural language processing, such as data collection, disambiguation, and data scrubbing.
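
To make the idea of data scrubbing concrete, here is a minimal pattern-based sketch. It is not the CCHMC software described in this abstract: the patterns, placeholder tags, and the `scrub` function are hypothetical and cover only a few rigidly formatted identifiers, whereas real de-identification also has to handle names, free-form dates, and disambiguation.

```python
import re

# Hypothetical patterns for illustration only; each maps a rigid
# identifier format to a placeholder tag.
PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),   # ISO-style dates
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),  # US-style phone numbers
    (re.compile(r"\bMRN[:\s]*\d+\b"), "[ID]"),          # record numbers
]

def scrub(text):
    # Apply each pattern in turn, replacing matches with its placeholder.
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

# Invented sample text, not taken from any real record.
sample = "Seen 2010-03-14, call 513-555-0100, MRN: 12345."
print(scrub(sample))  # Seen [DATE], call [PHONE], [ID].
```

Replacing identifiers with typed placeholders, instead of deleting them, keeps the sentence structure intact, which matters when the scrubbed text is later used for language analysis.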

