Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics

Author(s):  
Hannes Seuss ◽  
Peter Dankerl ◽  
Matthias Ihle ◽  
Andrea Grandjean ◽  
Rebecca Hammon ◽  
...  

Purpose: Projects involving collaborations between different institutions require data security via selective de-identification of words or phrases. A semi-automated de-identification tool was developed and evaluated on different types of medical reports, both natively and after adapting the algorithm to the text structure. Materials and Methods: The tool was evaluated for its sensitivity and specificity in detecting sensitive content in written reports. Data from 4671 pathology reports (4105 + 566 in two different formats), 2804 medical reports, 1008 operation reports, and 6223 radiology reports of 1167 patients suffering from breast cancer were de-identified. The content was itemized into four categories: direct identifiers (name, address), indirect identifiers (date of birth/operation, medical ID, etc.), medical terms, and filler words. The software was first tested natively (without training) to establish a baseline. The reports were then manually edited and the model was re-trained for the next test set; re-training was applied after manually editing 25, 50, 100, 250, 500 and, if applicable, 1000 reports of each type. Results: In the native test, 61.3 % of direct and 80.8 % of indirect identifiers were detected. The performance (P) increased to 91.4 % (P25), 96.7 % (P50), 99.5 % (P100), 99.6 % (P250), 99.7 % (P500) and 100 % (P1000) for direct identifiers and to 93.2 % (P25), 97.9 % (P50), 97.2 % (P100), 98.9 % (P250), 99.0 % (P500) and 99.3 % (P1000) for indirect identifiers. Without training, 5.3 % of medical terms were falsely flagged as critical data. After training, this false-flag rate was 4.0 % (P25), 3.6 % (P50), 4.0 % (P100), 3.7 % (P250), 4.3 % (P500), and 3.1 % (P1000). Roughly 0.1 % of filler words were falsely flagged. Conclusion: Training of the developed de-identification tool continuously improved its performance. Training with roughly 100 edited reports enables reliable detection and labeling of sensitive data in different types of medical reports.
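The per-category percentages reported above can be read as simple ratios of flagged words to all words in a category. The following is a minimal sketch of that bookkeeping, not the authors' tool: the `(category, flagged)` pair layout, the function name `evaluate`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def evaluate(tokens):
    """Compute the share of flagged words per gold-standard category.

    tokens: iterable of (category, was_flagged) pairs, one per word.
    For "direct" and "indirect" the share is the detection rate
    (sensitivity); for "medical" and "filler" it is the false-flag rate.
    """
    total = Counter()
    flagged = Counter()
    for category, was_flagged in tokens:
        total[category] += 1
        if was_flagged:
            flagged[category] += 1
    return {category: flagged[category] / total[category] for category in total}


# Toy, hand-made annotations (hypothetical, not data from the study).
sample = [
    ("direct", True), ("direct", True), ("direct", False), ("direct", True),
    ("indirect", True), ("indirect", True),
    ("medical", False), ("medical", True), ("medical", False), ("medical", False),
    ("filler", False), ("filler", False),
]
rates = evaluate(sample)
```

Re-running such a tally after each re-training round (P25, P50, and so on) would give one row of rates per round, which is the shape of the results listed in the abstract.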

2021 ◽  
pp. 172-178
Author(s):  
 Дмитрий Вадимович Любимов

Alexander Mikhailovich Terekhin's book Madness in Musical Theater: Opera, Ballet, published in 2020 and reviewed here, is unique in many respects. First, the author himself is of interest: Terekhin's professional and creative work combines medicine and music. A psychiatrist with forty years of experience, he also worked for more than a quarter of a century at the Mariinsky Theater as a member of the mime ensemble. Second, madness (insanity) has not yet been an independent subject of study in Russian musicology. Third, the book's originality lies in its medical conclusions. The doctor focuses on the texts of opera and ballet librettos, from which he identifies the causes of derangement and assigns various diagnoses to characters of the musical theater. In examining specific clinical cases, Terekhin uses professional medical terminology. The diagnoses include a reactive paranoid state in Lucia (Lucia di Lammermoor by G. Donizetti), schizophrenia in the Miller (Rusalka by A. S. Dargomyzhsky), and intoxication (hashish) psychosis in Solor (La Bayadère by L. Minkus). Without claiming to cover the theme of madness (insanity) comprehensively, Terekhin's book opens up to musicologists new facets of an interdisciplinary approach to the study of the opera and ballet repertoire.


Author(s):  
Pijush Kanti Dutta Pramanik ◽  
Saurabh Pal ◽  
Moutan Mukhopadhyay

Like other fields, the healthcare sector has been greatly impacted by big data. A huge volume of healthcare data and other related data is continually generated from diverse sources. Tapping and analysing these data suitably would open up new avenues and opportunities for healthcare services. In view of that, this paper aims to present a systematic overview of big data and big data analytics applicable to modern-day healthcare. Acknowledging the massive upsurge in healthcare data generation, various 'V's specific to healthcare big data are identified. Different types of data analytics applicable to healthcare are discussed. Along with presenting the technological backbone of healthcare big data and analytics, the advantages and challenges of healthcare big data are explained. A brief report on the present and future market of healthcare big data and analytics is also presented. In addition, several applications and use cases are discussed in detail.


2021 ◽  
Author(s):  
Kristy Moniz

Despite the proliferation of multimedia software technologies, radiology reports continue to lack image content that would improve the ability of referring clinicians to fully interpret and analyze radiological findings. This thesis demonstrates that it is possible to construct a radiology reporting software system that contains both text and image content using only "off-the-shelf" multimedia software. Specifically, a software system is presented that provides enhanced visual multimedia capabilities, structured content, and reduced report production time, using a well-known PDF program, Adobe Acrobat. The system, which we call the Multimedia Radiology Report System, or MaRRS, allows radiologists to quickly and simply create and deliver effective interactive multimedia medical reports. A detailed analysis describing the unique structure and functionality of MaRRS will be presented to demonstrate its advantages for both radiologists and referring clinicians.


2015 ◽  
Vol 16 (1) ◽  
pp. 92-107
Author(s):  
Milan Orlić

In this paper I analyze two of Pekić’s novels in the light of Bakhtin’s concept of the open text of the polyphonic novel which Pekić develops by means of a new Narrator Figure and a new poetics based on an encyclopedic embedded text structure. Among several literary techniques developed from the beginnings of Pekić’s writing, crucial importance belongs to what I call the Explicit Narrator Figure (for instance, in The Time of Miracles, 1965), who speaks in his own voice as interpreter of found texts, and the Implicit Narrator Figure, who adopts the literary and non-literary voices of (many) others, to whose diction and style he assimilates his own voice (for example, in Pilgrimage of Arsenije Njegovan, 1970). This new (postmodern) narrator figure, both explicit and implicit, acts as an interpreter of «found» texts. What connects these two types of Narrator Figures is the document and related Embedded Narration: both narrators thus deal with the pre-texts as well as texts-in-texts, levels and layers of texts, proto-texts and meta-texts – various types of Framed/Embedded Narratives. The Implicit Narrator Figure deals with Biblical witnessed texts and the Explicit Narrator Figure uses personal testamentary texts. In such a way, both Implicit and Explicit Narrator Figures become the researchers of different types of literary and non-literary documents. These complex inter-textual explorations of the “library” of culture are “encyclopedic” in magnitude and reveal, in combination with the new Narrator Figure’s status as Editor and Interpreter, a new type of narrative text, constituted in the encyclopedic open novel structure. Pekić thus introduces a new form of inter-textuality into Serbian literature, implicitly extending Bakhtin’s (and Dostoevsky’s) legacy by drawing on the Serbian national literary canon and the entire Western cultural “library”.


2014 ◽  
Vol 6 (4) ◽  
pp. 332-340 ◽  
Author(s):  
Deepak Agrawal

Purpose – This paper aims to trace the history, application areas and users of Classical Analytics and Big Data Analytics. Design/methodology/approach – The paper discusses different types of Classical and Big Data Analytical techniques and application areas from the early days to the present day. Findings – Businesses can benefit from a deeper understanding of Classical and Big Data Analytics to make better and more informed decisions. Originality/value – This is a historical perspective from the early days of analytics to the present-day use of analytics.


2008 ◽  
Vol 47 (06) ◽  
pp. 513-521 ◽  
Author(s):  
S. Terae ◽  
M. Uesugi ◽  
K. Ogasawara ◽  
T. Sakurai ◽  
N. Nishimoto

Summary Objectives: The objectives of this study were to investigate the transitional probability distribution of medical term boundaries between characters and to develop a parsing algorithm specifically for medical texts. Methods: Medical terms in Japanese computed tomography (CT) reports were identified using the ChaSen morphological analysis system. MeSH-based medical terms (51,385 entries), obtained from the metathesaurus in the Unified Medical Language System (UMLS, 2005AA), were added as a medical dictionary for ChaSen. A radiographer corrected the set of results containing 300 parsed CT reports. In addition, two radiologists checked the medical term parsing of 200 CT sentences. Results: We obtained modified inter-annotator agreement scores for the text corrected by the radiologists. We retrieved the transitional probability as the conditional probability of a uni-gram, bi-gram, and tri-gram. The highest transitional probability $P(C_i \mid C_{i-2} C_{i-1})$ was 1.00. For an example of anatomical location, the term “pulmonary hilum” was parsed as a tri-gram. Conclusions: Retrieval of transitional probability will improve the accuracy of parsing compound medical terms.
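A conditional probability of the tri-gram kind mentioned above can be estimated by relative frequency: count how often a character follows a given two-character context and divide by how often that context is followed by anything. The sketch below illustrates only this general idea on a toy Latin-alphabet corpus; it is not the authors' algorithm, and the function name `transition_probs` and the corpus are hypothetical.

```python
from collections import Counter


def transition_probs(texts, n=3):
    """Estimate P(last char | preceding n-1 chars) by relative frequency.

    Returns a dict mapping each observed n-gram to its conditional
    probability given its (n-1)-character context.
    """
    context_counts = Counter()
    ngram_counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            ngram_counts[gram] += 1
            # Count the context only where a following character exists.
            context_counts[gram[:-1]] += 1
    return {g: c / context_counts[g[:-1]] for g, c in ngram_counts.items()}


# Toy corpus (hypothetical); real input would be characters of report text.
corpus = ["abcab", "abd"]
probs = transition_probs(corpus, n=3)
```

In this toy corpus the context "ab" is followed once by "c" and once by "d", so both tri-grams get probability 0.5, while "bca" and "cab" each get 1.0 because their contexts have a single continuation.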


2014 ◽  
Vol 32 (30_suppl) ◽  
pp. 164-164 ◽  
Author(s):  
Lauren P. Wallner ◽  
Julia R. Dibello ◽  
Bonnie H. Li ◽  
Chengyi Zheng ◽  
Wei Yu ◽  
...  

Background: Prostate cancer patients who develop metastases are a difficult population to identify through administrative diagnostic codes, due to their protracted time to metastases, limited survival and the inconsistent use of specific codes. As a result, research that is needed to inform the delivery of high-quality care in this setting is limited. Therefore, the goal of this study was to develop an algorithm that uses natural language processing (NLP) on EMR data to identify men who progress to metastatic prostate cancer after diagnosis. Methods: An electronic algorithm was developed to search unstructured text using NLP to identify progression to metastases among men with a diagnosis of prostate cancer between 1992 and 2010 in a large, diverse cohort of men who were part of an ongoing study focused on prostate cancer mortality. A training set of 449 men who were diagnosed with early-stage prostate cancer was used for development. Pathology, radiology and clinic notes were searched from diagnosis until death or loss to follow-up. Pathology reports were searched for mention of adenocarcinoma in the metastatic lesion, radiology reports were searched for abnormal findings consistent with metastases, and clinic notes were searched for mentions of increasing pain or narcotic use related to metastases. Each NLP component was validated against manual review of the corresponding records. Results: Of the 449 men in the training set, 40 (8.9%) were found to have metastatic prostate cancer. The majority of cases had evidence of metastases in their clinic notes (98%). Radiology reports identified 18% of cases, and pathology reports identified 5%. Of the 40 cases identified, 25% did not have a corresponding ICD-9 code for metastatic cancer. However, 7.5% used ADT, 37.5% had increasing oncology visits and 22.5% had rapidly rising PSA levels. Conclusions: Our results suggest that NLP can be used to identify men with metastatic prostate cancer in the EMR more accurately than diagnosis codes alone. The automated identification of patients with metastatic cancer facilitates quality of care research in this setting to ensure the delivery of appropriate and high-quality care.
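The abstract does not give the actual search rules, so the following is only a rough sketch of how a per-note-type text search might be organised. Every pattern, the `PATTERNS` table, and the function `flag_patient` are assumptions for illustration; a real system would also need to handle negation ("no evidence of metastases"), which this sketch deliberately ignores.

```python
import re

# Hypothetical patterns, one per note type; not the study's rules.
PATTERNS = {
    "pathology": re.compile(
        r"\bmetasta\w*\b.*\badenocarcinoma\b|\badenocarcinoma\b.*\bmetasta\w*\b",
        re.IGNORECASE,
    ),
    "radiology": re.compile(
        r"\b(consistent with|suspicious for)\s+metasta\w*", re.IGNORECASE
    ),
    "clinic": re.compile(
        r"\b(increasing|worsening)\s+(bone\s+)?pain\b", re.IGNORECASE
    ),
}


def flag_patient(notes):
    """Return the set of note types in which a pattern matched.

    notes: list of (note_type, text) pairs for a single patient.
    """
    hits = set()
    for note_type, text in notes:
        pattern = PATTERNS.get(note_type)
        if pattern and pattern.search(text):
            hits.add(note_type)
    return hits


# Made-up example notes (not patient data).
example_notes = [
    ("radiology", "Sclerotic lesions consistent with metastases."),
    ("clinic", "Patient reports increasing bone pain."),
    ("pathology", "Benign prostatic tissue."),
]
example_hits = flag_patient(example_notes)
```

Keeping the result as a set of note types makes it easy to report, as the abstract does, which source (clinic, radiology, or pathology) contributed evidence for each identified case.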


1995 ◽  
Vol 34 (04) ◽  
pp. 352-360 ◽  
Author(s):  
J. F. Smart ◽  
M. Roux

Abstract: A new knowledge-representation system is presented, designed for medical knowledge-based applications and in particular for the analysis of descriptive medical reports. Knowledge is represented at two levels. A definitional level uses a concept-type hierarchy, a relation-type hierarchy, and a set of schematic graphs to define the concepts used and the relations between them, as well as different types of cardinality restrictions on these relations. A set of compositional hierarchies using the classic “has-part” relation as well as a new set-inclusion relation allows concept composition to be precisely defined. An assertional level allows the creation and manipulation of empirical data, in the form of graphs using the concepts, relations, and constraints defined at the definitional level. The use of cardinality constraints in graph unification is considered in the context of descriptive medical discourse analysis.
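To make the two-level idea concrete, the sketch below pairs a toy concept-type hierarchy (definitional level) with a check of one asserted relation against type and cardinality limits (assertional level). It is an illustrative simplification under assumed names (`IS_A`, `RelationType`, `HAS_PART`, `check`) and does not reproduce the system described in the abstract, which also covers relation-type hierarchies, schematic graphs, and graph unification.

```python
from dataclasses import dataclass

# Toy concept-type hierarchy, child -> parent. All names are hypothetical.
IS_A = {"Lung": "Organ", "Organ": "AnatomicalEntity", "Lobe": "AnatomicalEntity"}


def is_subtype(concept, ancestor):
    """Walk up the hierarchy to test whether concept is a kind of ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


@dataclass(frozen=True)
class RelationType:
    name: str
    source_type: str  # required type of the source concept
    target_type: str  # required type of each target concept
    min_count: int    # cardinality lower bound
    max_count: int    # cardinality upper bound


# Hypothetical definition: an Organ has between 1 and 5 Lobe parts.
HAS_PART = RelationType("has-part", "Organ", "Lobe", 1, 5)


def check(relation, source, targets):
    """Validate one assertion against the relation's types and cardinality."""
    typed = is_subtype(source, relation.source_type) and all(
        is_subtype(t, relation.target_type) for t in targets
    )
    return typed and relation.min_count <= len(targets) <= relation.max_count
```

For example, `check(HAS_PART, "Lung", ["Lobe", "Lobe"])` passes because "Lung" is a kind of "Organ" and two parts fall within the bounds, whereas an empty target list or a source that is not an "Organ" is rejected.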


2016 ◽  
Vol 141 (3) ◽  
pp. 418-422 ◽  
Author(s):  
Andrew A. Renshaw ◽  
Edwin W. Gould

Context.— The College of American Pathologists requires synoptic reports for specific types of pathology reports. Objective.— To compare the accuracy and speed of information retrieval in synoptic reports of different formats. Design.— We assessed the performance of 28 nonpathologists from 4 different types of users (cancer registrars, MDs, medical non–MDs, and nonmedical) at identifying specific information in various formatted synoptic reports, using a computerized quiz that measured both accuracy and speed. Results.— There was no significant difference in the accuracy of data identification for any user group or in any format. While there were significant differences in raw time between users, these were eliminated when normalized times were used. Compared with the standard format of a required data element (RDE) and response on 1 line, both a list of responses without an RDE (21%, P < .001) and a paired response with more concise text (33%, P < .001) were significantly faster. In contrast, both the 2-line format (RDE header on one line, response indented on the second line) (12%, P < .001) and a report with the RDE response pairs in a random order were significantly slower (16%, P < .001). Conclusions.— There are significant differences in ease of use by nonpathologists between different synoptic report formats. Such information may be useful in deciding between different format options.

