Privacy Preserving Text Data Encoding and Topic Modelling

Author(s):  
Dinusha Vatsalan ◽  
Raghav Bhaskar ◽  
Aris Gkoulalas-Divanis ◽  
Dimitrios Karapiperis

Crime Science ◽  
2020 ◽  
Vol 9 (1) ◽  
Author(s):  
Daniel Birks ◽  
Alex Coleman ◽  
David Jackson

Abstract: We present a novel exploratory application of unsupervised machine-learning methods to identify clusters of specific crime problems from unstructured modus operandi free-text data within a single administrative crime classification. To illustrate our proposed approach, we analyse police recorded free-text narrative descriptions of residential burglaries occurring over a two-year period in a major metropolitan area of the UK. Results of our analyses demonstrate that topic modelling algorithms are capable of clustering substantively different burglary problems without prior knowledge of such groupings. Subsequently, we describe a prototype dashboard that allows replication of our analytical workflow and could be applied to support operational decision making in the identification of specific crime problems. This approach to grouping distinct types of offences within existing offence categories, we argue, has the potential to support crime analysts in proactively analysing large volumes of modus operandi free-text data, with the ultimate aims of developing a greater understanding of crime problems and supporting the design of tailored crime reduction interventions.


2021 ◽  
Vol 28 (1) ◽  
pp. e100274
Author(s):  
Paul Fairie ◽  
Zilong Zhang ◽  
Adam G D'Souza ◽  
Tara Walsh ◽  
Hude Quan ◽  
...  

Objectives: Patient feedback is critical to identify and resolve patient safety and experience issues in healthcare systems. However, large volumes of unstructured text data can pose problems for manual (human) analysis. This study reports the results of using a semiautomated, computational topic-modelling approach to analyse a corpus of patient feedback. Methods: Patient concerns were received by Alberta Health Services between 2011 and 2018 (n=76 163), regarding 806 care facilities in 163 municipalities, including hospitals, clinics, community care centres and retirement homes, in a province of 4.4 million. Their existing framework requires manual labelling of pre-defined categories. We applied an automated latent Dirichlet allocation (LDA)-based topic modelling algorithm to identify the topics present in these concerns, and thereby produce a framework-free categorisation. Results: The LDA model produced 40 topics which, following manual interpretation by researchers, were reduced to 28 coherent topics. The most frequent topics identified were communication issues causing delays (frequency: 10.58%), community care for elderly patients (8.82%), interactions with nurses (8.80%) and emergency department care (7.52%). Many patient concerns were categorised into multiple topics. Some were more specific versions of categories from the existing framework (eg, communication issues causing delays), while others were novel (eg, smoking in inappropriate settings). Discussion: LDA-generated topics were more nuanced than the manually labelled categories. For example, LDA found that concerns with community care were related to concerns about nursing for seniors, providing opportunities for insight and action. Conclusion: Our findings outline the range of concerns patients share in a large health system and demonstrate the usefulness of using LDA to identify categories of patient concerns.


2020 ◽  
Author(s):  
Daniel Birks ◽  
Alex Coleman ◽  
David Jackson

We present a novel exploratory application of unsupervised machine-learning methods to identify clusters of specific crime problems from unstructured modus operandi free-text data within a single administrative crime classification. To illustrate our proposed approach, we analyse police recorded free-text narrative descriptions of residential burglaries occurring over a two-year period in a major metropolitan area of the UK. Results of our analyses demonstrate that topic modelling algorithms are capable of clustering substantively different burglary problems without prior knowledge of such groupings. Subsequently, we describe a prototype dashboard that allows replication of our analytical workflow and could be applied to support operational decision making in the identification of specific crime problems. This approach to grouping distinct types of offences within existing offence categories, we argue, has the potential to support crime analysts in proactively analysing large volumes of modus operandi free-text data – with the ultimate aims of developing a greater understanding of crime problems and supporting the design of tailored crime reduction interventions.


2021 ◽  
Author(s):  
Liam Wright ◽  
Meg E Fluharty ◽  
Andrew Steptoe ◽  
Daisy Fancourt

Background: The COVID-19 pandemic has had substantial impacts on lives across the globe. Job losses have been widespread, and individuals have experienced significant restrictions on their usual activities, including extended isolation from family and friends. While studies suggest population mental health worsened from before the pandemic, not all individuals appear to have experienced poorer mental health. This raises the question of how people managed to cope during the pandemic. Methods: To understand the coping strategies individuals employed during the COVID-19 pandemic, we used structural topic modelling, a text mining technique, to extract themes from free-text data on coping from over 11,000 UK adults, collected between 14 October and 26 November 2020. Results: We identified 16 topics. The most discussed coping strategy was 'thinking positively' and involved themes of gratefulness and positivity. Other strategies included engaging in activities and hobbies (such as doing DIY, exercising, walking and spending time in nature), keeping routines, and focusing on one day at a time. Some participants reported more avoidant coping strategies, such as drinking alcohol and binge eating. Coping strategies varied by respondent characteristics including age, personality traits and sociodemographic characteristics and some coping strategies, such as engaging in creative activities, were associated with more positive lockdown experiences. Conclusion: A variety of coping strategies were employed by individuals during the COVID-19 pandemic. The coping strategy an individual adopted was related to their overall lockdown experiences. This may be useful for helping individuals prepare for future lockdowns or other events resulting in self-isolation.


1996 ◽  
Vol 35 (02) ◽  
pp. 108-111 ◽  
Author(s):  
F. Puerner ◽  
H. Soltanian ◽  
J. H. Hohnloser

Abstract: Data are presented on the use of a browsing and encoding utility to improve coded data entry for an electronic patient record system. Traditional and computerized discharge summaries were compared during three phases of coding ICD-9 diagnoses: phase I, no coding; phase II, manual coding; and phase III, computerized semiautomatic coding. Our data indicate that (1) only 50% of all diagnoses in a discharge summary are encoded manually; (2) using a computerized browsing and encoding utility, this percentage may increase by 64%; (3) when forced to encode manually, users may “shift” as much as 84% of relevant diagnoses from the appropriate coding section to other sections, thereby “bypassing” the need to encode, and this was reduced by up to 41% with the computerized approach; and (4) computerized encoding can improve completeness of data encoding from 46 to 100%. We conclude that the use of a computerized browsing and encoding tool can increase data quality and the percentage of documented data. Mechanisms bypassing the need to code can be avoided.


1976 ◽  
Vol 15 (01) ◽  
pp. 21-28 ◽  
Author(s):  
Carmen A. Scudiero ◽  
Ruth L. Wong

A free text data collection system has been developed at the University of Illinois utilizing single-word, syntax-free dictionary lookup to process data for retrieval. The source document for the system is the Surgical Pathology Request and Report form. To date 12,653 documents have been entered into the system. The free text data was used to create an IRS (Information Retrieval System) database. A program to interrogate this database has been developed to numerically code operative procedures. A total of 16,519 procedure records were generated. One and nine tenths percent of the procedures could not be fitted into any procedure category; 6.1% could not be specifically coded, while 92% were coded into specific categories. A system of PL/1 programs has been developed to facilitate manual editing of these records, which can be performed in a reasonable length of time (1 week). This manual check reveals that these 92% were coded with precision = 0.931 and recall = 0.924. Correction of the readily correctable errors could improve these figures to precision = 0.977 and recall = 0.987. Syntax errors were relatively unimportant in the overall coding process, but did introduce significant error in some categories, such as when right-left-bilateral distinction was attempted. The coded file that has been constructed will be used as an input file to a gynecological disease/PAP smear correlation system. The outputs of this system will include retrospective information on the natural history of selected diseases and a patient log providing information to the clinician on patient follow-up. Thus a free text data collection system can be utilized to produce numerically coded files of reasonable accuracy. Further, these files can be used as a source of useful information both for the clinician and for the medical researcher.
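
The precision and recall figures quoted in the abstract above can be illustrated with a small sketch. The abstract does not say how the two measures were computed, so the counting scheme below (one assigned code per record, compared against one manually verified code) is an assumption, and the function name and sample codes are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    assigned: Sequence[Optional[str]],
    verified: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Return (precision, recall) for automatically assigned codes.

    assigned[i] is the code the system gave record i (None = no code given).
    verified[i] is the manually checked code (None = no code applies).
    Assumption: a wrong code counts as one false positive (for the code
    that was assigned) and one false negative (for the code that was missed).
    """
    if len(assigned) != len(verified):
        raise ValueError("both sequences must have the same length")

    true_pos = false_pos = false_neg = 0
    for got, want in zip(assigned, verified):
        if got is not None and got == want:
            true_pos += 1
        else:
            if got is not None:
                false_pos += 1  # a code was assigned, but it was wrong
            if want is not None:
                false_neg += 1  # the correct code was not assigned

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical toy data: five records, three coded correctly,
# one left uncoded and one given the wrong code.
system_codes = ["A", "B", None, "C", "A"]
manual_codes = ["A", "B", "C", "D", "A"]
print(precision_recall(system_codes, manual_codes))
```

Under this scheme, precision is the share of assigned codes that are correct and recall is the share of correct codes that the system found, which is the usual reading of the two figures.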


Author(s):  
I. G. Zakharova ◽  
Yu. V. Boganyuk ◽  
M. S. Vorobyova ◽  
E. A. Pavlova

The goal of this article is to demonstrate the possibilities of an approach to diagnosing the level of IT graduates’ professional competence, based on analysis of the student’s digital footprint and the content of the corresponding educational program. We describe methods for extracting indicators of a student’s professional level from digital footprint text data: course descriptions and graduation qualification works. We show methods for comparing these indicators with the formalized requirements of employers, as reflected in the texts of vacancies in the field of information technology. The proposed approach was applied at the Institute of Mathematics and Computer Science of the University of Tyumen. We performed diagnostics using a data set that included texts of course descriptions for IT areas of undergraduate studies, 542 graduation qualification works in these areas, 879 descriptions of job requirements and information on graduate employment. The presented approach allows us to evaluate the relevance of the educational program as a whole and the level of professional competence of each student based on objective data. The results were used to update the content of some major courses and to include new elective courses in the curriculum.


Author(s):  
Htay Htay Win ◽  
Aye Thida Myint ◽  
Mi Cho Cho

For years, researchers have made their achievements and discoveries known through research papers published in appropriate journals or conferences. Often, established researchers, and especially new ones, face the predicament of choosing an appropriate conference for their work. Every scientific conference and journal is inclined towards a particular field of research, and there is an extensive group of them for any particular field. Choosing an appropriate venue is important, as it helps in reaching the right audience and improves one’s chance of getting a paper published. In this work, we address the problem of recommending appropriate conferences to authors to increase their chances of acceptance. We present three different approaches to this problem, involving the use of the authors’ social network and the content of the paper in the settings of dimensionality reduction and topic modelling. In all these approaches, we apply Correspondence Analysis (CA) to obtain appropriate relationships between the entities in question, such as conferences and papers. Our models show promising results when compared with existing methods such as content-based filtering, collaborative filtering and hybrid filtering.


2012 ◽  
Vol 3 (3) ◽  
pp. 60-61
Author(s):  
V. Sajeev ◽  
R. Gowthamani
