Data innovation for international development: An overview of natural language processing for qualitative data analysis

Author(s):  
Philipp Broniecki ◽  
Anna Hanchar
2021 ◽


Author(s):
Vishal Dey ◽  
Peter Krasniak ◽  
Minh Nguyen ◽  
Clara Lee ◽  
Xia Ning

BACKGROUND A new illness can come to public attention through social media before it is medically defined, formally documented, or systematically studied. One example is a condition known as breast implant illness (BII), which has been extensively discussed on social media, although it is vaguely defined in the medical literature. OBJECTIVE The objective of this study is to construct a data analysis pipeline to understand emerging illnesses using social media data and to apply the pipeline to understand the key attributes of BII. METHODS We constructed a pipeline of social media data analysis using natural language processing and topic modeling. Mentions related to signs, symptoms, diseases, disorders, and medical procedures were extracted from social media data using the clinical Text Analysis and Knowledge Extraction System. We mapped the mentions to standard medical concepts and then summarized these mapped concepts as topics using latent Dirichlet allocation. Finally, we applied this pipeline to understand BII from several BII-dedicated social media sites. RESULTS Our pipeline identified topics related to toxicity, cancer, and mental health issues that were highly associated with BII. Our pipeline also showed that cancers, autoimmune disorders, and mental health problems were emerging concerns associated with breast implants, based on social media discussions. Furthermore, the pipeline identified mentions such as rupture, infection, pain, and fatigue as common self-reported issues among the public, as well as concerns about toxicity from silicone implants. CONCLUSIONS Our study could inspire future studies on the suggested symptoms and factors of BII. Our study provides the first analysis and derived knowledge of BII from social media using natural language processing techniques and demonstrates the potential of using social media information to better understand similar emerging illnesses.


Author(s):  
Evrenii Polyakov ◽  
Leonid Voskov ◽  
Pavel Abramov ◽  
Sergey Polyakov

Introduction: Sentiment analysis is a complex problem whose solution essentially depends on the context, field of study, and amount of text data. Analysis of publications shows that authors often do not use the full range of possible data transformations and their combinations. Only a subset of the transformations is used, limiting the ways to develop high-quality classification models. Purpose: Developing and exploring a generalized approach to building a model, which consists of sequentially passing through the stages of exploratory data analysis, obtaining a basic solution, vectorization, preprocessing, hyperparameter optimization, and modeling. Results: Comparative experiments conducted using the generalized approach with classical machine learning and deep learning algorithms, in order to solve the problem of sentiment analysis of short text messages, demonstrated that classification quality grows from one stage to another. For classical algorithms, this increase in quality was insignificant, but for deep learning it was 8% on average at each stage. Additional studies showed that automatic machine learning using classical classification algorithms is comparable in quality to manual model development; however, it takes much longer. The use of transfer learning has a small but positive effect on classification quality. Practical relevance: The proposed sequential approach can significantly improve the quality of models under development in natural language processing problems.
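For the classical-algorithm case, the staged approach can be sketched with a scikit-learn pipeline: a baseline model, a vectorization stage, and a hyperparameter-optimization stage. The tiny corpus and the grid below are illustrative assumptions, not the paper's data or search space.

```python
# Minimal sketch of a staged classical baseline:
# vectorization -> model -> hyperparameter optimization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Illustrative short messages with binary sentiment labels.
texts = ["great product", "awful service", "love it", "terrible experience",
         "really great", "really awful", "love this service", "terrible product"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),   # vectorization stage
    ("clf", LogisticRegression(max_iter=1000)),   # basic solution
])

# Hyperparameter-optimization stage: grid search over a small space.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(texts, labels)

preds = grid.predict(["great service", "awful product"])
```

Each additional stage (preprocessing choices, richer vectorizers, larger grids) slots into the same pipeline object, which is what makes the quality gain per stage easy to measure.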


Author(s):  
Seonho Kim ◽  
Jungjoon Kim ◽  
Hong-Woo Chun

Interest in research involving health-medical information analysis based on artificial intelligence, especially deep learning techniques, has recently been increasing. Most of the research in this field has focused on discovering new knowledge for predicting and diagnosing disease by revealing the relations between diseases and various information features of the data. These features are extracted by analyzing various clinical pathology data, such as electronic health records (EHRs), and academic literature, using techniques such as data analysis and natural language processing. However, more research and interest are still needed in applying the latest advanced artificial intelligence-based data analysis techniques to bio-signal data, which are continuous physiological records such as EEG (electroencephalography) and ECG (electrocardiogram). Unlike other types of data, applying deep learning to bio-signal data, which take the form of real-valued time series, raises many issues that need to be resolved in preprocessing, learning, and analysis. Such issues include feature selection, the black-box nature of the learned models, difficulty in recognizing and identifying effective features, high computational complexity, etc. In this paper, to solve these issues, we provide an encoding-based Wave2vec time series classifier model, which combines signal processing and deep learning-based natural language processing techniques. To demonstrate its advantages, we provide the results of three experiments conducted with EEG data from the University of California Irvine, a real-world benchmark bio-signal dataset. Through encoding, the proposed model converts the bio-signals (in the form of waves), which are real-valued time series, into a sequence of symbols, or into a sequence of wavelet patterns that are then converted into symbols, and vectorizes the symbols by learning the sequence using deep learning-based natural language processing.
The models for each class can be constructed through learning from the vectorized wavelet patterns and training data, and the resulting models can be used for prediction and diagnosis of diseases by classifying new data. The proposed method enhances the readability of the data and makes feature selection and the learning process more intuitive by converting the real-valued time series into sequences of symbols. In addition, it facilitates intuitive and easy recognition and identification of influential patterns. Furthermore, by drastically reducing computational complexity without degrading analysis performance through encoding-based data simplification, it facilitates the real-time, large-capacity data analysis that is essential in the development of real-time diagnosis systems.
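The encoding step that turns a real-valued series into symbols can be illustrated with a SAX-style discretization. This is a generic sketch of the idea, not the paper's Wave2vec encoder: the breakpoints and four-letter alphabet below are standard illustrative choices.

```python
# Minimal sketch: z-normalize a real-valued series and discretize it
# into a symbol sequence, so NLP-style sequence models can be applied.
import numpy as np

def encode_series(series, alphabet="abcd"):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()              # z-normalize
    # Equiprobable breakpoints under a standard normal (4-symbol case).
    breakpoints = [-0.67, 0.0, 0.67]
    idx = np.searchsorted(breakpoints, x)     # bin each sample
    return "".join(alphabet[i] for i in idx)

symbols = encode_series([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0])
```

Once the signal is a string of symbols, it can be fed to sequence-learning models the same way text is, which is the core of the Wave2vec idea.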


Author(s):  
Jayashree Rajesh ◽  
Priya Chitti Babu

In the current machine-centric world, humans expect a lot from machines, right from waking us up. We expect them to do activities like alerting us to traffic, tracking our appointments, etc. The smart devices we carry are creating a constructive impact on our day-to-day lives. Few of us have thought about the communication between ourselves and these devices, or about the language we use for that communication. Natural language processing runs behind all these activities and currently plays a vital role in communication between humans and machines through virtual assistants like Alexa and Siri and search engines like Bing, Google, etc. This implies that we are talking with machines as if they were human. Advanced natural language processing techniques have drastically modified the way we discover and interact with data. Today, the same advanced techniques are primarily used for NLP-driven data analysis in business intelligence tools. This chapter elaborates the significance of natural language processing in business intelligence.


2021 ◽  
Vol 21 (S9) ◽  
Author(s):  
Na Hong ◽  
Fengxiang Chang ◽  
Zhengjie Ou ◽  
Yishang Wang ◽  
Yating Yang ◽  
...  

Abstract Background We aimed to build a common terminology in the domain of cervical cancer, named the Cervical Cancer Common Terminology (CCCT), that will facilitate clinical data exchange, ensure data quality, and support large-scale data analysis. Methods The standard concepts and relations of CCCT were collected from the ICD-10-CM Chinese Version, the ICD-9-PC Chinese Version, officially issued commonly used Chinese clinical terms, Chinese guidelines for the diagnosis and treatment of cervical cancer, and the Chinese medical book Lin Qiaozhi Gynecologic Oncology. A total of 2062 cervical cancer electronic medical records (EMRs) from 16 hospitals of different regions and tiers were collected for terminology enrichment and for building common terms and relations. Concept hierarchies, terms, and relationships were built using Protégé. The performance of the natural language processing results was evaluated by average precision, recall, and F1-score. The usability of CCCT was evaluated by terminology coverage. Results A total of 880 standard concepts, 1182 common terms, 16 relations, and 6 attributes were defined in CCCT, organized in 6 levels and 11 classes. Initial evaluation of the natural language processing results demonstrated average precision, recall, and F1-score percentages of 96%, 72.6%, and 88.5%. The average terminology coverage for three classes of terms, clinical manifestation, treatment, and pathology, was 87.22%, 92.63%, and 89.85%, respectively. Flexible Chinese expressions across regions, traditions, cultures, and language habits within the country, linguistic variations in different settings, and diverse translations of introduced Western terms are the main reasons for uncovered terms. Conclusions Our study demonstrated the initial results of CCCT construction.
This study is ongoing work: as medical knowledge is updated, more standard clinical concepts will be added, and as more EMRs are collected and analyzed, term coverage will continue to improve. In the future, CCCT will effectively support large-scale clinical data analysis.
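The terminology-coverage evaluation reduces to a simple proportion: of the terms observed in EMR text, what fraction is covered by the terminology. A minimal sketch, with placeholder terms that are not CCCT entries:

```python
# Minimal sketch of terminology coverage: fraction of observed EMR
# terms that appear in the terminology.
def term_coverage(terminology, emr_terms):
    covered = sum(1 for t in emr_terms if t in terminology)
    return covered / len(emr_terms)

# Illustrative placeholders, not CCCT content.
terminology = {"cervical cancer", "radical hysterectomy", "brachytherapy"}
emr_terms = ["cervical cancer", "brachytherapy",
             "vaginal bleeding", "radical hysterectomy"]

coverage = term_coverage(terminology, emr_terms)  # 3 of 4 terms covered
```

In practice the lookup would normalize term variants first, since, as the abstract notes, flexible expressions and translation variants are the main source of uncovered terms.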


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Aditi Gupta ◽  
Albert Lai ◽  
Jessica Mozersky ◽  
Xiaoteng Ma ◽  
Heidi Walsh ◽  
...  

Abstract Objective Sharing health research data is essential for accelerating the translation of research into actionable knowledge that can impact health care services and outcomes. Qualitative health research data are rarely shared due to the challenge of deidentifying text and the potential risks of participant reidentification. Here, we establish and evaluate a framework for deidentifying qualitative research data using automated computational techniques including removal of identifiers that are not considered HIPAA Safe Harbor (HSH) identifiers but are likely to be found in unstructured qualitative data. Materials and Methods We developed and validated a pipeline for deidentifying qualitative research data using automated computational techniques. An in-depth analysis and qualitative review of different types of qualitative health research data were conducted to inform and evaluate the development of a natural language processing (NLP) pipeline using named-entity recognition, pattern matching, dictionary, and regular expression methods to deidentify qualitative texts. Results We collected 2 datasets with 1.2 million words derived from over 400 qualitative research data documents. We created a gold-standard dataset with 280K words (70 files) to evaluate our deidentification pipeline. The majority of identifiers in qualitative data are non-HSH and not captured by existing systems. Our NLP deidentification pipeline had a consistent F1-score of ∼0.90 for both datasets. Conclusion The results of this study demonstrate that NLP methods can be used to identify both HSH identifiers and non-HSH identifiers. Automated tools to assist researchers with the deidentification of qualitative data will be increasingly important given the new National Institutes of Health (NIH) data-sharing mandate.
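Two of the component methods named above, pattern matching with regular expressions and dictionary lookup, can be sketched in a few lines. This is a generic illustration, not the study's pipeline; the patterns and names below are hypothetical.

```python
# Minimal sketch of regex + dictionary deidentification: structured
# identifiers (phone numbers, dates) are caught by patterns, known
# participant names by a dictionary.
import re

NAME_DICT = {"Jane Doe", "John Smith"}  # hypothetical name dictionary

PATTERNS = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
]

def deidentify(text):
    for name in NAME_DICT:                 # dictionary method
        text = text.replace(name, "[NAME]")
    for pattern, tag in PATTERNS:          # pattern-matching method
        text = pattern.sub(tag, text)
    return text

clean = deidentify("Jane Doe called 555-123-4567 on 3/14/2021.")
```

A full pipeline like the one described would layer named-entity recognition on top of this to catch the non-HSH identifiers (places of work, relatives' names, and so on) that fixed patterns miss.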


2019 ◽  
Vol 18 ◽  
pp. 160940691988702 ◽  
Author(s):  
William Leeson ◽  
Adam Resnick ◽  
Daniel Alexander ◽  
John Rovers

Qualitative data-analysis methods provide thick, rich descriptions of subjects’ thoughts, feelings, and lived experiences but may be time-consuming, labor-intensive, or prone to bias. Natural language processing (NLP) is a machine learning technique from computer science that uses algorithms to analyze textual data. NLP allows processing of large amounts of data almost instantaneously. As researchers become conversant with NLP, it is increasingly employed outside of computer science and shows promise as a tool to analyze qualitative data in public health. This is a proof-of-concept paper evaluating the potential of NLP to analyze qualitative data. Specifically, we ask whether NLP can support conventional qualitative analysis, and if so, what its role is. We compared a qualitative method, open coding, with two forms of NLP, Topic Modeling and Word2Vec, to analyze transcripts from interviews conducted in rural Belize querying men about their health needs. All three methods returned a series of terms that captured ideas and concepts in subjects’ responses to interview questions. Open coding returned 5–10 words or short phrases for each question. Topic Modeling returned a series of word-probability pairs that quantified how well a word captured the topic of a response. Word2Vec returned a list of words for each interview question, ordered by how well each word was predicted to capture the meaning of the passage. For most interview questions, all three methods returned conceptually similar results. NLP may be a useful adjunct to qualitative analysis. NLP may be performed after data have undergone open coding as a check on the accuracy of the codes. Alternatively, researchers can perform NLP prior to open coding and use the results to guide the creation of their codebook.
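Word2Vec itself requires a training corpus and a library such as gensim; the distributional idea behind it, that words sharing contexts get similar vectors, can be sketched with plain co-occurrence counts. The toy interview snippets below are illustrative assumptions, not the Belize transcripts.

```python
# Minimal stand-in for the Word2Vec idea: build co-occurrence vectors
# and find each word's nearest neighbor by cosine similarity.
import numpy as np

# Illustrative response fragments, not the study's transcripts.
corpus = [
    "clinic distance travel cost",
    "travel cost clinic distance",
    "diet exercise diabetes",
    "diabetes diet exercise",
]

vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of words appears in the same response.
vectors = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for a in words:
        for b in words:
            if a != b:
                vectors[index[a], index[b]] += 1

def nearest(word):
    v = vectors[index[word]]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1)
                          * np.linalg.norm(v) + 1e-9)
    sims[index[word]] = -1          # exclude the word itself
    return vocab[int(sims.argmax())]
```

Real Word2Vec learns dense vectors from sliding context windows rather than whole-response counts, but the nearest-neighbor lookup that surfaces related concepts works the same way.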

