Open-Ended Questions

Author(s):  
Subhadra Dutta ◽  
Eric M. O’Rourke

Natural language processing (NLP) is the field concerned with computationally interpreting human written language. This chapter responds to the growing interest in using machine learning–based NLP approaches for analyzing open-ended employee survey responses. Because these techniques scale well and can deliver real-time insights, they make qualitative data collection equally or more desirable in organizations. The chapter walks through the evolution of text analytics in industrial–organizational psychology and discusses relevant supervised and unsupervised machine learning NLP methods for survey text data, such as latent Dirichlet allocation, latent semantic analysis, sentiment analysis, and word relatedness methods. The chapter also lays out preprocessing techniques and the trade-offs of growing NLP capabilities internally versus externally, points readers to available resources, and ends by discussing the implications and future directions of these approaches.
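
As a minimal sketch of the unsupervised methods the chapter surveys, the Python snippet below fits a small latent Dirichlet allocation model to a handful of invented survey responses with gensim; the example texts, topic count, and parameters are illustrative assumptions rather than the chapter's own code.

```python
# A minimal sketch: unsupervised topic discovery on open-ended survey
# responses with gensim's LDA. The responses are invented placeholders.
from gensim import corpora
from gensim.models import LdaModel

responses = [
    "manager communication could be more transparent",
    "love the flexible hours and remote work policy",
    "pay and benefits lag behind the market",
    "my manager rarely communicates team goals",
]

# Tokenize, then build the dictionary and bag-of-words corpus.
tokens = [r.lower().split() for r in responses]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]

# Fit a small LDA model; num_topics is a tuning choice, not a given.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```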

Author(s):  
Radha Guha

Background: In the era of information overload, it is very difficult for a human reader to quickly make sense of the vast information available on the internet. Even for a specific domain like a college or university website, it may be difficult for a user to browse through all the links to get relevant answers quickly. Objective: In this scenario, the design of a chatbot that can answer questions about college information and compare colleges will be very useful and novel. Methods: In this paper, a novel conversational-interface chatbot application with information retrieval and text summarization skills is designed and implemented. First, the chatbot has a simple dialog skill: when it understands the intent of a user query, it responds from a stored collection of answers. Second, for unknown queries, the chatbot can search the internet and then perform text summarization using advanced techniques of natural language processing (NLP) and text mining (TM). Results: The advancement of NLP capabilities for information retrieval and text summarization using the machine learning techniques of latent semantic analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec, Global Vectors (GloVe), and TextRank is first reviewed and compared in this paper before they are implemented in the chatbot design. This chatbot improves the user experience tremendously by answering specific queries concisely, which takes less time than reading an entire document. Students, parents, and faculty can more efficiently get answers about a variety of information such as admission criteria, fees, course offerings, notice board, attendance, grades, placements, faculty profiles, research papers, and patents. Conclusion: The purpose of this paper was to follow the advancement of NLP technologies and implement them in a novel application.
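
Of the methods reviewed, TextRank lends itself to a compact illustration: rank sentences by PageRank over a TF-IDF cosine-similarity graph and return the top-ranked ones as the summary. The sketch below is a hedged approximation, not the paper's implementation; the sample sentences and library choices are assumptions.

```python
# A sketch of TextRank extractive summarization: sentences are nodes,
# edges are weighted by TF-IDF cosine similarity, and PageRank ranks them.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, n=2):
    # Build the sentence-similarity graph.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)  # TextRank is PageRank on this graph
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]  # keep source order

doc = [
    "The college offers undergraduate and graduate programs.",
    "Admission requires a completed application and transcripts.",
    "The campus library is open on weekends.",
    "Application deadlines fall in January and June.",
]
print(textrank_summary(doc, n=2))
```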


2021 ◽  
Author(s):  
Anahita Davoudi ◽  
Natalie Lee ◽  
Thaibinh Luong ◽  
Timothy Delaney ◽  
Elizabeth Asch ◽  
...  

Background: Free-text communication between patients and providers is playing an increasing role in chronic disease management, through platforms varying from traditional healthcare portals to more novel mobile messaging applications. These text data are rich resources for clinical and research purposes, but their sheer volume renders them difficult to manage. Even automated approaches such as natural language processing require labor-intensive manual classification for developing training datasets, which is a rate-limiting step. Automated approaches to organizing free-text data are necessary to facilitate the use of free-text communication for clinical care and research. Objective: We applied unsupervised learning approaches to (1) understand the types of topics discussed and (2) learn medication-related intents from messages sent between patients and providers through a bi-directional text messaging system for managing participant blood pressure. Methods: This study was a secondary analysis of de-identified messages from a remote mobile text-based employee hypertension management program at an academic institution. In experiment 1, we trained a Latent Dirichlet Allocation (LDA) model for each message type (inbound-patient and outbound-provider) and identified the distribution of major topics and significant topics (probability >0.20) across message types. In experiment 2, we annotated all medication-related messages with a single medication intent. Then, we trained a second LDA model (medLDA) to assess how well the unsupervised method could identify more fine-grained medication intents. We encoded each medication message with n-grams (n = 1–3 words) using spaCy, clinical named entities using Stanza, and medication categories using MedEx, and then applied chi-square feature selection to learn the most informative features associated with each medication intent. Results: A total of 253 participants and 5 providers engaged in the program, generating 12,131 total messages: 47% patient messages and 53% provider messages. Most patient messages correspond to blood pressure (BP) reporting, BP encouragement, and appointment scheduling. In contrast, most provider messages correspond to BP reporting, medication adherence, and confirmatory statements. In experiment 1, for both patient and provider messages, most messages contained one topic and few contained more than three topics, as identified using LDA. However, manual review of some messages within topics revealed significant heterogeneity even within single-topic messages as identified by LDA. In experiment 2, among the 534 medication messages annotated with a single medication intent, most of the 282 patient medication messages referred to medication request (48%; n=134) and medication taking (28%; n=79); most of the 252 provider medication messages referred to medication question (69%; n=173). Although medLDA could identify a majority intent within each topic, the model could not distinguish medication intents with low prevalence within either patient or provider messages. Richer feature engineering identified informative lexical-semantic patterns associated with each medication intent class. Conclusion: LDA can be an effective method for generating subgroups of messages with similar term usage and can facilitate the review of topics to inform annotations. However, the small number of training cases and the shared vocabulary between intents preclude the use of LDA for fully automated deep medication intent classification.
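
The chi-square feature-selection step described above can be sketched in a few lines; the messages, intent labels, and vectorizer settings below are invented stand-ins for the study's annotated data, shown only to illustrate the technique.

```python
# A minimal sketch of chi-square feature selection: score n-gram features
# against medication-intent labels. Data here are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

messages = [
    "can you refill my lisinopril",
    "i took my medication this morning",
    "please send a new prescription",
    "did you take your blood pressure pill today",
]
labels = ["request", "taking", "request", "question"]

# 1- to 3-gram counts, mirroring the n-gram encoding in the study.
vec = CountVectorizer(ngram_range=(1, 3))
X = vec.fit_transform(messages)

# Chi-square score per feature against the intent labels.
scores, _ = chi2(X, labels)
top = sorted(zip(vec.get_feature_names_out(), scores),
             key=lambda p: -p[1])[:5]
print(top)
```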


Author(s):  
Jia Luo ◽  
Dongwen Yu ◽  
Zong Dai

It is not feasible to process huge amounts of structured and semi-structured data with manual methods. This study aims to solve the problem of processing huge data through machine learning algorithms. We collected text data on the company's public opinion through web crawlers, applied the Latent Dirichlet Allocation (LDA) algorithm to extract keywords from the text, and used fuzzy clustering to group the keywords into different topics. The topic keywords are then used as a seed dictionary for new word discovery. To verify the efficiency of machine learning in new word discovery, algorithms based on association rules, N-gram, PMI, and Word2vec were used for comparative testing of new word discovery. The experimental results show that the machine learning-based Word2vec algorithm achieves the highest accuracy, recall, and F-value.
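
As an illustration of the PMI baseline used in the comparison, the sketch below ranks bigrams by pointwise mutual information to surface candidate multi-word terms; the toy corpus and frequency threshold are assumptions for demonstration.

```python
# A sketch of PMI-based new-word (collocation) discovery with NLTK;
# the corpus is a toy example, not the study's crawled data.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("machine learning drives new word discovery "
          "machine learning finds new word candidates "
          "public opinion text needs machine learning").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams seen fewer than 2 times

# Rank bigrams by pointwise mutual information; high-PMI pairs are
# candidate multi-word terms ("new words").
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 3))
```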


2020 ◽  
Vol 25 (4) ◽  
pp. 174-189 ◽  
Author(s):  
Guillaume Palacios ◽
Arnaud Noreña ◽  
Alain Londero

Introduction: Subjective tinnitus (ST) and hyperacusis (HA) are common auditory symptoms that may become incapacitating in a subgroup of patients who thereby seek medical advice. Both conditions can result from many different mechanisms, and as a consequence, patients may report a vast repertoire of associated symptoms and comorbidities that can dramatically reduce quality of life and even lead to suicide attempts in the most severe cases. The present exploratory study investigates patients' symptoms and complaints using an in-depth statistical analysis of patients' natural narratives in a real-life environment in which, thanks to the anonymization of contributions and the peer-to-peer interaction, the wording used can be assumed to be free of any self-limitation and self-censorship. Methods: We applied a purely statistical, non-supervised machine learning approach to the analysis of patients' verbatim posts exchanged on an Internet forum. After automated data extraction, the dataset was preprocessed to make it suitable for statistical analysis. We used a variant of the Latent Dirichlet Allocation (LDA) algorithm to reveal clusters of symptoms and complaints of HA patients (topics). The probability distribution of words within a topic uniquely characterizes it. Convergence of the log-likelihood of the LDA model was reached after 2,000 iterations. Several statistical parameters were tested for topic modeling and for the word relevance factor within each topic. Results: Despite a rather small dataset, this exploratory study demonstrates that patients' free speech available on the Internet constitutes valuable material for machine learning and statistical analysis aimed at categorizing ST/HA complaints. The LDA model with K = 15 topics seems to be the most relevant in terms of relative weights and correlations, with the capability to individualize subgroups of patients displaying specific characteristics. The study of the relevance factor may be useful to unveil weak but important signals present in patients' narratives. Discussion/Conclusion: We claim that the non-supervised LDA approach would make it possible to gain knowledge on the patterns of ST- and HA-related complaints and on patient-centered domains of interest. The merits and limitations of the LDA algorithm are compared with other natural language processing methods and with more conventional methods of qualitative analysis of patients' output. Future directions and research topics emerging from this innovative algorithmic analysis are proposed.
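
One common way to choose K, sketched below under assumed data, is to compare topic-coherence scores across candidate values; whether the authors used coherence or another criterion for settling on K = 15 is not stated, so this is an illustration of the general technique only.

```python
# A hedged sketch of picking the number of topics K by topic coherence.
# The forum posts here are invented placeholders.
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

posts = [["tinnitus", "ringing", "sleep", "loss"],
         ["hyperacusis", "loud", "sound", "pain"],
         ["ear", "pressure", "stress", "anxiety"],
         ["noise", "protection", "earplugs", "concert"]]

dictionary = corpora.Dictionary(posts)
corpus = [dictionary.doc2bow(p) for p in posts]

# Fit LDA for each candidate K and compare c_v coherence scores.
for k in (2, 3):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=posts, dictionary=dictionary,
                        coherence="c_v", topn=4)
    print(k, cm.get_coherence())
```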


Natural language processing uses word embeddings to map words into vectors, and the context vector is one technique for doing so. The context vector gives the importance of terms in the document corpus. Context vectors can be derived using various methods such as neural networks, latent semantic analysis, and knowledge-base methods. This paper proposes a novel system, an enhanced context vector machine called eCVM, which is able to determine context phrases and their importance. eCVM uses latent semantic analysis, the existing context vector machine, dependency parsing, named entities, topics from latent Dirichlet allocation, and various forms of words such as nouns, adjectives, and verbs for building the context. eCVM uses the context vector and the PageRank algorithm to find the importance of a term in a document and is tested on the BBC news dataset. Results of eCVM are compared with the state of the art for context derivation. The proposed system shows improved performance over existing systems on standard evaluation parameters.
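
The PageRank step that eCVM uses to weight terms can be illustrated on a word co-occurrence graph; the sample text, window size, and unweighted edges below are simplifying assumptions, not the eCVM implementation.

```python
# A minimal sketch of term weighting via PageRank: build a word
# co-occurrence graph and rank nodes by PageRank score.
import networkx as nx

text = ("the central bank raised interest rates as the bank expects "
        "inflation to slow while interest in bonds grows").split()

graph = nx.Graph()
window = 2  # co-occurrence window (an illustrative choice)
for i, word in enumerate(text):
    for other in text[i + 1 : i + 1 + window]:
        graph.add_edge(word, other)

# The PageRank score approximates a term's importance in the document.
ranks = nx.pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True)[:5])
```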


10.2196/23957 ◽  
2021 ◽  
Vol 23 (2) ◽  
pp. e23957
Author(s):  
Chengda Zheng ◽  
Jia Xue ◽  
Yumin Sun ◽  
Tingshao Zhu

Background During the COVID-19 pandemic in Canada, Prime Minister Justin Trudeau provided updates on the novel coronavirus and the government’s responses to the pandemic in his daily briefings from March 13 to May 22, 2020, delivered on the official Canadian Broadcasting Corporation (CBC) YouTube channel. Objective The aim of this study was to examine comments on Canadian Prime Minister Trudeau’s COVID-19 daily briefings by YouTube users and track these comments to extract the changing dynamics of the opinions and concerns of the public over time. Methods We used machine learning techniques to longitudinally analyze a total of 46,732 English YouTube comments that were retrieved from 57 videos of Prime Minister Trudeau’s COVID-19 daily briefings from March 13 to May 22, 2020. A natural language processing model, latent Dirichlet allocation, was used to choose salient topics among the sampled comments for each of the 57 videos. Thematic analysis was used to classify and summarize these salient topics into different prominent themes. Results We found 11 prominent themes, including strict border measures, public responses to Prime Minister Trudeau’s policies, essential work and frontline workers, individuals’ financial challenges, rental and mortgage subsidies, quarantine, government financial aid for enterprises and individuals, personal protective equipment, Canada and China’s relationship, vaccines, and reopening. Conclusions This study is the first to longitudinally investigate public discourse and concerns related to Prime Minister Trudeau’s daily COVID-19 briefings in Canada. This study contributes to establishing a real-time feedback loop between the public and public health officials on social media. Hearing and reacting to real concerns from the public can enhance trust between the government and the public to prepare for future health emergencies.
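
A rough sketch of the per-video topic step follows, under invented comment data: fit an LDA model to a video's comments and keep topics whose per-comment probability exceeds a cutoff. The scikit-learn implementation, sample comments, and 0.5 threshold are illustrative assumptions, not the study's pipeline.

```python
# A sketch of salient-topic extraction from one video's comments with LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "close the borders now please",
    "thank you for the financial aid programs",
    "when will borders reopen for travel",
    "small businesses need more financial support",
]

X = CountVectorizer(stop_words="english").fit_transform(comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Per-comment topic distribution; flag topics above an assumed cutoff.
for dist in lda.transform(X):
    print([t for t, p in enumerate(dist) if p > 0.5])
```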


Sentiment classification is one of the best-known and most popular domains of machine learning and natural language processing, in which an algorithm is developed to understand the opinion expressed about an entity much as a human would. This research article presents such an approach. Concepts from natural language processing are used for text representation, and a novel word-embedding model is then proposed for effective classification of the data. TF-IDF and common bag-of-words (BoW) representation models are considered for representing the text data; the importance of these models is discussed in the respective sections. The proposed model is tested on the IMDB dataset, with 50% of the data used for training and 50% for testing, and three random shufflings of the dataset are used for evaluating the model.
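
The comparison of the two representations can be sketched directly, with invented stand-in reviews, a 50/50 shuffled split as described, and logistic regression as an assumed classifier (the article's own classifier choice is not specified here).

```python
# A hedged sketch comparing bag-of-words counts and TF-IDF for sentiment
# classification, with a 50/50 shuffled split. Data are tiny placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

reviews = ["a wonderful heartfelt film", "dull plot and wooden acting",
           "brilliant performances throughout", "a boring waste of time"] * 10
labels = [1, 0, 1, 0] * 10  # 1 = positive, 0 = negative

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    X = vec.fit_transform(reviews)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.5, shuffle=True, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```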


2019 ◽  
Vol 15 (2) ◽  
pp. 491-508
Author(s):  
Kaile Zhang ◽  
Ichiro Koshijima

Purpose Online tourism reviews have not been exploited effectively because their text data are enormous and in-depth research on them is still in its infancy. It is therefore expected that processing the text data with text-mining methods will reveal the implicit information they contain. The purpose of this paper is to help tourism practitioners and tourists conveniently use these texts through appropriate visualization processing techniques. In particular, reviews that change over time can be used to reflect changes in tourists' feedback and concerns. Design/methodology/approach Latent semantic analysis is a relatively new branch of semantics in which every term in a document can be regarded as a single point in a multi-dimensional space. When a document with semantic content enters such a space, its distribution is not random but obeys some type of semantic structure. Findings First, an overall grasp of the big data is achievable. Second, a direct method is proposed that allows more non-language-processing researchers or proprietors to use the data. Lastly, changes across different time spans are investigated. Originality/value This paper proposes an approach for analyzing a significant number of travel comments from different years that may generate new ideas for tourism. The authors put forward a processing approach to deal with large amounts of comment text. Using the case study of Mt. Lushan, the various changes in travel reviews over the years are successfully visualized and displayed.
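
A minimal sketch of the latent-semantic-analysis step follows: project review texts into a low-dimensional semantic space with truncated SVD, so each review becomes a point whose position reflects latent structure. The review snippets and dimensionality are illustrative assumptions.

```python
# A sketch of LSA: TF-IDF vectors reduced by truncated SVD so that each
# review becomes a point in a low-dimensional semantic space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "the mountain scenery was breathtaking",
    "long queues at the cable car station",
    "beautiful views from the summit trail",
    "ticket prices have increased this year",
]

X = TfidfVectorizer().fit_transform(reviews)
svd = TruncatedSVD(n_components=2, random_state=0)
# Nearby points in this 2-D space share latent semantic structure.
print(svd.fit_transform(X))
```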


2021 ◽  
Author(s):  
Insook Cho ◽  
Minyoung Lee ◽  
Yeonjin Kim

Patient safety is a fundamental aspect of healthcare quality, and there is growing interest in improving safety among healthcare stakeholders in many countries. The Korean government recognized patient safety as a societal concern following several serious adverse events, and so the Ministry of Health and Welfare enacted the Patient Safety Act in January 2015. This study analyzed text data on patient safety collected from web-based, user-generated documents related to the legislation to see if they accurately represent the specific concerns of various healthcare stakeholders. We adopted an unsupervised natural language processing method, probabilistic topic modeling, specifically Latent Dirichlet Allocation. The results showed that text data are useful for inferring the latent concerns of healthcare consumers, providers, government bodies, and researchers, as well as changes therein over time.
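
Tracking how latent concerns change over time, as the study does, reduces to aggregating per-document topic weights by period; the sketch below assumes invented dates and topic distributions purely for illustration.

```python
# A sketch of topic prevalence over time: average per-document topic
# weights by year. Years and distributions are invented placeholders.
from collections import defaultdict

# (year, doc_topic_distribution) pairs, e.g. from a fitted LDA model.
docs = [(2014, [0.7, 0.3]), (2014, [0.6, 0.4]),
        (2015, [0.2, 0.8]), (2015, [0.3, 0.7])]

totals = defaultdict(lambda: [0.0, 0.0])
counts = defaultdict(int)
for year, dist in docs:
    counts[year] += 1
    for t, p in enumerate(dist):
        totals[year][t] += p

# Mean topic weight per year shows how concerns shift over time.
for year in sorted(totals):
    print(year, [round(p / counts[year], 2) for p in totals[year]])
```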


Author(s):  
Mitta Roja

Abstract: Cyberbullying is a major problem encountered on the internet that affects teenagers as well as adults. It has led to tragic outcomes, including depression and suicide, and regulating content on social media platforms has become a growing need. The following study uses data from two different forms of cyberbullying, hate speech tweets from Twitter and comments based on personal attacks from Wikipedia forums, to build a model for detecting cyberbullying in text data using natural language processing and machine learning. Three methods for feature extraction and four classifiers are studied to determine the best approach. For the Twitter data the model achieves accuracies above 90%, and for the Wikipedia data it achieves accuracies above 80%. Keywords: Cyberbullying, Hate speech, Personal attacks, Machine learning, Feature extraction, Twitter, Wikipedia
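
The study's grid of feature-extraction methods and classifiers can be sketched as nested loops over vectorizers and models with cross-validated accuracy; the tiny invented dataset and the particular vectorizers and classifiers below are assumptions, not the study's exact choices.

```python
# A sketch of comparing feature extractors and classifiers for
# cyberbullying detection; data and model choices are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["you are awful and stupid", "have a great day friend",
         "nobody likes you loser", "thanks for the kind words"] * 10
labels = [1, 0, 1, 0] * 10  # 1 = bullying, 0 = benign

# Cross-validated accuracy for each vectorizer/classifier pair.
for vec in (CountVectorizer(), TfidfVectorizer()):
    for clf in (MultinomialNB(), LogisticRegression(),
                RandomForestClassifier()):
        pipe = make_pipeline(vec, clf)
        acc = cross_val_score(pipe, texts, labels, cv=3).mean()
        print(type(vec).__name__, type(clf).__name__, round(acc, 2))
```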

