Root cause analysis of COVID-19 cases by enhanced text mining process

Author(s):  
Sujatha Arun Kokatnoor ◽  
Balachandran Krishnan

The main focus of this research is to find the reasons behind the fresh cases of COVID-19 from the public’s perception, for data specific to India. The analysis is done using machine learning approaches, and the inferences are validated with medical professionals. The data processing and analysis are accomplished in three steps. First, the dimensionality of the vector space model (VSM) is reduced with an improved feature engineering (FE) process that uses a weighted term frequency-inverse document frequency (TF-IDF) and forward scan trigrams (FST), followed by removal of weak features using a feature hashing technique. In the second step, an enhanced K-means clustering algorithm is used to group the public posts collected from Twitter®. In the last step, latent Dirichlet allocation (LDA) is applied to discover the trigram topics relevant to the reasons behind the increase in fresh COVID-19 cases. The enhanced K-means clustering improved the Dunn index value by 18.11% compared with the traditional K-means method. By incorporating the improved two-step FE process, the LDA model improved its coherence score by 14%, and by 19% and 15% when compared with latent semantic analysis (LSA) and the hierarchical Dirichlet process (HDP) respectively, resulting in 14 root causes for the spike in the disease.
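
A minimal sketch of the generic pipeline this abstract describes, using plain scikit-learn components: trigram-aware TF-IDF features, feature hashing, K-means grouping, and LDA topics. The authors' improved FE process and enhanced K-means variant are not reproduced, and the `tweets` list is a hypothetical placeholder corpus.

```python
# Generic sketch only: standard scikit-learn stand-ins for the pipeline steps
# (weighted trigram TF-IDF, feature hashing, K-means grouping, LDA topics).
# `tweets` is a hypothetical placeholder corpus, not the study's Twitter data.
from sklearn.feature_extraction.text import (
    TfidfVectorizer, HashingVectorizer, CountVectorizer)
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "fresh covid cases rise after large public gatherings in the city",
    "people ignore masks and distancing at crowded markets this week",
    "vaccine hesitancy and travel drive a new spike in infections",
    "hospitals report more cases after the festival season rush",
    "testing delays hide the real spread of the virus in rural areas",
    "reopened offices and packed transport linked to fresh infections",
]

# 1) Trigram-aware TF-IDF features (stand-in for the weighted TF-IDF + FST step)
tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(tweets)

# 2) Feature hashing as a compact representation that drops weak/rare features
hasher = HashingVectorizer(ngram_range=(1, 3), n_features=1024, alternate_sign=False)
X_hashed = hasher.transform(tweets)

# 3) Plain K-means grouping of the posts (the paper uses an enhanced variant)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_hashed)

# 4) LDA over trigram counts to surface candidate "root cause" topics
counts = CountVectorizer(ngram_range=(3, 3))
X_counts = counts.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X_counts)
print(X_tfidf.shape, labels, doc_topics.shape)
```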

Author(s):  
A.S. Li ◽  
A.J.C. Trappey ◽  
C.V. Trappey

A registered trademark distinctively identifies a company, its products, or its services. A trademark (TM) is a type of intellectual property (IP) that is protected by the laws of the country where the trademark is officially registered. TM owners may take legal action when their IP rights are infringed upon. TM legal cases have grown in pace with the increasing number of TMs registered globally. In this paper, an intelligent recommender system automatically identifies similar TM case precedents for any given target case to support IP legal research. This study constructs the semantic network representing the TM legal scope and terminology. A system is built to identify similar cases based on the machine-readable, frame-based knowledge representations of the judgments/documents. In this research, 4,835 US TM legal cases litigated in the US district and federal courts are collected as the experimental dataset. The computer-assisted system is constructed to extract critical features based on the ontology schema. The recommender identifies similar prior cases according to the values of their features embedded in these legal documents, which include the case facts, issues under dispute, judgment holdings, and applicable rules and laws. Term frequency-inverse document frequency is used for text mining to discover the critical features of the litigated cases. A soft clustering algorithm, e.g., latent Dirichlet allocation, is applied to generate topics and the cases belonging to these topics. Thus, similar cases under each topic are identified for reference. Through the analysis of the similarity between cases based on the TM legal semantic analysis, the intelligent recommender provides precedents to support TM legal action and strategic planning.
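
A short sketch of the retrieval idea only: count features, LDA topic mixtures as a soft clustering, and cosine similarity to rank prior cases against a target case. The ontology schema and frame-based feature extraction described in the abstract are not reproduced, and the case texts are hypothetical stand-ins.

```python
# Sketch of the retrieval idea: LDA topic mixtures as soft clusters, then
# cosine similarity to rank prior cases against a target case. The ontology /
# frame-based feature extraction from the abstract is not reproduced.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

cases = [  # hypothetical stand-ins for the 4,835 judgment texts
    "trademark infringement likelihood of confusion apparel brand",
    "dilution of a famous mark by a parody domain name",
    "counterfeit goods seized at the border under the lanham act",
]
target = ["likelihood of confusion between two clothing trademarks"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(cases + target)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(X)             # soft cluster: topic mixture per case

# Rank prior cases by similarity of their topic mixtures to the target case
scores = cosine_similarity(topics[-1:], topics[:-1]).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {cases[i]}")
```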


Author(s):  
Junaid Rashid ◽  
Syed Muhammad Adnan Shah ◽  
Aun Irtaza

Topic modeling is an effective text mining and information retrieval approach for organizing knowledge with various contents under a specific topic. Text documents in the form of news articles are increasing very fast on the web. Analysis of these documents is very important in the fields of text mining and information retrieval. Extracting meaningful information from these documents is a challenging task. One approach for discovering the themes in text documents is topic modeling, but this approach still needs a new perspective to improve its performance. In topic modeling, documents have topics and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach based on the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters 21578 and BBC News, show that KTM performs better than state-of-the-art topic models such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining and achieves higher performance than its competitors LDA and LSA.
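
The KTM algorithm itself is not given in this excerpt; the following is a minimal sketch of the underlying idea of reading "topics" off K-means centroids computed over TF-IDF vectors. The `docs` corpus and the number of clusters are illustrative assumptions.

```python
# Minimal sketch: treat each K-means centroid over TF-IDF vectors as a "topic"
# whose words are the centroid's highest-weighted terms. This illustrates the
# general k-means-as-topic-model idea, not the paper's KTM algorithm.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [  # hypothetical placeholder documents
    "stock markets fell amid interest rate fears",
    "the central bank raised interest rates again",
    "the football club signed a new striker",
    "the striker scored twice in the cup final",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = np.array(tfidf.get_feature_names_out())

# A "topic" = the top terms of each cluster centroid
for k, centroid in enumerate(km.cluster_centers_):
    top = terms[np.argsort(centroid)[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```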


2020 ◽  
Vol 18 (1) ◽  
pp. 1-7
Author(s):  
Adnen Mahmoud ◽  
Mounir Zrigui

Paraphrase detection determines whether an original document and a suspect document convey the same meaning. It has attracted attention from researchers in many Natural Language Processing (NLP) tasks such as plagiarism detection, question answering, and information retrieval. Traditional methods (e.g., Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA)) cannot efficiently capture hidden semantic relations when sentences share no common words or when word co-occurrences are rare. Therefore, we proposed a deep learning model based on Global Word embedding (GloVe) and a Recurrent Convolutional Neural Network (RCNN), which is efficient at capturing contextual dependencies between word vectors with precise semantic meanings. Given the lack of publicly available resources for the Arabic language, we automatically developed a paraphrased corpus that preserves the syntactic and semantic structures of Arabic sentences using the word2vec model and Part-Of-Speech (POS) annotation. Overall, experiments showed that our proposed model outperformed the state-of-the-art methods in terms of precision and recall.
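
A schematic tf.keras sketch of an RCNN-style sentence-pair classifier in the spirit of the abstract: a shared encoder (embedding, BiLSTM, Conv1D, global max pooling) applied to both sentences, with the merged representation fed to a sigmoid decision. The paper's GloVe weights, Arabic corpus, and hyperparameters are not reproduced; vocabulary size, sequence length, and layer sizes below are illustrative assumptions.

```python
# Schematic RCNN-style paraphrase classifier sketch: a shared encoder
# (embedding -> BiLSTM -> Conv1D -> global max pooling) for both sentences,
# merged into a sigmoid paraphrase/not-paraphrase decision. Random embeddings
# stand in for GloVe; sizes are illustrative, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, MAXLEN, DIM = 20000, 50, 100  # assumed, for illustration only

def encoder():
    inp = layers.Input(shape=(MAXLEN,), dtype="int32")
    x = layers.Embedding(VOCAB, DIM)(inp)  # GloVe weights would be loaded here
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Conv1D(64, 3, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return Model(inp, x)

enc = encoder()
s1 = layers.Input(shape=(MAXLEN,), dtype="int32")
s2 = layers.Input(shape=(MAXLEN,), dtype="int32")
merged = layers.concatenate([enc(s1), enc(s2)])
out = layers.Dense(1, activation="sigmoid")(merged)

model = Model([s1, s2], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```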


10.2196/20550 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e20550
Author(s):  
Jia Xue ◽  
Junxiang Chen ◽  
Ran Hu ◽  
Chen Chen ◽  
Chengda Zheng ◽  
...  

Background It is important to measure the public response to the COVID-19 pandemic. Twitter is an important data source for infodemiology studies involving public response monitoring. Objective The objective of this study is to examine COVID-19–related discussions, concerns, and sentiments using tweets posted by Twitter users. Methods We analyzed 4 million Twitter messages related to the COVID-19 pandemic using a list of 20 hashtags (eg, “coronavirus,” “COVID-19,” “quarantine”) from March 7 to April 21, 2020. We used a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams and bigrams, salient topics and themes, and sentiments in the collected tweets. Results Popular unigrams included “virus,” “lockdown,” and “quarantine.” Popular bigrams included “COVID-19,” “stay home,” “corona virus,” “social distancing,” and “new cases.” We identified 13 discussion topics and categorized them into 5 different themes: (1) public health measures to slow the spread of COVID-19, (2) social stigma associated with COVID-19, (3) COVID-19 news, cases, and deaths, (4) COVID-19 in the United States, and (5) COVID-19 in the rest of the world. Across all identified topics, the dominant sentiments for the spread of COVID-19 were anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear related to different topics. The public tweets revealed a significant feeling of fear when people discussed new COVID-19 cases and deaths compared to other topics. Conclusions This study showed that Twitter data and machine learning approaches can be leveraged for an infodemiology study, enabling research into evolving public discussions and sentiments during the COVID-19 pandemic. As the situation rapidly evolves, several topics are consistently dominant on Twitter, such as confirmed cases and death rates, preventive measures, health authorities and government policies, COVID-19 stigma, and negative psychological reactions (eg, fear). Real-time monitoring and assessment of Twitter discussions and concerns could provide useful data for public health emergency responses and planning. Pandemic-related fear, stigma, and mental health concerns are already evident and may continue to influence public trust when a second wave of COVID-19 occurs or there is a new surge of the current pandemic.
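
The descriptive part of such a study can be illustrated with a few lines of scikit-learn: popular unigrams/bigrams from simple counts, and LDA topics over the same corpus. The sentiment analysis, hashtag-based collection, and 4-million-tweet scale are not reproduced; `tweets` is a hypothetical placeholder.

```python
# Sketch of the descriptive steps: popular unigrams/bigrams via simple counts,
# and LDA topics over the same corpus. The sentiment analysis and the real
# 4-million-tweet collection are not reproduced; `tweets` is a placeholder.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "stay home and practice social distancing",
    "new cases reported as the lockdown continues",
    "quarantine day twelve stay home stay safe",
    "the virus spread slows where social distancing holds",
]

# Popular unigrams and bigrams
vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(tweets)
freqs = np.asarray(X.sum(axis=0)).ravel()
terms = vec.get_feature_names_out()
print("popular n-grams:", [terms[i] for i in freqs.argsort()[::-1][:10]])

# LDA topics (13 in the paper; 2 here for the toy corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, comp in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in comp.argsort()[::-1][:5]])
```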


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Wenhao Chen ◽  
Kin Keung Lai ◽  
Yi Cai

Purpose: Sina Weibo and Twitter are the top microblogging platforms with billions of users. Accordingly, these two platforms could be used to understand the public mood. In this paper, the authors discuss how to generate and compare the public mood on Sina Weibo and Twitter. The predictive power of the public mood toward commodity markets is discussed, and the authors address how to choose between Sina Weibo and Twitter when predicting crude oil prices. Design/methodology/approach: An enhanced latent Dirichlet allocation model considering term weights is implemented to generate topics from Sina Weibo and Twitter. A Granger causality test and a long short-term memory neural network model are used to demonstrate that the public mood on Sina Weibo and Twitter is correlated with commodity contracts. Findings: By comparing the topics and the public mood on Sina Weibo and Twitter, the authors find significant differences in user behavior on these two websites. Besides, the authors demonstrate that the public mood on Sina Weibo and Twitter is correlated with crude oil contract prices on the Shanghai International Energy Exchange and the New York Mercantile Exchange, respectively. Originality/value: Two sentiment analysis methods for Chinese (Sina Weibo) and English (Twitter) posts are introduced, which can be reused for other semantic analysis tasks. Besides, the authors present a prediction model for practical participants in the commodity markets and introduce a method to choose between Sina Weibo and Twitter for certain prediction tasks.
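
The Granger-causality step described in the methodology can be sketched with statsmodels as below. The two series are synthetic placeholders standing in for a daily mood index and crude-oil contract prices; the LDA-based mood construction and the LSTM forecasting model are not reproduced.

```python
# Sketch of the Granger-causality check between a daily "public mood" index and
# crude-oil contract prices using statsmodels. Both series are synthetic
# placeholders for the Weibo/Twitter mood scores and exchange prices.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 200
mood = rng.normal(size=n).cumsum()                                    # placeholder mood index
price = 50 + 0.3 * np.roll(mood, 2) + rng.normal(scale=0.5, size=n)   # lagged dependence

df = pd.DataFrame({"oil_price": np.diff(price), "mood": np.diff(mood)})

# Null hypothesis: "mood" does NOT Granger-cause "oil_price"; the second column
# is tested as a predictor of the first, so column order matters here.
results = grangercausalitytests(df[["oil_price", "mood"]], maxlag=3)
```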


2021 ◽  
Vol 6 (1) ◽  
pp. 17
Author(s):  
Kartika Rizqi Nastiti ◽  
Ahmad Fathan Hidayatullah ◽  
Ahmad Rafie Pratama

Before conducting a research project, researchers must find the trends and the state of the art in their research field. However, that is not necessarily an easy job for researchers, partly due to the lack of specific tools to filter the required information by time range. This study aims to provide a solution to that problem by applying a topic modeling approach to data scraped from Google Scholar between 2010 and 2019. We utilized Latent Dirichlet Allocation (LDA) combined with Term Frequency-Inverse Document Frequency (TF-IDF) to build topic models and employed the coherence score method to determine how many distinct topics there are for each year’s data. We also provided a visualization of the topic interpretation and word distribution for each topic, as well as its relevance, using word clouds and pyLDAvis. In the future, we expect to add more features to show the relevance and interconnections between each topic to make it even easier for researchers to use this tool in their research projects.
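
A minimal gensim sketch of the modeling workflow described here: a TF-IDF-weighted corpus, LDA models for several topic counts, and c_v coherence to choose among them. The Google Scholar scraping and the word-cloud/pyLDAvis visualization are omitted; `docs` is a hypothetical placeholder for one year of tokenized titles or abstracts.

```python
# Minimal gensim sketch: TF-IDF-weighted corpus, LDA for several topic counts,
# and c_v coherence to pick the best count. Scraping and visualization steps
# are omitted; `docs` is a placeholder for one year of tokenized documents.
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LdaModel, CoherenceModel

docs = [
    ["deep", "learning", "image", "classification"],
    ["topic", "modeling", "text", "mining"],
    ["convolutional", "networks", "image", "recognition"],
    ["latent", "dirichlet", "allocation", "topic", "discovery"],
]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
tfidf_corpus = TfidfModel(bow)[bow]

best = None
for k in range(2, 4):
    lda = LdaModel(tfidf_corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    score = cm.get_coherence()
    print(f"k={k}: coherence={score:.3f}")
    if best is None or score > best[0]:
        best = (score, lda)
```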


2017 ◽  
Vol 11 (03) ◽  
pp. 373-389
Author(s):  
Sara Santilli ◽  
Laura Nota ◽  
Giovanni Pilato

In the present work, Latent Semantic Analysis of textual data was applied to texts related to courage, in order to compare and contrast results and evaluate the opportunity of integrating different data sets. To better understand the definition of courage in the Italian context, 1199 participants were involved in the present study and were asked to complete the prompt “Courage is…”. The participants’ definitions of courage were analyzed with Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), in order to study the fundamental concepts arising from the population. An analogous comparison with Twitter posts was also carried out to analyze whether the public opinion emerging from social media provides a challenging and rich context for exploring computational models of natural language.
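
The LSA step over short free-text definitions can be illustrated with scikit-learn (TF-IDF followed by truncated SVD). The actual 1,199 Italian responses and the Twitter comparison are not included; `answers` holds hypothetical English stand-ins.

```python
# Sketch of LSA over short free-text definitions: TF-IDF + truncated SVD.
# `answers` contains hypothetical stand-ins, not the study's Italian responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

answers = [
    "facing fear to protect someone else",
    "acting despite fear when it matters",
    "standing up for what is right",
    "doing the right thing even when afraid",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(answers)

lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)          # latent semantic coordinates of each answer

terms = tfidf.get_feature_names_out()
for k, comp in enumerate(lsa.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:4]]
    print(f"latent dimension {k}: {top}")
```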


Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 394
Author(s):  
Sergei Koltcov ◽  
Vera Ignatenko ◽  
Zeyd Boukhers ◽  
Steffen Staab

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Rényi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an informational statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA)—we first show that the minimum of Rényi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the entropy minimum away from the optimal topic number, an effect that is not observed for the hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models that need further research.
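
The paper's criterion is derived from a specific free-energy formulation; as a loose illustration only, the sketch below scans candidate topic numbers with LDA and computes a plain Rényi entropy of order q over the normalized topic-word weights, to show the shape of such a scan. The formula used here, H_q(p) = ln(Σ p_i^q)/(1−q), is the standard Rényi entropy, not the paper's exact functional, and `docs` is a placeholder corpus.

```python
# Illustrative only: scan topic numbers with LDA and compute a plain Rényi
# entropy of order q over the normalized topic-word weights. This is NOT the
# paper's free-energy-based functional; `docs` is a placeholder corpus.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def renyi_entropy(p, q=2.0):
    """Standard Rényi entropy H_q(p) = ln(sum_i p_i^q) / (1 - q)."""
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)

docs = [
    "stock markets fell amid rate fears",
    "the central bank raised interest rates",
    "the football club signed a new striker",
    "the striker scored twice in the final",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

for n_topics in (2, 3):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    weights = lda.components_ / lda.components_.sum()   # joint topic-word distribution
    print(f"T={n_topics}: H_2 = {renyi_entropy(weights.ravel()):.3f}")
```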


2021 ◽  
Vol 11 (13) ◽  
pp. 6113
Author(s):  
Adam Wawrzyński ◽  
Julian Szymański

To effectively process textual data, many approaches have been proposed for creating text representations. The transformation of a text into a form of numbers that can be processed by computers is crucial for further applications in downstream tasks such as document classification, document summarization, and so forth. In our work, we study the quality of text representations built with statistical methods and compare them to approaches based on neural networks. We describe in detail nine different algorithms used for text representation and then evaluate them on five diverse datasets: BBCSport, BBC, Ohsumed, 20Newsgroups, and Reuters. The selected statistical models include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF) weighting, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA). For the second group of deep neural networks, Partition-Smooth Inverse Frequency (P-SIF), Doc2Vec-Distributed Bag of Words Paragraph Vector (Doc2Vec-DBoW), Doc2Vec-Memory Model of Paragraph Vectors (Doc2Vec-DM), Hierarchical Attention Network (HAN), and Longformer were selected. The text representation methods were benchmarked in the document classification task, with the BoW and TFIDF models used as a baseline. Based on the identified weaknesses of the HAN method, an improvement in the form of a Hierarchical Weighted Attention Network (HWAN) was proposed. The incorporation of statistical features into HAN latent representations improves the results, or provides comparable ones, on four out of five datasets. The article also shows how the length of the processed text affects the results of the HAN and HWAN variants.
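
The baseline part of such a benchmark can be reproduced in a few lines of scikit-learn: BoW and TF-IDF representations evaluated on 20 Newsgroups with a logistic regression classifier. The neural representations compared in the paper (P-SIF, Doc2Vec, HAN/HWAN, Longformer) and the other four datasets are not reproduced here.

```python
# Minimal baseline of the benchmarking setup: BoW and TF-IDF representations
# evaluated on 20 Newsgroups with logistic regression. The neural models from
# the paper (P-SIF, Doc2Vec, HAN/HWAN, Longformer) are not reproduced.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    clf = make_pipeline(vec, LogisticRegression(max_iter=1000))
    clf.fit(train.data, train.target)
    acc = accuracy_score(test.target, clf.predict(test.data))
    print(f"{name}: accuracy = {acc:.3f}")
```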

