Social Inferences in Agenesis of the Corpus Callosum and Autism: Semantic Analysis and Topic Modeling

Author(s):  
Tiffany Renteria-Vazquez ◽  
Warren S. Brown ◽  
Christine Kang ◽  
Mark Graves ◽  
Fulvia Castelli ◽  
...  
2020 ◽  
Author(s):  
Mala Saraswat ◽  
Shampa Chakraverty

With the advent of e-commerce sites and social media, users express their preferences and tastes freely through user-generated content such as reviews and comments. In order to promote cross-selling, e-commerce sites such as eBay and Amazon regularly use such inputs from multiple domains and suggest items in which users may be interested. In this paper, we propose a topic coherence-based cross-domain recommender model. The core concept is to use topic modeling to extract topics from user-generated content such as reviews and combine them with reliable semantic coherence techniques to link different domains, using Wikipedia as a reference corpus. We experiment with different topic coherence methods such as pointwise mutual information (PMI) and explicit semantic analysis (ESA). Experimental results demonstrate that our approach yields 22.6% higher precision using PMI as the topic coherence measure, and 54.4% higher precision using ESA, compared with a cross-domain recommender system based on semantic clustering.
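As an illustration of the PMI-based coherence the abstract mentions, here is a minimal Python sketch that scores a topic's top words by average pairwise PMI, estimated from document co-occurrence counts in a reference corpus such as Wikipedia. The counts and names below are illustrative placeholders, not the authors' implementation.

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, doc_freq, co_doc_freq, num_docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, estimated from
    document co-occurrence counts in a reference corpus (e.g., Wikipedia)."""
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p_w1 = doc_freq.get(w1, 0) / num_docs
        p_w2 = doc_freq.get(w2, 0) / num_docs
        p_joint = co_doc_freq.get((w1, w2), 0) / num_docs
        scores.append(math.log((p_joint + eps) / ((p_w1 * p_w2) + eps)))
    return sum(scores) / len(scores)

# Toy counts standing in for Wikipedia statistics:
df = {"camera": 1200, "lens": 800, "novel": 900}
codf = {("camera", "lens"): 450, ("camera", "novel"): 12}
print(pmi_coherence(["camera", "lens", "novel"], df, codf, num_docs=100000))
```

A coherent topic (terms that co-occur in many reference documents) scores high; a topic mixing unrelated terms is pulled toward the penalty floor set by eps.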


Author(s):  
Erica Briscoe ◽  
Scott Appling ◽  
Edward Clarkson ◽  
Nikolay Lipskiy ◽  
James Tyson ◽  
...  

Objective
The objective of this analysis is to leverage recent advances in natural language processing (NLP) to develop new methods and system capabilities for processing social media (Twitter messages) for situational awareness (SA), syndromic surveillance (SS), and event-based surveillance (EBS). Specifically, we evaluated the use of human-in-the-loop semantic analysis to assist public health (PH) SA stakeholders in SS and EBS using massive amounts of publicly available social media data.

Introduction
Social media messages are often short, informal, and ungrammatical. They frequently involve text, images, audio, or video, which makes the identification of useful information difficult. This complexity reduces the efficacy of standard information extraction techniques [1]. However, recent advances in NLP, especially methods tailored to social media [2], have shown promise in improving real-time PH surveillance and emergency response [3]. Surveillance data derived from semantic analysis, combined with traditional surveillance processes, has the potential to improve event detection and characterization. The CDC Office of Public Health Preparedness and Response (OPHPR), Division of Emergency Operations (DEO), and the Georgia Tech Research Institute have collaborated on the advancement of PH SA through the development of new approaches to using semantic analysis for social media.

Methods
To understand how computational methods may benefit SS and EBS, we studied an iterative refinement process in which the data user actively cultivated text-based topics ("semantic culling") in a semi-automated SS process. This human-in-the-loop process was critical for creating accurate and efficient extraction functions in large, dynamic volumes of data. The general process involved identifying a set of expert-supplied keywords, which were used to collect an initial set of social media messages. For the purposes of this analysis, researchers applied topic modeling to categorize related messages into clusters. Topic modeling uses statistical techniques to semantically cluster messages and automatically determine salient aggregations. A user then semantically culled messages according to their PH relevance.

In June 2016, researchers collected 7,489 worldwide English-language Twitter messages (tweets) and compared three sampling methods: a baseline random sample (C1, n=2700), a keyword-based sample (C2, n=2689), and one gathered after semantically culling C2 topics of irrelevant messages (C3, n=2100). Researchers utilized a software tool, Luminoso Compass [4], to sample and perform topic modeling using its real-time modeling and Twitter integration features. For C2 and C3, researchers sampled tweets that the Luminoso service matched to both clinical and layman definitions of Rash and Gastro-Intestinal syndromes [5], and Zika-like symptoms. Layman terms were derived from clinical definitions using plain-language medical thesauri. ANOVA statistics were calculated using SPSS software. Post-hoc pairwise comparisons were completed using Tukey's honest significant difference (HSD) test.

Results
An ANOVA found the following mean relevance values: 3% (+/- 0.01%), 24% (+/- 6.6%), and 27% (+/- 9.4%) for C1, C2, and C3, respectively. Post-hoc pairwise comparison tests showed that the percentages of discovered messages related to the event tweets using the C2 and C3 methods were significantly higher than for the C1 method (random sampling) (p<0.05). This indicates that the human-in-the-loop approach provides benefits in filtering social media data for SS and EBS. Notably, this increase rests on a single iteration of semantic culling; subsequent iterations could be expected to increase the benefits.

Conclusions
This work demonstrates the benefits of incorporating non-traditional data sources into SS and EBS. It was shown that an NLP-based extraction method in combination with human-in-the-loop semantic analysis may enhance the potential value of social media (Twitter) for SS and EBS. It also supports the claim that advanced analytical tools for processing non-traditional SA, SS, and EBS sources, including social media, have the potential to enhance disease detection, risk assessment, and decision support by reducing the time it takes to identify public health events.
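The following is a minimal sketch of the semi-automated culling loop described above, using scikit-learn's LDA in place of the Luminoso Compass tool; the tweets, topic count, and relevance judgments are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical keyword-matched tweets (stand-ins for a C2-style sample).
tweets = ["bad rash after the beach", "rash guard sale today",
          "stomach bug going around school", "new phone has a screen rash"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Cluster messages into topics for the analyst to review.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")  # analyst inspects these for PH relevance

# Semantic culling: suppose the analyst marks topic 1 as irrelevant;
# messages dominated by that topic are dropped before the next iteration.
irrelevant = {1}
doc_topics = lda.transform(X).argmax(axis=1)
kept = [t for t, z in zip(tweets, doc_topics) if z not in irrelevant]
print(kept)
```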


Author(s):  
R. Derbanosov ◽  
M. Bakhanova

Probabilistic topic modeling is a tool for statistical text analysis that can reveal the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), produce topics in the form of discrete distributions over the set of all words in the corpus. They build topics using an iterative algorithm that starts from a random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to this random initialization: runs started from different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of such additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant loss of model quality.
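For readers unfamiliar with modalities in BigARTM, here is a hedged sketch of how side information (hypothetical document labels) can be attached as an extra modality; file paths, weights, and the topic count are placeholders, not the authors' experimental setup.

```python
import artm

# Documents in Vowpal Wabbit format, with word tokens in the default
# modality and labels in a '@labels' modality (paths are placeholders).
batch_vectorizer = artm.BatchVectorizer(
    data_path="docs.vw", data_format="vowpal_wabbit", target_folder="batches")

model = artm.ARTM(
    num_topics=20,
    # '@default_class' holds the word tokens; '@labels' carries the side
    # information, up-weighted here relative to plain words.
    class_ids={"@default_class": 1.0, "@labels": 5.0},
    seed=42)

model.initialize(dictionary=batch_vectorizer.dictionary)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=20)
```

Fixing the seed controls one run's initialization; the paper's stability claim concerns how much solutions vary across different seeds, which the extra modality is shown to reduce.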


Author(s):  
Junaid Rashid ◽  
Syed Muhammad Adnan Shah ◽  
Aun Irtaza

Topic modeling is an effective text mining and information retrieval approach to organizing knowledge with various contents under a specific topic. Text documents in the form of news articles are growing rapidly on the web, and their analysis is important in the fields of text mining and information retrieval. Extracting meaningful information from these documents is a challenging task. Topic modeling is one approach for discovering themes in text documents, but it still needs new perspectives to improve its performance. In topic modeling, documents are mixtures of topics, and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach based on the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters-21578 and BBC News, show that KTM performs better than state-of-the-art topic models such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining, where it achieves higher performance than its competitors LDA and LSA.
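The abstract does not give the KTM algorithm itself, but the general idea of k-means-based topic extraction can be sketched as follows: cluster TF-IDF document vectors with k-means and read each centroid's heaviest terms as a topic. This is an illustration in the spirit of KTM, not the authors' exact method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy news snippets from two themes (finance, sport).
docs = ["stocks fell as markets reacted", "the striker scored twice",
        "bond yields rose on inflation", "the match ended in a draw"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Each k-means cluster plays the role of one topic.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, centroid in enumerate(km.cluster_centers_):
    top = [terms[i] for i in centroid.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```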


2021 ◽  
Vol 40 (3) ◽  
Author(s):  
HyunSeung Koh ◽  
Mark Fienup

Library chat services are an increasingly important communication channel connecting patrons to library resources and services. Analysis of chat transcripts could provide librarians with insights for improving services. Unfortunately, chat transcripts consist of unstructured text data, making it impractical for librarians to go beyond simple quantitative analysis (e.g., chat duration, message count, word frequencies) with existing tools. As a stepping-stone toward a more sophisticated chat transcript analysis tool, this study investigated the application of different types of topic modeling techniques to one academic library's chat reference data, collected from April 10, 2015, to May 31, 2019, with the goal of extracting the most accurate and easily interpretable topics. In this study, topic accuracy and interpretability (the quality of topic outcomes) were quantitatively measured with topic coherence metrics. Additionally, accuracy and interpretability were qualitatively assessed by the librarian author of this paper, based on subjective judgment of whether topics aligned with frequently asked questions or easily inferable themes in academic library contexts. The study found that, by the human's qualitative evaluation, probabilistic Latent Semantic Analysis (pLSA) produced more accurate and interpretable topics, a result not necessarily aligned with the quantitative evaluation under all three types of topic coherence metrics. Interestingly, the commonly used Latent Dirichlet Allocation (LDA) did not necessarily perform better than pLSA. Also, semi-supervised techniques with human-curated anchor words, Correlation Explanation (CorEx) and guided LDA (GuidedLDA), did not necessarily perform better than an unsupervised technique, Dirichlet Multinomial Mixture (DMM). Last, the study found that across the different techniques, using the entire transcript, including both sides of the interaction between the library patron and the librarian, yielded higher-quality topic outcomes than using only the patron's initial question.
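As a sketch of the quantitative side of such an evaluation, gensim's CoherenceModel can score a trained model under several coherence metrics. The toy transcripts below are invented, and the choice of c_v, u_mass, and c_npmi is an assumption; the abstract does not name the three metric types used.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy tokenized chat transcripts (invented for illustration).
texts = [["renew", "book", "due", "date"], ["find", "article", "database"],
         ["printer", "jam", "help"], ["renew", "laptop", "loan"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Score the same model under three coherence metrics; real comparisons
# would of course use a far larger corpus.
for metric in ("c_v", "u_mass", "c_npmi"):
    cm = CoherenceModel(model=lda, texts=texts, corpus=corpus,
                        dictionary=dictionary, coherence=metric)
    print(metric, cm.get_coherence())
```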


2020 ◽  
Vol 24 ◽  
pp. 43-62
Author(s):  
Yamel Pérez-Guadarramas ◽  
Manuel Barreiro-Guerrero ◽  
Alfredo Simón-Cuevas ◽  
Francisco P. Romero ◽  
José A. Olivas

Automatic keyphrase extraction from texts is useful for many computational systems in the fields of natural language processing and text mining. Although a number of solutions to this problem have been described, semantic analysis is one of the least exploited linguistic features in the most widely-known proposals, causing the results obtained to have low accuracy and performance rates. This paper presents an unsupervised method for keyphrase extraction based on the use of lexico-syntactic patterns for extracting information from texts, together with fuzzy topic modeling. An OWA operator combining several semantic measures was applied in the topic modeling process. The new approach was evaluated on the Inspec and 500N-KPCrowd datasets. Several variants of the proposal were evaluated against each other, and a statistical analysis was performed to identify the best one. This best variant was also compared with other reported systems, giving promising results.
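An OWA (ordered weighted averaging) operator reorders its inputs before weighting them, so the i-th weight applies to the i-th largest score rather than to a fixed measure. A minimal sketch follows, with hypothetical weights and similarity scores:

```python
def owa(scores, weights):
    """Sort scores descending, then take the weighted sum: the i-th
    weight applies to the i-th largest score, not to a fixed measure."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, sorted(scores, reverse=True)))

# Three semantic measures scoring one candidate (hypothetical values):
similarities = [0.82, 0.40, 0.65]
# Optimistic weighting: emphasize the strongest evidence.
print(owa(similarities, weights=[0.6, 0.3, 0.1]))
```

Choosing the weight vector tunes the operator between a max (all weight on the largest score) and a min (all weight on the smallest), which is what makes it a natural aggregator for heterogeneous semantic measures.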


Entropy ◽  
2019 ◽  
Vol 21 (7) ◽  
pp. 660 ◽  
Author(s):  
Sergei Koltcov ◽  
Vera Ignatenko ◽  
Olessia Koltsova

Topic modeling is a popular approach for clustering text documents. However, current tools suffer from a number of unsolved problems, such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to partially solve the problems of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which is able to account for just one of the parameters of interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters while simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical ground.
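For reference, the two-parameter Sharma–Mittal entropy of a distribution p is S_{q,r}(p) = ((Σ_i p_i^q)^((1-r)/(1-q)) - 1)/(1 - r). A small sketch computing it over topic-word distributions follows; the (q, r) values are illustrative, and how the entropy is tied to selecting the number of topics is the paper's contribution, not reproduced here.

```python
import numpy as np

def sharma_mittal_entropy(p, q, r):
    """S_{q,r}(p) = ((sum_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r).
    Reduces to Renyi entropy as r -> 1, Tsallis at r = q, and
    Shannon as both q, r -> 1 (q != 1, r != 1 assumed here)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore zero-probability words
    s = np.sum(p ** q)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

uniform = np.full(1000, 1 / 1000)          # maximally unspecific topic
peaked = np.array([0.91] + [0.01] * 9)     # sharply concentrated topic
print(sharma_mittal_entropy(uniform, q=2.0, r=0.5))  # high entropy
print(sharma_mittal_entropy(peaked, q=2.0, r=0.5))   # low entropy
```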


Author(s):  
Pooja Kherwa ◽  
Poonam Bansal

The Covid-19 pandemic is the deadliest outbreak in living memory, so there is an urgent need to prepare the world with strategies to prevent and control the impact of such epidemics. In this paper, a novel semantic pattern detection approach for the Covid-19 literature, using contextual clustering and intelligent topic modeling, is presented. For contextual clustering, three-level weights (term level, document level, and corpus level) are used with latent semantic analysis. For intelligent topic modeling, semantic collocations are selected using pointwise mutual information (PMI) and log-frequency biased mutual dependency (LBMD), and latent Dirichlet allocation is applied. Contextual clustering with latent semantic analysis yields semantic spaces with highly correlated terms at the corpus level. Through intelligent topic modeling, topics improve, showing lower perplexity and higher coherence. This research helps identify knowledge gaps in Covid-19 research and offers directions for future work.
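Below is a hedged sketch of the two collocation scores named above. The PMI formula is standard; for LBMD the sketch follows the log-frequency biased mutual dependency of Thanopoulos et al. (mutual dependency plus a log-frequency term), and whether the paper uses this exact variant is an assumption. All counts are toy values.

```python
import math

def pmi(p_xy, p_x, p_y):
    # Pointwise mutual information: log of observed vs. independent
    # co-occurrence probability.
    return math.log2(p_xy / (p_x * p_y))

def lbmd(p_xy, p_x, p_y):
    # Mutual dependency log2(p_xy^2 / (p_x * p_y)), biased toward
    # frequent pairs by adding log2(p_xy). Assumed LBMD variant.
    return math.log2(p_xy ** 2 / (p_x * p_y)) + math.log2(p_xy)

n = 100_000  # total bigram tokens in the corpus (toy value)
p_xy, p_x, p_y = 300 / n, 1200 / n, 500 / n  # e.g., "viral load"
print("PMI :", pmi(p_xy, p_x, p_y))
print("LBMD:", lbmd(p_xy, p_x, p_y))
```

The frequency bias is the point of LBMD here: plain PMI favors rare, possibly noisy pairs, while LBMD keeps collocations that are both associated and well attested in the corpus.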

