Topic Modelling in Bangla Language: An LDA Approach to Optimize Topics and News Classification

2018 ◽  
Vol 11 (4) ◽  
pp. 77 ◽  
Author(s):  
Malek Mouhoub ◽  
Mustakim Al Helal

Topic modeling is a powerful technique for unsupervised analysis of large document collections. Topic models have a wide range of applications, including tag recommendation, text categorization, keyword extraction, and similarity search, in text mining, information retrieval, and statistical language modeling. Research on topic modeling is steadily gaining popularity. Various efficient topic modeling techniques are available for English, one of the most widely spoken languages in the world, but not for many other languages. Bangla is the seventh most spoken native language in the world by population, and it still needs automation in many respects. This paper deals with finding the core topics of a Bangla news corpus and classifying news with similarity measures. The document models are built using LDA (Latent Dirichlet Allocation) with bigrams.
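As a rough illustration of the approach this abstract describes, the sketch below adds bigrams to tokenized documents, fits LDA with a small collapsed Gibbs sampler, and compares documents by the cosine similarity of their topic mixtures. The sampler, the hyperparameters (`alpha`, `beta`, `n_iter`), the function names, and the toy English tokens are all assumptions made for illustration; the abstract does not say which inference method, similarity measure, or settings the authors used.

```python
import math
import random
from collections import Counter


def with_bigrams(tokens):
    """Return the unigrams followed by adjacent-pair bigrams joined by '_'."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                        # tokens assigned to topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical English tokens standing in for Bangla news text).
docs = [
    with_bigrams("cricket team wins match".split()),
    with_bigrams("cricket match ends in draw".split()),
    with_bigrams("parliament passes budget bill".split()),
    with_bigrams("budget bill debated in parliament".split()),
]
theta = lda_gibbs(docs, n_topics=2, n_iter=50)
sims = [[cosine(a, b) for b in theta] for a in theta]
```

A new article could then be assigned the label of the existing article whose topic mixture is most similar to its own, which is one plausible reading of "classifying news with similarity measures".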


2021 ◽  
Vol 4 ◽  
Author(s):  
Prashanth Rao ◽  
Maite Taboada

We present a topic modelling and data visualization methodology to examine gender-based disparities in news articles by topic. Existing research in topic modelling is largely focused on the text mining of closed corpora, i.e., those that include a fixed collection of composite texts. We showcase a methodology to discover topics via Latent Dirichlet Allocation, which can reliably produce human-interpretable topics over an open news corpus that continually grows with time. Our system generates topics, or distributions of keywords, for news articles on a monthly basis, to consistently detect key events and trends aligned with events in the real world. Findings from two years' worth of news articles in mainstream English-language Canadian media indicate that certain topics feature either women or men more prominently and exhibit different types of language. Perhaps unsurprisingly, topics such as lifestyle, entertainment, and healthcare tend to be prominent in articles that quote more women than men. Topics such as sports, politics, and business are characteristic of articles that quote more men than women. The data show a self-reinforcing gendered division of duties and representation in society. Quoting female sources more frequently in a caregiving role and quoting male sources more frequently in political and business roles enshrines women’s status as caregivers and men’s status as leaders and breadwinners. Our results can help journalists and policy makers better understand the unequal gender representation of those quoted in the news and facilitate news organizations’ efforts to achieve gender parity in their sources. The proposed methodology is robust, reproducible, and scalable to very large corpora, and can be used for similar studies involving unsupervised topic modelling and language analyses.



2019 ◽  
Vol 74 (1) ◽  
pp. 20-29 ◽  
Author(s):  
Kun Kim ◽  
Ounjoung Park ◽  
Jacob Barr ◽  
Haejung Yun

Purpose: The purpose of this research is to analyze the shifting perceptions of international tourists to Jeju Island and provide practical lessons to the tourism industry. Specifically, in regard to three United Nations Educational, Scientific and Cultural Organization (UNESCO) natural World Heritage sites in Jeju, this research measures the most salient topics mentioned by tourists to inform a more accurate perception of the island’s most valuable natural assets as reported by tourism experiences. Design/methodology/approach: This study used a Web crawler to gather over 1,500 English language reviews from international tourists from a famous travel information website. The collected data were then preprocessed with stemming and lemmatization. After this, the processed text data were analyzed through a latent Dirichlet allocation (LDA)-based topic modeling approach to identify the most prominent clusters of ideas mentioned and represent them visually through graphs, tables and charts. Findings: The findings from this research suggest that there are ten identifiable topics. Topics focusing on “adventure,” “summits” and “winter” showed noticeable increases, whereas topics focusing on “sunrise peak” and “UNESCO” have decreased over time. There is a trend for international tourists to be ever more conscious of the adventurous and rugged aspects of Jeju, and the novelty of mentioning UNESCO status seems to have worn off. Furthermore, tourists tend to mention “worth” and “enjoy” more as time goes on. Originality/value: This study applies LDA-based topic modeling and LDAvis using user-generated online reviews with time-series analyses. Consequently, it provides unique insights into the changing perceptions of ecotourism on Jeju today and contributes to the field of smart tourism.



2020 ◽  
Vol 8 ◽  
pp. 439-453 ◽  
Author(s):  
Adji B. Dieng ◽  
Francisco J. R. Ruiz ◽  
David M. Blei

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
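The word distribution described in this abstract can be sketched in a few lines: for a given topic, each vocabulary word gets a logit equal to the inner product of its embedding with the topic embedding, and a softmax turns the logits into a categorical distribution. The function name, the two-dimensional embeddings, and the three-word vocabulary below are assumptions chosen only to keep the example small; the sketch does not cover how the model is fitted.

```python
import math


def topic_word_distribution(word_embeddings, topic_embedding):
    """Softmax over inner products of each word embedding with a topic embedding."""
    logits = [
        sum(r * a for r, a in zip(rho, topic_embedding))
        for rho in word_embeddings
    ]
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional embeddings for a 3-word vocabulary.
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
topic_embedding = [2.0, 0.0]
probs = topic_word_distribution(word_embeddings, topic_embedding)
```

Words whose embeddings point in the same direction as the topic embedding receive more probability mass, which is how rare words can still land in a sensible topic when their embeddings are informative.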



2021 ◽  
Vol 15 (5) ◽  
pp. 169-183
Author(s):  
Eunhee Park

This study investigated the research trends of college English education in Korea from 2001 to 2020. The data were collected using a Biblio data collector, and a total of 313 papers were analyzed. The data were analyzed using frequency analysis, LDA (Latent Dirichlet Allocation), and time series analysis. The findings are summarized as follows. First, the number of research papers on college English education increased significantly over the 20 years. Second, in analyzing the topics of the chosen papers, a total of 10 topics in college English education were found. The topics were “curriculum and level-differentiated programs (T1)”, “learners’ affective factors (T2)”, “assessment and learning strategies (T3)”, “teachers’ factors (T4)”, “English vocabulary, grammar and writing (T5)”, “English for specific purposes (T6)”, “teaching and learning methods (T7)”, “web-based learning (T8)”, “learner-centered education (T9)”, and “textbook analysis etc. (T10).” Among these topics, the three that were identified as increasing in popularity were “learners’ affective factors (T2)”, “English for specific purposes (T6)”, and “learner-centered education (T9).” These topics shared one key characteristic: they were related to learners’ factors such as the learners’ motivation, the learners’ goals, and the learners’ activities in class. This study is meaningful in that it collected a wide range of data related to college English education in Korea and produced more reliable results by using big data-based LDA topic modeling techniques.



2020 ◽  
Author(s):  
Sunil Nagpal ◽  
Divyanshu Srivastava ◽  
Sharmila S. Mande

ABSTRACT Topic modeling is frequently employed for discovering structures (or patterns) in a corpus of documents. Its utility in text-mining and document retrieval tasks in various fields of scientific research is rather well known. An unsupervised machine learning approach, Latent Dirichlet Allocation (LDA) has particularly been utilized for identifying latent (or hidden) topics in document collections and for deciphering the words that define one or more topics using a generative statistical model. Here we describe how SARS-CoV-2 genomic mutation profiles can be structured into a ‘Bag of Words’ to enable identification of signatures (topics) and their probabilistic distribution across various genomes using LDA. Topic models were generated using ~47000 novel coronavirus genomes (considered as documents), leading to identification of 16 amino acid mutation signatures and 18 nucleotide mutation signatures (equivalent to topics) in the corpus of chosen genomes through coherence optimization. The document assumption for genomes also helped in identification of contextual nucleotide mutation signatures in the form of conventional N-grams (e.g. bi-grams and tri-grams). We validated the signatures obtained using LDA driven method against the previously reported recurrent mutations and phylogenetic clades for genomes. Additionally, we report the geographical distribution of the identified mutation signatures in SARS-CoV-2 genomes on the global map. Use of the non-phylogenetic albeit classical approaches like topic modeling and other data centric pattern mining algorithms is therefore proposed for supplementing the efforts towards understanding the genomic diversity of the evolving SARS-CoV-2 genomes (and other pathogens/microbes).
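To make the "genomes as documents" idea concrete, the sketch below turns a per-genome list of mutations into a bag of words, optionally adding N-grams of adjacent mutations. The mutation labels (`m1`, `m2`, ...), the `|` separator, and the function names are placeholders invented for illustration; the abstract does not specify how the authors encoded or ordered mutations.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous n-grams of tokens, joined by '|'."""
    return ["|".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def genome_to_bag(mutations, max_n=2):
    """Count unigrams up to max_n-grams for one genome's ordered mutation list."""
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(mutations, n))
    return bag


# Hypothetical mutation profiles, assumed to be ordered by genome position.
genomes = {
    "genome_1": ["m1", "m2", "m3"],
    "genome_2": ["m1", "m3"],
}
bags = {name: genome_to_bag(muts) for name, muts in genomes.items()}
```

Each resulting bag plays the role of one document's word counts, so the collection can be passed to any LDA implementation in the usual way.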



10.2196/21978 ◽  
2020 ◽  
Vol 6 (4) ◽  
pp. e21978
Author(s):  
Sakun Boon-Itt ◽  
Yukolpat Skunkan

Background: COVID-19 is a scientifically and medically novel disease that is not fully understood because it has yet to be consistently and deeply studied. Among the gaps in research on the COVID-19 outbreak, there is a lack of sufficient infoveillance data. Objective: The aim of this study was to increase understanding of public awareness of COVID-19 pandemic trends and uncover meaningful themes of concern posted by Twitter users in the English language during the pandemic. Methods: Data mining was conducted on Twitter to collect a total of 107,990 tweets related to COVID-19 between December 13 and March 9, 2020. The analyses included frequency of keywords, sentiment analysis, and topic modeling to identify and explore discussion topics over time. A natural language processing approach and the latent Dirichlet allocation algorithm were used to identify the most common tweet topics as well as to categorize clusters and identify themes based on the keyword analysis. Results: The results indicate three main aspects of public awareness and concern regarding the COVID-19 pandemic. First, the trend of the spread and symptoms of COVID-19 can be divided into three stages. Second, the results of the sentiment analysis showed that people have a negative outlook toward COVID-19. Third, based on topic modeling, the themes relating to COVID-19 and the outbreak were divided into three categories: the COVID-19 pandemic emergency, how to control COVID-19, and reports on COVID-19. Conclusions: Sentiment analysis and topic modeling can produce useful information about the trends in the discussion of the COVID-19 pandemic on social media as well as alternative perspectives to investigate the COVID-19 crisis, which has created considerable public awareness. This study shows that Twitter is a good communication channel for understanding both public concern and public awareness about COVID-19. These findings can help health departments communicate information to alleviate specific public concerns about the disease.



2018 ◽  
Vol 226 (1) ◽  
pp. 3-13 ◽  
Author(s):  
André Bittermann ◽  
Andreas Fischer

Abstract. Latent topics and trends in psychological publications were examined to identify hotspots in psychology. Topic modeling was contrasted with a classification-based scientometric approach in order to demonstrate the benefits of the former. Specifically, the psychological publication output in the German-speaking countries containing German- and English-language publications from 1980 to 2016 documented in the PSYNDEX database was analyzed. Topic modeling based on latent Dirichlet allocation (LDA) was applied to a corpus of 314,573 publications. Input for topic modeling was the controlled terms of the publications, that is, a standardized vocabulary of keywords in psychology. Based on these controlled terms, 500 topics were determined and trending topics were identified. Hot topics, indicated by the highest increasing trends in this data, were facets of neuropsychology, online therapy, cross-cultural aspects, traumatization, and visual attention. In conclusion, the findings indicate that topics can reveal more detailed insights into research trends than standardized classifications. Possible applications of this method, limitations, and implications for research synthesis are discussed.
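One simple way to rank topics by "increasing trend" is to fit a least-squares line to each topic's yearly share of publications and sort by slope. The sketch below does exactly that; it is an assumed stand-in, since the abstract does not state which trend measure the authors used, and the topic names and yearly shares are invented toy values.

```python
def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical yearly topic shares (fraction of publications per year).
years = [2013, 2014, 2015, 2016]
topic_share = {
    "topic_a": [0.01, 0.02, 0.03, 0.04],
    "topic_b": [0.05, 0.05, 0.04, 0.04],
}
slopes = {name: linear_slope(years, share) for name, share in topic_share.items()}
# Topics sorted from the steepest increase to the steepest decrease.
ranked = sorted(slopes, key=slopes.get, reverse=True)
```

With 500 topics, the head of `ranked` would be the candidate "hot topics" and the tail the declining ones.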



2021 ◽  
pp. 1-16
Author(s):  
Ibtissem Gasmi ◽  
Mohamed Walid Azizi ◽  
Hassina Seridi-Bouchelaghem ◽  
Nabiha Azizi ◽  
Samir Brahim Belhaouari

A Context-Aware Recommender System (CARS) suggests more relevant services by adapting them to the user’s specific context. Nevertheless, using many contextual factors can increase data sparsity, while using too few fails to capture contextual effects in recommendations. Moreover, several CARSs are based on similarity measures, such as cosine similarity and the Pearson correlation coefficient. These methods are not very effective on sparse datasets. This paper presents a context-aware model that integrates contextual factors into the prediction process when there are insufficient co-rated items. The proposed algorithm uses Latent Dirichlet Allocation (LDA) to learn the latent interests of users from the textual descriptions of items. Then, it integrates both the explicit contextual factors and their degree of importance in the prediction process by introducing a weighting function. The particle swarm optimization (PSO) algorithm is employed to learn and optimize the weights of these features. The results on the MovieLens 1M dataset show that the proposed model can achieve an F-measure of 45.51% with a precision of 68.64%. Furthermore, the improvements in MAE and RMSE reach 41.63% and 39.69%, respectively, compared with state-of-the-art techniques.
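The weakness of similarity measures on sparse data can be seen in a small example. The sketch below computes cosine similarity and the Pearson correlation over the items two users have both rated; with a single co-rated item, cosine similarity is trivially 1.0 and Pearson is undefined. The user names, item identifiers, ratings, and the choice to return `None` for undefined cases are assumptions made for illustration and are not taken from the paper.

```python
import math


def co_rated(u, v):
    """Items rated by both users, in a stable order."""
    return sorted(set(u) & set(v))


def cosine_sim(u, v):
    """Cosine similarity over co-rated items; None if there are none."""
    items = co_rated(u, v)
    if not items:
        return None
    dot = sum(u[i] * v[i] for i in items)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in items))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in items))
    return dot / (norm_u * norm_v)


def pearson_sim(u, v):
    """Pearson correlation over co-rated items; None when it is undefined."""
    items = co_rated(u, v)
    if len(items) < 2:
        return None
    mean_u = sum(u[i] for i in items) / len(items)
    mean_v = sum(v[i] for i in items) / len(items)
    cov = sum((u[i] - mean_u) * (v[i] - mean_v) for i in items)
    dev_u = math.sqrt(sum((u[i] - mean_u) ** 2 for i in items))
    dev_v = math.sqrt(sum((v[i] - mean_v) ** 2 for i in items))
    if dev_u == 0 or dev_v == 0:
        return None
    return cov / (dev_u * dev_v)


# Hypothetical ratings on a 1-5 scale, keyed by item identifier.
alice = {"item_1": 5, "item_2": 1}
bob = {"item_1": 1, "item_3": 4}      # shares only item_1 with alice
carol = {"item_1": 4, "item_2": 2}    # shares two items with alice

sparse_cosine = cosine_sim(alice, bob)
sparse_pearson = pearson_sim(alice, bob)
dense_pearson = pearson_sim(alice, carol)
```

Here `alice` and `bob` disagree sharply on their one shared item, yet their cosine similarity is perfect, which is the kind of failure that motivates learning latent interests from item descriptions instead.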


