A Recommendation Mechanism for Under-Emphasized Tourist Spots Using Topic Modeling and Sentiment Analysis

With rapid advancements in internet applications, the growth rate of recommendation systems for tourists has skyrocketed. This has generated an enormous amount of travel-based data in the form of reviews, blogs, and ratings. However, most recommendation systems only recommend the top-rated places. Along with the top-ranked places, we aim to discover places that are often ignored by tourists owing to lack of promotion or effective advertising, referred to as under-emphasized locations. In this study, we use all relevant data, such as travel blogs, ratings, and reviews, in order to obtain optimal recommendations. We also aim to discover the latent factors that need to be addressed, such as food, cleanliness, and opening hours, and recommend a tourist place based on user history data. In this study, we propose a cross mapping table approach based on the location’s popularity, ratings, latent topics, and sentiments. An objective function for recommendation optimization is formulated based on these mappings. The baseline algorithms are latent Dirichlet allocation (LDA) and support vector machine (SVM). Our results show that the combined features of LDA, SVM, ratings, and cross mappings are conducive to enhanced performance. The main motivation of this study was to help tourist industries to direct more attention towards designing effective promotional activities for under-emphasized locations.

Analysis and Visualization Latent Topic on COVID-19 Vaccine Tweet use two-stage topic modeling (Preprint)

10.2196/preprints.30290 ◽ 2021

Author(s): Faizah Faizah ◽ Bor-Shen Lin

Keyword(s): Topic Modeling ◽ Public Perception ◽ Latent Dirichlet Allocation ◽ World Health ◽ Two Stage ◽ The Public ◽ Global Pandemic ◽ Difficult Time ◽ Latent Topic ◽ Latent Topics

BACKGROUND: The World Health Organization (WHO) declared COVID-19 a public health emergency of international concern on January 30, 2020, and characterized it as a pandemic on March 11, 2020. The pandemic is not yet over, and in the first quarter of 2021 some countries faced a third wave. During this difficult time, the development of COVID-19 vaccines accelerated rapidly. Understanding public perception of the COVID-19 vaccine from data collected on social media can widen the perspective on the state of the global pandemic.

OBJECTIVE: This study explores and analyzes the latent topics in COVID-19 vaccine tweets posted by individuals from various countries, using two-stage topic modeling.

METHODS: A two-stage topic modeling analysis was proposed to investigate people's reactions in five countries. The first stage is Latent Dirichlet Allocation (LDA), which produces the latent topics and their term distributions, helping investigators understand the main issues or opinions. The second stage performs agglomerative clustering on the latent topics based on Hellinger distance, merging close topics hierarchically into topic clusters so that they can be visualized in either tree or graph views.

RESULTS: In general, the topics discussed regarding the COVID-19 vaccine are similar across the five countries. Themes such as "first vaccine" and "vaccine effect" dominate the public discussion. Notably, some countries show distinctive themes, such as "politician opinion" and "stay home" in Canada, "emergency" in India, and "blood clots" in the United Kingdom. The analysis also shows which COVID-19 vaccine is the most popular and is gaining more public interest.

CONCLUSIONS: With LDA and hierarchical clustering, two-stage topic modeling is powerful for visualizing the latent topics and understanding public perception of the COVID-19 vaccine.
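The second stage described in this abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: the four toy topic-term distributions, the average-linkage rule, and the stopping threshold of 0.5 are hypothetical choices, not details taken from the paper. It computes the Hellinger distance between topics and merges the closest ones hierarchically, recording each merge so that a tree view could be drawn from it.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def agglomerate(topics, threshold):
    """Average-linkage agglomerative clustering of topics (assumed linkage rule).

    Returns the final clusters (lists of topic indices) and the merge history.
    Merging stops once the closest pair of clusters is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(topics))]
    merges = []

    def linkage(a, b):
        total = sum(hellinger(topics[i], topics[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, x, y = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if dist > threshold:
            break
        merges.append((clusters[x], clusters[y], dist))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges


# Hypothetical topic-term distributions over a four-word vocabulary.
topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
    [0.0, 0.1, 0.3, 0.6],
]
clusters, merges = agglomerate(topics, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

With real data, the rows of `topics` would be the per-topic term distributions produced by the LDA stage, and the merge history would feed a dendrogram or graph layout.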

Understanding vaccine hesitancy with application of Latent Dirichlet Allocation to Reddit Corpora

10.21203/rs.3.rs-616664/v1 ◽ 2021

Author(s): Samuel Duraivel ◽ Lavanya R

Keyword(s): Global Warming ◽ Side Effects ◽ Latent Dirichlet Allocation ◽ Vaccine Hesitancy ◽ Latent Factors ◽ Racial Background ◽ Racial Injustice ◽ Latent Topics ◽ Mass Surveillance ◽ Dirichlet Allocation

Abstract: This research paper explores the underlying factors that contribute to vaccine hesitancy, resistance, and refusal. Using Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic model, we generated latent topics from user-generated Reddit corpora on reasons for vaccine hesitancy. Although we hoped to explore the grounds for vaccine hesitancy across the globe, our findings suggest that the corpus used for analysis had been generated by users living predominantly in the United States. Observation of the topics generated by the LDA model led to the discovery of the following latent factors: (i) fear of risks and side effects, (ii) lack of trust in policymakers, (iii) religious beliefs, (iv) mass surveillance theories, (v) perception of vaccination as a precursor to totalitarianism, (vi) racial background, pertaining to past events of racial injustice such as selective sterilization, (vii) a depopulation agenda, fueled by theories linked to global warming and Extinction Rebellion, and (viii) perception of vaccination as a campaign to quell immigrant population growth, fueled by reports of coerced sterilization of immigrants in ICE detention.

Authorship verification

10.12681/eadd/45382 ◽ 2019

Author(s): Νεκταρία Πόθα

Keyword(s): Cyber Security ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ State Of The Art ◽ Latent Semantic Indexing ◽ Semantic Indexing ◽ Authorship Verification ◽ Latent Topics ◽ Authorship Analysis ◽ Dirichlet Allocation

The field of authorship analysis aims to extract information about the authors of digital texts. It is directly linked to many applications, since it can be used to analyze texts of any genre: literary works, newspaper articles, social media posts, and so on. Its application areas fall into the humanities (e.g., who wrote a literary work published anonymously, who wrote works published under a pseudonym, verifying the authorship of literary works by known authors), forensics (e.g., finding stylistic similarities between proclamations of terrorist groups, investigating the authenticity of a suicide note, revealing multiple social media accounts that belong to the same person), and cyber-security (e.g., finding stylistic similarities between users of multiple pseudonyms). A fundamental research area of authorship analysis is author verification: given a set of texts (in electronic form) by the same author (the candidate author), we must decide whether another text (of unknown or disputed authorship) was written by that author. Author verification has attracted particular interest in recent years, mainly owing to the PAN@CLEF experimental evaluations. Specifically, from 2013 to 2015 the PAN competitions focused on author verification, providing a well-organized collection of data (PAN corpora) and gathering a large number of methods for this purpose. However, the margin of error is quite large, since the performance of the methods depends on multiple factors such as text length, thematic relatedness between the texts, and stylistic relatedness between the texts.
The most demanding case arises when the texts of the known author belong to one genre (e.g., blogs or email messages) while the text under investigation belongs to another (e.g., a tweet or a newspaper article). Moreover, if the known author's texts and the text under investigation differ in topic (e.g., the known texts concern foreign policy and the unknown one cultural matters), the performance of current author verification methods is particularly low. The aim of this doctoral thesis is to develop effective and robust author verification methods that can handle even such complex cases. To this end, we present improved author verification methods and systematically examine their effectiveness on various benchmark datasets (PAN datasets and Enron data). First, we propose two improved algorithms. One follows the paradigm in which all available writing samples of the candidate author are treated individually, as separate representations (instance-based paradigm). The other follows the paradigm in which all writing samples of the candidate author are concatenated into a single text with a single representation (profile-based paradigm). Both achieve higher performance than the state of the art in verification on datasets covering a variety of languages (English, Greek, Spanish, Dutch) and genres (articles, reviews, novellas, etc.). It is important to stress that the proposed methods benefit significantly from the availability of multiple text samples of the candidate author and remain particularly robust and competitive when text length is limited. In addition, we investigate the usefulness of applying topic modeling to author verification.
Specifically, we conduct a systematic study to examine whether topic modeling techniques improve the performance of the most basic categories of verification methods, and which topic modeling technique is most suitable for each verification paradigm. For this purpose, we combine well-known topic modeling methods, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), with various author verification methods that cover the basic categories in this area: intrinsic methods, which treat verification as a one-class problem, and extrinsic methods, which turn verification into a two-class problem, each in combination with the profile-based and instance-based approaches. Using multiple evaluation datasets, we demonstrate that LDA combines better with extrinsic methods, whereas LSI performs better with the most effective intrinsic method. Furthermore, topic modeling techniques appear to be more effective when applied to methods that follow the profile-based paradigm, and their effectiveness is enhanced when the latent-topic information is extracted from an augmented set of texts (enriched with additional texts collected from external sources, e.g., the web, that are thematically close to the original dataset under examination). Comparison of our results with the state of the art in verification demonstrates the potential of the proposed methods. The proposed extrinsic methods are also particularly competitive when external texts of unknown genre are used. Some related studies indicate that heterogeneous ensembles of verification methods can provide very reliable solutions, better than any individual verification model on its own.
However, only very simple ensemble models, combining relatively few base methods, have been examined so far. To fill this gap, we consider a large set of base verification models (47 in total) covering the main paradigms and categories of methods in this area, and study how they can be combined to build an effective ensemble. We propose a simple stacking ensemble as well as an approach based on dynamic selection of models for each author verification case individually. Experimental results on multiple datasets confirm the suitability of the proposed methods and demonstrate their effectiveness. The improvement in performance achieved by the best of the reported models over the current state of the art is more than 10%.
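The difference between the profile-based and instance-based paradigms mentioned in this abstract can be illustrated with a small sketch. It is not the thesis's method: the character trigram features, the cosine similarity measure, and the example texts are all assumptions chosen for brevity. The profile-based score compares the unknown text with one representation built from all known texts concatenated; the instance-based score compares it with each known text separately and averages the results.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-gram counts (an assumed, simple stylometric representation)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def profile_score(known_texts, unknown_text):
    """Profile-based: concatenate all known samples into a single representation."""
    return cosine(ngrams(" ".join(known_texts)), ngrams(unknown_text))


def instance_score(known_texts, unknown_text):
    """Instance-based: compare with each known sample separately, then average."""
    unknown = ngrams(unknown_text)
    return sum(cosine(ngrams(k), unknown) for k in known_texts) / len(known_texts)


# Hypothetical writing samples of the candidate author and a disputed text.
known = ["the cat sat on the mat", "the cat ate the fish"]
disputed = "the cat sat on the mat"
print(profile_score(known, disputed), instance_score(known, disputed))
```

In a verification setting, either score would then be compared with a decision threshold (intrinsic view) or with scores obtained against texts by other authors (extrinsic view); both steps are left out here.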

Comparative Study on Perceived Trust of Topic Modeling Based on Affective Level of Educational Text

Applied Sciences ◽ 10.3390/app9214565 ◽ 2019 ◽ Vol 9 (21) ◽ pp. 4565 ◽ Cited By ~ 1

Author(s): Youngjae Im ◽ Jaehyun Park ◽ Minyeong Kim ◽ Kijung Park

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ Negative Mood ◽ Ability Test ◽ Perceived Trust ◽ Significant Difference ◽ Traditional Algorithm ◽ Independent Variable ◽ Latent Topics

Latent Dirichlet allocation (LDA) is a representative topic model for extracting keywords related to the latent topics embedded in a document set. Despite its effectiveness in finding underlying topics in documents, the traditional LDA algorithm has no process for reflecting sentiment in the text during topic extraction. Focusing on this issue, this study investigates the usability of both LDA and sentiment analysis (SA) algorithms based on the affective level of text. The study defines the affective level of a given set of paragraphs and analyzes the perceived trust of the methodologies with regard to usability. In our experiments, text from the college scholastic ability test was selected as the set of evaluation paragraphs, and the affective level of the paragraphs was manipulated into three levels (low, medium, and high) as an independent variable. The LDA algorithm was used to extract the keywords of each paragraph, while SA was used to identify the positive or negative mood of the extracted subject word. In addition, the subjects rated their perceived trust in each algorithm, and the study tests whether the score differs according to the affective level of the paragraphs. The results show that paragraphs with low affect lead to high perceived trust in LDA among the participants. However, the perceived trust in SA does not differ significantly between the affective levels. These findings indicate that LDA is more effective at finding topics in text that mainly contains objective information.

How to Identify Hot Topics in Psychology Using Topic Modeling

Zeitschrift für Psychologie ◽ 10.1027/2151-2604/a000318 ◽ 2018 ◽ Vol 226 (1) ◽ pp. 3-13 ◽ Cited By ~ 6

Author(s): André Bittermann ◽ Andreas Fischer

Keyword(s): Topic Modeling ◽ English Language ◽ Latent Dirichlet Allocation ◽ Publication Output ◽ Research Trends ◽ Cross Cultural ◽ Cultural Aspects ◽ German Speaking ◽ Latent Topics ◽ Increasing Trends

Abstract. Latent topics and trends in psychological publications were examined to identify hotspots in psychology. Topic modeling was contrasted with a classification-based scientometric approach in order to demonstrate the benefits of the former. Specifically, the psychological publication output in the German-speaking countries containing German- and English-language publications from 1980 to 2016 documented in the PSYNDEX database was analyzed. Topic modeling based on latent Dirichlet allocation (LDA) was applied to a corpus of 314,573 publications. Input for topic modeling was the controlled terms of the publications, that is, a standardized vocabulary of keywords in psychology. Based on these controlled terms, 500 topics were determined and trending topics were identified. Hot topics, indicated by the highest increasing trends in this data, were facets of neuropsychology, online therapy, cross-cultural aspects, traumatization, and visual attention. In conclusion, the findings indicate that topics can reveal more detailed insights into research trends than standardized classifications. Possible applications of this method, limitations, and implications for research synthesis are discussed.
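One simple way to rank topics by "increasing trend" is to fit a straight line to each topic's yearly share of publications and sort by slope. The sketch below assumes that approach; the abstract does not state how trends were computed, and the topic names and prevalence values are made-up placeholders.

```python
def slope(years, values):
    """Ordinary least-squares slope of values regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


def hot_topics(prevalence, top_n):
    """Return the top_n topic names with the steepest upward linear trend."""
    def trend(topic):
        points = sorted(prevalence[topic].items())  # (year, share) pairs in time order
        years = [year for year, _ in points]
        shares = [share for _, share in points]
        return slope(years, shares)

    return sorted(prevalence, key=trend, reverse=True)[:top_n]


# Hypothetical yearly prevalence (share of publications) for three topics.
prevalence = {
    "topic_a": {2012: 0.010, 2013: 0.012, 2014: 0.015, 2015: 0.019},
    "topic_b": {2012: 0.020, 2013: 0.020, 2014: 0.019, 2015: 0.018},
    "topic_c": {2012: 0.005, 2013: 0.006, 2014: 0.006, 2015: 0.007},
}
print(hot_topics(prevalence, top_n=2))
```

A linear slope favors topics that are already large; a relative measure such as slope divided by mean prevalence would be an alternative design choice.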

Exploring the generalizability of discriminant word items and latent topics in online tourist reviews

International Journal of Contemporary Hospitality Management ◽ 10.1108/ijchm-10-2015-0597 ◽ 2017 ◽ Vol 29 (2) ◽ pp. 803-816 ◽ Cited By ~ 15

Author(s): Astrid Dickinger ◽ Lidija Lalicic ◽ Josef Mazanec

Keyword(s): Latent Dirichlet Allocation ◽ Online Reviews ◽ Support Vector ◽ Content Type ◽ Tourism Management ◽ Hospitality And Tourism Management ◽ Vector Machines ◽ Latent Topics ◽ Review Reports ◽ Limited Generalizability

Purpose: Online reviews have been gaining relevance in hospitality and tourism management and represent an important research avenue for academia. This study aims to illustrate the discrimination between positive and negative reviews based on single word items and the sector-specific relevance of hidden topics.

Design/methodology/approach: By probing two parallel approaches of entirely unrelated analytical methods (penalized support vector machines and Latent Dirichlet Allocation), the analysts explore differences in language between favorable and unfavorable reviews in three service settings (hotels, restaurants and attractions).

Findings: The percentage of correctly predicted positive and negative review reports by means of individual word items does not decrease if reports from the three tourism businesses are analyzed together.

Originality/value: However, there is limited generalizability of the discriminant words across the three businesses. Also, the latent topics relevant for generating customers’ review reports differ significantly between the three sectors of tourism businesses.

Entity Profiling to Identify Actor Involvement in Topics of Social Media Content

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽ 10.22146/ijccs.59869 ◽ 2020 ◽ Vol 14 (4) ◽ pp. 417

Author(s): Puji Winar Cahyo ◽ Muhammad Habibi

Keyword(s): Social Media ◽ Support Vector Machine ◽ Sentiment Analysis ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ Online News ◽ Support Vector ◽ Media Content ◽ Positive Sentiment ◽ Negative Sentiment

The ease of using social media has changed how modern society communicates; people are more interested in talking through social media than in meeting in person. The volume of conversation on social media depends on the topic being discussed: the more interesting the topic, the more data it generates. These data can be analyzed to assess the influence of actors (account mentions) on the conversation. An actor's influence can be measured by how often the actor is mentioned in the conversation. This paper aims to conduct entity profiling on social media content to analyze an actor's influence on discussion. Furthermore, sentiment analysis can determine the sentiment toward an actor within a conversation topic. Latent Dirichlet Allocation (LDA) is used for topic modeling, while a Support Vector Machine (SVM) is used for sentiment analysis. The results show that topics with positive sentiment tend to involve disaster management accounts, while topics with negative sentiment tend to involve politicians, critics, and online news.
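Measuring an actor by how often the account is mentioned amounts to counting mentions across posts. A minimal sketch follows, assuming mentions are written as `@handle` and that handles are case-insensitive; the posts and account names are hypothetical, and the pattern is a simplification that would also match the domain part of an email address.

```python
import re
from collections import Counter

# Assumed mention syntax: "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_counts(posts):
    """Count how often each account is mentioned across all posts."""
    counts = Counter()
    for text in posts:
        counts.update(handle.lower() for handle in MENTION.findall(text))
    return counts


# Hypothetical posts and account names.
posts = [
    "Flood update from @agency_a: roads closed",
    "@agency_a thanks, and @news_b please verify",
    "Why is @news_b silent?",
    "@Agency_A evacuation map",
]
counts = mention_counts(posts)
print(counts.most_common(2))
```

Grouping the same counts by the topic and sentiment label assigned to each post would give a per-actor profile of the kind the abstract describes; those labels would come from the LDA and SVM steps, which are not shown here.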

What is the performance in public hospitals? A longitudinal analysis of performance plans through topic modeling

BMC Health Services Research ◽ 10.1186/s12913-021-06332-4 ◽ 2021 ◽ Vol 21 (1)

Author(s): Guido Noto ◽ Andrea Carlo Lo Verso ◽ Gustavo Barresi

Keyword(s): Acute Care ◽ Longitudinal Analysis ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ Population Based ◽ Public Hospitals ◽ Integration Of Care ◽ Latent Topics ◽ Analysis Of Performance

Abstract

Background: Both the concept of performance and the role of hospitals in health systems have evolved significantly in recent decades. Today, performance in health can be defined as the ability to create ‘population value,’ and the hospitals’ role is to support this aim by providing acute care and by integrating and coordinating their activity with other settings of care. This research aims to assess how, and to what degree, the management of public hospitals has embraced in practice the updated concept of performance and their new role.

Result: The paper analyses 181 performance plans of 48 Italian autonomous public hospitals over a nine-year period through the topic modeling algorithm called Latent Dirichlet Allocation (LDA), a method for analysing large textual corpora that generates a representation of the latent topics discussed therein. The concept of performance in public hospitals was framed into 15 topics resulting from the analysis of the hospitals’ performance plans. The prevalence of each topic was analysed over the period considered so as to understand the evolution of performance-related practices over the last decade.

Conclusion: In recent years, the concept of performance in hospitals has evolved toward the adoption of an outcome-based and population-based perspective. Additional effort should be devoted to improved collaboration and integration of care with other settings.

Topic Modeling of Committee Discussions in the Brazilian Chamber of Deputies

10.5753/kdmile.2021.17460 ◽ 2021

Author(s): M. A. dos Santos ◽ N. Andrade ◽ F. Morais

Keyword(s): Language Processing ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ Structured Data ◽ National Congress ◽ Similarities And Differences ◽ External Events ◽ Latent Topics ◽ Processing Techniques ◽ Chamber Of Deputies

Ensuring that civil society can monitor and supervise the actions of its representatives is essential to building strong democracies. Despite significant advances in transparency, Brazilian National Congress committees are presently difficult to follow and monitor because of the lack of open structured data about their discussions and the sheer volume of activity in these committees. This work makes two contributions in this context. First, we create and present an open dataset of structured speeches from the 25 standing committees of the Chamber of Deputies over the last two decades. Second, we use Natural Language Processing techniques, especially Latent Dirichlet Allocation (LDA), to identify the themes addressed in these committees. Based on these latent topics, we explore similarities and differences between the standing committees, their relationships, and how their debates change over time. Our results show that committees accommodate conversations that include both their main topic and opposing agendas, and describe how the topics discussed in the committees echo external events.

Topic Modeling for Amharic User Generated Texts

Information ◽ 10.3390/info12100401 ◽ 2021 ◽ Vol 12 (10) ◽ pp. 401

Author(s): Girma Neshir ◽ Andreas Rauber ◽ Solomon Atnafu

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ Sampling Technique ◽ Neural Nets ◽ Supervised Machine Learning ◽ Support Vector ◽ Topic Detection ◽ Learning Tools ◽ Statistical Process

Topic modeling is a statistical process that derives latent themes from extensive collections of text. Three approaches to topic modeling exist: unsupervised, semi-supervised, and supervised. In this work, we develop a supervised topic model for an Amharic corpus. We also investigate the effect of stemming on topic detection with Term Frequency Inverse Document Frequency (TF-IDF) features, Latent Dirichlet Allocation (LDA) features, and a combination of the two feature sets, using four supervised machine learning tools: Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR), and Neural Nets (NN). We evaluate our approach on an Amharic corpus of 14,751 documents in ten topic categories. Both qualitative and quantitative analyses of the results show that our proposed supervised topic detection performs best, with an accuracy of 88%, when SVM is used with state-of-the-art TF-IDF word features, the Synthetic Minority Over-sampling Technique (SMOTE), and no stemming. The results show that text features with stemming slightly improve the performance of the topic classifier over features with no stemming.
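For readers unfamiliar with TF-IDF features, the sketch below computes them from scratch for a toy corpus. It uses one common textbook variant (relative term frequency multiplied by the natural log of N over document frequency); library implementations typically add smoothing and normalization, and the abstract does not say which variant the authors used. The tokens are English placeholders, not Amharic data.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF weights per document for a list of tokenized documents.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


# Hypothetical tokenized documents.
docs = [
    ["market", "price", "rise"],
    ["team", "win", "match"],
    ["market", "team", "news"],
]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that occur in fewer documents ("price") receive a higher weight than terms shared across documents ("market"), which is what makes the features useful to a downstream classifier such as an SVM.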
