Analyzing Scientific Publications using Domain-Specific Word Embedding and Topic Modelling

Author(s):  
Trisha Singhal ◽  
Junhua Liu ◽  
Lucienne T. M. Blessing ◽  
Kwan Hui Lim
Author(s):  
Moritz Osnabrügge ◽  
Sara B. Hobolt ◽  
Toni Rodon

Research has shown that emotions matter in politics, but we know less about when and why politicians use emotive rhetoric in the legislative arena. This article argues that emotive rhetoric is one of the tools politicians can use strategically to appeal to voters. Consequently, we expect that legislators are more likely to use emotive rhetoric in debates that have a large general audience. Our analysis covers two million parliamentary speeches held in the UK House of Commons and the Irish Parliament. We use a dictionary-based method to measure emotive rhetoric, combining the Affective Norms for English Words dictionary with word-embedding techniques to create a domain-specific dictionary. We show that emotive rhetoric is more pronounced in high-profile legislative debates, such as Prime Minister’s Questions. These findings contribute to the study of legislative speech and political representation by suggesting that emotive rhetoric is used by legislators to appeal directly to voters.
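The dictionary-based measurement step can be sketched in a few lines of Python. The word list, valence values, and threshold below are invented for the example; they are not the authors' actual ANEW-derived, embedding-expanded dictionary.

```python
# Sketch of dictionary-based emotive scoring (illustrative values only,
# not the authors' actual ANEW-derived, embedding-expanded dictionary).

# Toy emotive dictionary: word -> valence on a 1-9 scale (5 = neutral).
EMOTIVE_DICT = {
    "crisis": 2.3, "failure": 2.1, "hope": 7.8,
    "pride": 7.5, "fear": 2.0, "victory": 8.1,
}

def emotive_score(speech: str, threshold: float = 2.0) -> float:
    """Share of tokens whose valence deviates from neutral (5.0)
    by more than `threshold`; out-of-dictionary tokens count as
    non-emotive."""
    tokens = speech.lower().split()
    if not tokens:
        return 0.0
    emotive = sum(
        1 for t in tokens
        if abs(EMOTIVE_DICT.get(t, 5.0) - 5.0) > threshold
    )
    return emotive / len(tokens)

print(emotive_score("we face a crisis but there is hope"))  # 2 of 8 tokens -> 0.25
```

A speech-level score like this can then be compared across debate types (e.g. Prime Minister's Questions versus ordinary debates).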


Author(s):  
Sabrina Tiun ◽  
Nor Fariza Mohd Nor ◽  
Azhar Jalaludin ◽  
Anis Nadiah Che Abdul Rahman

2019 ◽  
Vol 1 (9) ◽  
pp. 24
Author(s):  
Valdone Indrasiene ◽  
Violeta Jegeleviciene ◽  
Odeta Merfeldaitė ◽  
Daiva Penkauskiene ◽  
Jolanta Pivoriene ◽  
...  

The article discusses the construction of the concept of critical thinking in higher education and how it changed in scientific publications between 1993 and 2017. Based on a systematic literature review, the following research questions are raised: how does the construction of the critical-thinking concept in the context of higher education change over time? How are personal, interpersonal, and social aspects expressed in the concept of critical thinking in the context of higher education? The systematic literature review revealed a significant growth of publications starting from 1998. It also disclosed a slight change in treating critical thinking as a purely general or a domain-specific competence. The authors of the reviewed articles do not draw a clear division between critical thinking as a general competence and as a domain-specific one. Researchers in different fields tend to associate critical thinking with the development of a person’s cognitive and intellectual capacities, including skills and attitudes. However, some authors also reveal interpersonal and social aspects of critical thinking. Few publications favour such a comprehensive approach, but there is still hope that critical thinking will be treated and nurtured as a personal, interpersonal, and social competence.


Author(s):  
Sabrina Tiun ◽  
Saidah Saad ◽  
Nor Fariza Mohd Noor ◽  
Azhar Jalaludin ◽  
Anis Nadiah Che Abdul Rahman

PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0243208
Author(s):  
Leacky Muchene ◽  
Wende Safari

Unsupervised statistical analysis of unstructured data has gained wide acceptance, especially in the natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal and biomedical documents as well as journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To present the topics more succinctly, in the second stage, hierarchical clustering with Hellinger distance was applied to derive the final clusters of topics. The analysis showed that the dominant research themes at the university include HIV and malaria research, research on agricultural and veterinary services, as well as cross-cutting themes in the humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.
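The Hellinger distance at the heart of the second stage is easy to sketch. The per-document topic proportions below are toy values standing in for LDA output; the full method would feed the pairwise distances into hierarchical clustering.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability
    distributions (e.g. per-document topic proportions from LDA).
    Ranges from 0 (identical) to 1 (disjoint support)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

# Toy per-document topic proportions from a 3-topic LDA model.
doc_a = [0.8, 0.1, 0.1]
doc_b = [0.7, 0.2, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(hellinger(doc_a, doc_b))  # small: similar topic mixes
print(hellinger(doc_a, doc_c))  # large: different dominant topics
```

Because the Hellinger distance is a proper metric on probability distributions, it is a natural choice for clustering LDA topic vectors, where Euclidean distance can be misleading.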


Author(s):  
Filippo Chiarello ◽  
Nicola Melluso ◽  
Andrea Bonaccorsi ◽  
Gualtiero Fantoni

Abstract: The Engineering Design field is growing fast, and so is the number of sub-fields bringing value to researchers working in this context. From psychology to neuroscience, from mathematics to machine learning, scholars and practitioners produce new knowledge of potential interest to designers every day. This complicates the task of researchers who want to quickly and easily find literature on a specific topic among a large number of scientific publications, or to effectively position new research. In the present paper, we address this problem by applying state-of-the-art text mining techniques to a large corpus of Engineering Design-related documents. In particular, a topic modelling technique is applied to all the papers published in the ICED proceedings from 2003 to 2017 (3,129 documents) in order to find the main subtopics of Engineering Design. Finally, we analyze the trends of these topics over time to give a bird's-eye view of how the Engineering Design field is evolving. The results offer a clear, bottom-up picture of what Engineering Design is and how the interest of researchers in different topics has changed over time.
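Assuming per-document topic proportions and publication years are available, the trend analysis can be approximated by averaging topic shares per year. The documents and values below are illustrative, not taken from the ICED corpus.

```python
from collections import defaultdict

# Toy per-document topic proportions for a 2-topic model, keyed by
# publication year (illustrative; the paper derives these from the
# ICED proceedings 2003-2017).
docs = [
    (2003, [0.7, 0.3]),
    (2003, [0.5, 0.5]),
    (2017, [0.2, 0.8]),
    (2017, [0.4, 0.6]),
]

def topic_trends(docs):
    """Mean topic proportion per year: a simple way to track how
    researcher interest in each topic evolves over time."""
    by_year = defaultdict(list)
    for year, theta in docs:
        by_year[year].append(theta)
    return {
        year: [sum(col) / len(col) for col in zip(*thetas)]
        for year, thetas in sorted(by_year.items())
    }

print(topic_trends(docs))
```

Plotting these yearly means per topic gives the bird's-eye view of the field's evolution described above.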


2020 ◽  
Vol 10 (7) ◽  
pp. 2221 ◽  
Author(s):  
Jurgita Kapočiūtė-Dzikienė

Accurate generative chatbots are usually trained on large datasets of question–answer pairs. Although such datasets do not exist for some languages, this does not reduce the need for companies to offer chatbot technology on their websites. However, companies usually own small domain-specific datasets (at least in the form of an FAQ) about their products, services, or the technologies they use. In this research, we seek effective solutions for creating generative seq2seq-based chatbots from very small data. Since the experiments are carried out in English and the morphologically complex Lithuanian language, we have an opportunity to compare results for languages with very different characteristics. We experimentally explore three encoder–decoder LSTM-based approaches (simple LSTM, stacked LSTM, and BiLSTM), three word embedding types (one-hot encoding, fastText, and BERT embeddings), and five encoder–decoder architectures based on different encoder and decoder vectorization units. Furthermore, all of the proposed approaches are applied to datasets pre-processed with punctuation either removed or separated. The experimental investigation revealed the advantages of the stacked LSTM and BiLSTM encoder architectures and of BERT embedding vectorization (especially for the encoder). The best BLEU scores achieved on the English/Lithuanian datasets with removed and separated punctuation were ~0.513/~0.505 and ~0.488/~0.439, respectively. Better results were achieved for English, because generating the different inflection forms of the morphologically complex Lithuanian language is a harder task. The BLEU scores fell into the range defining the quality of the generated answers as good or very good for both languages. This research was performed with very small datasets having little variety in covered topics, which makes it not only more difficult but also more interesting. Moreover, to our knowledge, it is the first attempt to train generative chatbots for a morphologically complex language.
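The BLEU metric used to score generated answers can be sketched as below. This is a deliberately simplified sentence-level variant (n-grams up to 2, no smoothing), not necessarily the exact scorer used in the paper, which typically would use 4-grams with smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (up to max_n) times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(ngrams(cand, n))
        ref_ngrams = Counter(ngrams(ref, n))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("how can i reset my password",
                    "how can i reset my password"))  # 1.0
```

A chatbot answer scoring ~0.5 on such a scale, as reported above, shares roughly half of its n-grams with the reference answer.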


2021 ◽  
Author(s):  
Alejandro Garcia-Rudolph ◽  
Blanca Cegarra ◽  
Joan Sauri ◽  
John D. Kelleher ◽  
Katryna Cisek ◽  
...  

BACKGROUND Topic modeling and word embedding studies of Twitter data related to COVID-19 have been extensively reported. Another social media platform that experienced a tremendous increase in new users and posts due to COVID-19 was Reddit, offering a much less explored alternative, especially the submissions' titles, due to their format (≤ 300 characters) and content rules. The positivity of self-presentation on social media influences both the quantity and quality of reactions (upvotes) from other social media contacts. OBJECTIVE 1) To expand on the concept of resilience by identifying possibly related topics, considering their number of upvotes and their closest terms, and 2) to associate specific emotions obtained from the state-of-the-art literature with their closest terms, in order to relate such emotions to experienced situations. METHODS Reddit data were collected from pushshift.io with the pushshiftr R package; data cleaning and preprocessing were performed using the quanteda, tidyverse, and tidytext R packages. A word2vec model (W2V) was trained using submissions' titles; preliminary validation was performed using a subset of Mikolov's analogies and a COVID-19 glossary. The W2V model was trained with the wordVectors R package. Main topics (represented as sets of words), using the number of upvotes as a covariate, were extracted using structural topic modelling (STM) with the spectral method in the stm R package. Topic validation was performed using semantic coherence and exclusivity. Clusters were assessed using the Dunn index. RESULTS We collected all 374,421 titles submitted by 104,351 different redditors to the r/Coronavirus subreddit between January 20, 2020 and May 14, 2021. We trained W2V and identified more than 20 valid analogies (e.g. doctor – hospital + teacher = school). We further validated W2V with representative terms extracted from a COVID-19 glossary; all closest terms retrieved by W2V were verified using state-of-the-art publications.
STM retrieved 20 topics (with 20 words each) ordered by their number of upvotes. We ran W2V on a representative topic (addressing vaccines) and used two terms as seeds, leading to other related terms (represented using cluster analysis) that we validated using scientific publications. STM did not retrieve any topic containing the term "resilience", which hardly appeared (in less than 0.02% of all titles). Nevertheless, we identified several closest terms (e.g. wellbeing, roadmap) and combined terms (e.g. resilience and elderly, resilience and indigenous), as well as specific emotions that W2V related to lived experiences (e.g. the emotion of gratitude associated with applause and balconies). CONCLUSIONS We applied for the first time the combination of STM and a word2vec model trained on a relatively small Coronavirus dataset of Reddit titles, leading to immediate and accurate terms that can be used to expand our knowledge of topics associated with the pandemic (e.g. vaccines) or of specific aspects such as resilience.
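The analogy check (doctor – hospital + teacher = school) amounts to vector arithmetic plus a nearest-neighbour search by cosine similarity. The 2-dimensional vectors below are hand-crafted toys standing in for a trained W2V model, chosen only so the example analogy resolves as in the abstract.

```python
import math

# Hand-crafted toy embeddings (dims: roughly "medical" vs "educational").
# A real word2vec model learns hundreds of dimensions from the corpus.
vecs = {
    "doctor":   [0.9, 0.1],
    "hospital": [0.8, 0.2],
    "teacher":  [0.1, 0.9],
    "school":   [0.2, 0.8],
    "vaccine":  [0.95, 0.05],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the query
    words themselves (the usual 3CosAdd analogy rule)."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("doctor", "hospital", "teacher"))  # school
```

The same nearest-neighbour machinery, seeded with terms such as "resilience", yields the closest-term lists discussed in the results.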


2021 ◽  
Vol 11 (13) ◽  
pp. 6007
Author(s):  
Muzamil Hussain Syed ◽  
Sun-Tae Chung

Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer learning utilizing contextualized word embeddings from pre-trained language models has shown remarkable results for many NLP tasks, including named-entity recognition (NER). BERT (Bidirectional Encoder Representations from Transformers) is gaining prominent attention among contextualized word embedding models as a state-of-the-art pre-trained language model. It is quite expensive to train a BERT model from scratch for a new application domain, since doing so needs a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online user reviews of restaurants and propose a simple but effective approach to the NER task in a new domain where a large dataset is rarely available or difficult to prepare, such as the food menu domain, based on a domain adaptation technique for word embeddings and fine-tuning of the popular NER network model Bi-LSTM+CRF with extended feature vectors. The proposed NER approach (named 'MenuNER') consists of a two-step process: (1) domain adaptation for the target domain: further pre-training of the off-the-shelf BERT language model (BERT-base) in a semi-supervised fashion on a domain-specific dataset; and (2) supervised fine-tuning of the popular Bi-LSTM+CRF network for the downstream task with extended feature vectors obtained by concatenating the word embedding from the domain-adapted pre-trained BERT model of the first step, a character embedding, and POS-tag feature information. Experimental results on a handcrafted food menu corpus from a customer review dataset show that our proposed approach for the domain-specific NER task, food menu named-entity recognition, performs significantly better than one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves a 92.5% F1 score on the YELP dataset for the MenuNER task.
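The extended feature vector of step (2) is a plain concatenation of the three feature sources. The sketch below uses invented dimensions (a 4-d "BERT" embedding, a 2-d character embedding, a 4-tag POS inventory) purely for illustration of the shape of the input fed to the Bi-LSTM+CRF network.

```python
# Sketch of the extended per-token feature vector fed to Bi-LSTM+CRF:
# word embedding (from the domain-adapted BERT) + character-level
# embedding + one-hot POS tag. All dimensions here are illustrative.

POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]

def one_hot_pos(tag):
    """One-hot encode a POS tag over a fixed tag inventory."""
    return [1.0 if t == tag else 0.0 for t in POS_TAGS]

def extended_feature(word_emb, char_emb, pos_tag):
    """Concatenate the three feature sources into a single vector."""
    return list(word_emb) + list(char_emb) + one_hot_pos(pos_tag)

# Toy per-token features: 4-d "BERT" embedding, 2-d char embedding.
feat = extended_feature([0.2, -0.1, 0.4, 0.0], [0.5, 0.5], "NOUN")
print(len(feat))  # 4 + 2 + 4 = 10
```

In the real model the BERT embedding alone is hundreds of dimensions; the concatenation simply widens the per-token input of the Bi-LSTM layer.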

