QUANTIFYING SEMANTIC SHIFT VISUALLY ON A MALAY DOMAIN-SPECIFIC CORPUS USING TEMPORAL WORD EMBEDDING APPROACH

Author(s):  
Sabrina Tiun ◽  
Saidah Saad ◽  
Nor Fariza Mohd Noor ◽  
Azhar Jalaludin ◽  
Anis Nadiah Che Abdul Rahman
Author(s):  
MORITZ OSNABRÜGGE ◽  
SARA B. HOBOLT ◽  
TONI RODON

Research has shown that emotions matter in politics, but we know less about when and why politicians use emotive rhetoric in the legislative arena. This article argues that emotive rhetoric is one of the tools politicians can use strategically to appeal to voters. Consequently, we expect that legislators are more likely to use emotive rhetoric in debates that have a large general audience. Our analysis covers two million parliamentary speeches held in the UK House of Commons and the Irish Parliament. We use a dictionary-based method to measure emotive rhetoric, combining the Affective Norms for English Words dictionary with word-embedding techniques to create a domain-specific dictionary. We show that emotive rhetoric is more pronounced in high-profile legislative debates, such as Prime Minister’s Questions. These findings contribute to the study of legislative speech and political representation by suggesting that emotive rhetoric is used by legislators to appeal directly to voters.
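The dictionary-based measurement described above can be illustrated with a minimal sketch: score each speech by the emotive weight of the dictionary words it contains. The word scores below are invented placeholders, not actual ANEW norms or embedding-expanded entries, and real speeches would be tokenized far more carefully.

```python
# Toy domain-specific emotive dictionary; values are illustrative only.
EMOTIVE_SCORES = {
    "crisis": 0.9, "betrayal": 0.95, "hope": 0.8,
    "fear": 0.9, "budget": 0.1, "procedure": 0.05,
}

def emotiveness(speech: str) -> float:
    """Mean emotive score over the dictionary words found in the speech."""
    tokens = speech.lower().split()
    hits = [EMOTIVE_SCORES[t] for t in tokens if t in EMOTIVE_SCORES]
    return sum(hits) / len(hits) if hits else 0.0

# Averages the scores of "budget", "betrayal" and "hope".
score = emotiveness("this budget is a betrayal of hope")
```

Comparing such scores across debate types (e.g., Prime Minister's Questions versus routine business) is the kind of aggregation the study performs at scale.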



2020 ◽  
Vol 10 (7) ◽  
pp. 2221 ◽  
Author(s):  
Jurgita Kapočiūtė-Dzikienė

Accurate generative chatbots are usually trained on large datasets of question–answer pairs. Although such datasets do not exist for some languages, this does not reduce companies' need for chatbot technology on their websites. However, companies usually own small domain-specific datasets (at least in the form of an FAQ) about their products, services, or technologies. In this research, we seek effective solutions for creating generative seq2seq-based chatbots from very small data. Since the experiments are carried out in English and in the morphologically complex Lithuanian language, we have an opportunity to compare results for languages with very different characteristics. We experimentally explore three encoder–decoder LSTM-based approaches (simple LSTM, stacked LSTM, and BiLSTM), three word embedding types (one-hot encoding, fastText, and BERT embeddings), and five encoder–decoder architectures based on different encoder and decoder vectorization units. Furthermore, all proposed approaches are applied to datasets pre-processed with removed and with separated punctuation. The experimental investigation revealed the advantages of the stacked LSTM and BiLSTM encoder architectures and of BERT embedding vectorization (especially for the encoder). The best BLEU scores achieved on the English/Lithuanian datasets with removed and separated punctuation were ~0.513/~0.505 and ~0.488/~0.439, respectively. Better results were achieved for English, because generating the different inflectional forms of the morphologically complex Lithuanian language is a harder task. The BLEU scores fell into the range defining the quality of the generated answers as good or very good for both languages. This research was performed with very small datasets that had little variety in the covered topics, which makes it not only more difficult, but also more interesting. Moreover, to our knowledge, it is the first attempt to train generative chatbots for a morphologically complex language.
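The two pre-processing variants compared above, removed versus separated punctuation, can be sketched as follows. The punctuation set and whitespace handling are simplified assumptions, not the authors' exact pipeline.

```python
import re

def separate_punct(text: str) -> str:
    """'Separated punctuation' variant: split punctuation into standalone tokens."""
    return " ".join(re.sub(r"([?.!,;:])", r" \1 ", text).split())

def remove_punct(text: str) -> str:
    """'Removed punctuation' variant: drop punctuation entirely."""
    return " ".join(re.sub(r"[?.!,;:]", " ", text).split())
```

For example, `separate_punct("What is BERT?")` keeps the question mark as its own token, while `remove_punct` discards it; the study trains and evaluates models on both variants.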


2021 ◽  
Vol 11 (13) ◽  
pp. 6007
Author(s):  
Muzamil Hussain Syed ◽  
Sun-Tae Chung

Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer learning utilizing contextualized word embeddings from pre-trained language models has shown remarkable results for many NLP tasks, including named-entity recognition (NER). BERT (Bidirectional Encoder Representations from Transformers) has gained prominent attention among contextualized word embedding models as a state-of-the-art pre-trained language model. However, it is quite expensive to train a BERT model from scratch for a new application domain, since doing so requires a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online restaurant user reviews and propose a simple but effective approach to NER in a new domain where a large dataset is rarely available or difficult to prepare, such as the food menu domain. The approach is based on domain adaptation of the word embedding and fine-tuning of the popular Bi-LSTM+CRF NER network with extended feature vectors. The proposed NER approach (named 'MenuNER') consists of two steps: (1) domain adaptation to the target domain, i.e., further pre-training the off-the-shelf BERT language model (BERT-base) in a semi-supervised fashion on a domain-specific dataset; and (2) supervised fine-tuning of the Bi-LSTM+CRF network for the downstream task with extended feature vectors obtained by concatenating the word embedding from the domain-adapted BERT model of the first step with a character embedding and POS-tag feature information. Experimental results on a handcrafted food-menu corpus drawn from customer reviews show that the proposed approach to the domain-specific NER task, food-menu named-entity recognition, performs significantly better than one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves a 92.5% F1 score on the YELP dataset for the MenuNER task.
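The extended feature vector of the second step, the concatenation of a BERT word embedding, a character embedding, and a POS-tag feature, can be sketched minimally. BERT-base's hidden size of 768 is standard; the character-embedding dimension (50) and POS-tag count (17) are illustrative assumptions, not values confirmed by the abstract.

```python
def extended_feature(bert_vec, char_vec, pos_id, n_pos=17):
    """Concatenate per-token features into one input vector for the Bi-LSTM+CRF.

    bert_vec: contextual embedding from the domain-adapted BERT (e.g. 768-dim).
    char_vec: character-level embedding of the token (e.g. 50-dim).
    pos_id:   index of the token's POS tag, encoded one-hot.
    """
    pos_onehot = [0.0] * n_pos
    pos_onehot[pos_id] = 1.0
    return list(bert_vec) + list(char_vec) + pos_onehot

# A token feature under these assumed dimensions: 768 + 50 + 17 = 835 values.
token_vec = extended_feature([0.0] * 768, [0.0] * 50, pos_id=3)
```

The sequence of such vectors, one per token, is what the Bi-LSTM+CRF layer would consume during fine-tuning.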


Information ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 10
Author(s):  
Christofer Meinecke ◽  
Ahmad Dawar Hakimi ◽  
Stefan Jänicke

Detecting references and similarities in music lyrics can be a difficult task. Crowdsourced knowledge platforms such as Genius can help in this process through user-annotated information about the artist and the song, but they fail to include visualizations that help users find similarities and structures on a higher, more abstract level. We propose a prototype that computes similarities between rap artists based on word embeddings of their lyrics crawled from Genius. Furthermore, the artists and their lyrics can be analyzed using an explorative visualization system that applies multiple visualization methods to support domain-specific tasks.
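One common way to turn lyric word embeddings into artist-level similarities, offered here as a plausible sketch rather than the authors' exact method, is to average each artist's word vectors into a profile vector and compare profiles with cosine similarity.

```python
import math

def artist_vector(word_vectors):
    """Average an artist's lyric word vectors into one profile vector."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

The resulting pairwise similarity matrix is the kind of structure an explorative visualization system can then project and link to the underlying lyrics.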


2019 ◽  
Vol 34 (Supplement_1) ◽  
pp. i110-i122
Author(s):  
Susan Leavy ◽  
Mark T Keane ◽  
Emilie Pine

Industrial Memories is a digital humanities initiative to supplement close readings of a government report with new distant readings, using text analytics techniques. The Ryan Report (2009), the official report of the Commission to Inquire into Child Abuse (CICA), details the systematic abuse of thousands of children from 1936 to 1999 in residential institutions run by religious orders and funded and overseen by the Irish State. Arguably, the sheer size of the Ryan Report—over 1 million words—warrants a new approach that blends close readings to witness its findings, with distant readings that help surface system-wide findings embedded in the Report. Although CICA has been lauded internationally for its work, many have critiqued the narrative form of the Ryan Report, for obfuscating key findings and providing poor systemic, statistical summaries that are crucial to evaluating the political and cultural context in which the abuse took place (Keenan, 2013, Child Sexual Abuse and the Catholic Church: Gender, Power, and Organizational Culture. Oxford University Press). In this article, we concentrate on describing the distant reading methodology we adopted, using machine learning and text-analytic methods, and report on what they surfaced from the Report. The contribution of this work is threefold: (i) it shows how text analytics can be used to surface new patterns, summaries and results that were not apparent via close reading, (ii) it demonstrates how machine learning can be used to annotate text by using word embedding to compile domain-specific semantic lexicons for feature extraction and (iii) it demonstrates how digital humanities methods can be applied to an official state inquiry with social justice impact.
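The lexicon-compilation idea in (ii), growing a small set of seed terms by pulling in their nearest neighbours in embedding space, can be sketched roughly as follows. The two-dimensional vectors and the similarity threshold are toy values for illustration; in the project, the vectors would come from an embedding model trained on the Report itself.

```python
import math

# Toy embedding space (2-D for readability; real embeddings are high-dimensional).
EMB = {
    "beaten":   [0.9, 0.1],
    "flogged":  [0.85, 0.15],
    "punished": [0.8, 0.2],
    "school":   [0.1, 0.9],
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def expand_lexicon(seeds, threshold=0.95):
    """Add any vocabulary word whose vector sits close to some seed word."""
    out = set(seeds)
    for word, vec in EMB.items():
        if any(cos(vec, EMB[s]) >= threshold for s in seeds if s in EMB):
            out.add(word)
    return out
```

Starting from the single seed "beaten", the sketch admits "flogged" and "punished" but rejects "school"; the expanded lexicon then serves as features for annotating passages of the Report.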

