Transfer learning for topic labeling: Analysis of the UK House of Commons speeches 1935–2014

2021
Vol 8 (2)
pp. 205316802110222
Author(s):
Hannah Béchara
Alexander Herzog
Slava Jankin
Peter John

Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models require the additional step of attaching meaningful labels to estimated topics, a process that is not scalable, suffers from human bias, and is difficult to replicate. We present a transfer topic labeling method that seeks to remedy these problems, using domain-specific codebooks as the knowledge base to automatically label estimated topics. We demonstrate our approach with a large-scale topic model analysis of the complete corpus of UK House of Commons speeches from 1935 to 2014, using the coding instructions of the Comparative Agendas Project to label topics. We evaluated our results against human expert coding and compared our approach with current state-of-the-art neural methods. Our approach was simple to implement, compared favorably to expert judgments, and outperformed the neural network model for a majority of the topics we estimated.
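The labeling step lends itself to a compact sketch: if each Comparative Agendas category comes with coding instructions, a topic can be assigned the label of its most similar codebook entry. The sketch below uses TF-IDF cosine similarity; the codebook snippets and topic word lists are invented placeholders, and the paper's actual similarity measure may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical codebook entries (label -> coding-instruction text).
codebook = {
    "Health": "health care insurance hospitals medical treatment services",
    "Defence": "armed forces military equipment weapons procurement",
}
# Hypothetical estimated topics, represented by their top words.
topics = {
    0: "nhs hospital patients doctors waiting treatment",
    1: "army navy troops weapons nato deployment",
}

# Fit a shared TF-IDF vocabulary over codebook descriptions and topic words.
corpus = list(codebook.values()) + list(topics.values())
X = TfidfVectorizer().fit_transform(corpus)
code_vecs, topic_vecs = X[:len(codebook)], X[len(codebook):]

labels = list(codebook.keys())
for topic_id, row in zip(topics, cosine_similarity(topic_vecs, code_vecs)):
    print(topic_id, "->", labels[row.argmax()])  # most similar codebook label
```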

Author(s):
Zhuang Liu
Degen Huang
Kaiyu Huang
Zhuang Li
Jun Zhao

There is growing interest in financial text mining. Over the past few years, progress in Natural Language Processing (NLP) based on deep learning has been rapid, and deep learning has shown promising results on financial text mining tasks. However, because NLP models require large amounts of labeled training data, applying deep learning to financial text mining often fails due to the scarcity of labeled data in the financial domain. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. Unlike BERT, FinBERT is trained simultaneously on general and financial-domain corpora with six pre-training tasks that cover a broader range of knowledge, enabling the model to better capture language knowledge and semantic information. The results show that FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.
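Using such a domain-pretrained model downstream follows the standard Hugging Face transformers recipe. A minimal sketch, assuming the weights were released as a checkpoint: the identifier below is a placeholder (the paper's official release may use a different name), and the three-class head is an assumed sentiment-style setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "finbert-pretrained"  # hypothetical model id, not an official release name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Classify a single financial sentence (label meanings are task-specific).
inputs = tokenizer("Quarterly revenue beat analyst expectations.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)
```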


Author(s):
Moritz Osnabrügge
Sara B. Hobolt
Toni Rodon

Research has shown that emotions matter in politics, but we know less about when and why politicians use emotive rhetoric in the legislative arena. This article argues that emotive rhetoric is one of the tools politicians can use strategically to appeal to voters. Consequently, we expect that legislators are more likely to use emotive rhetoric in debates that have a large general audience. Our analysis covers two million parliamentary speeches delivered in the UK House of Commons and the Irish Parliament. We use a dictionary-based method to measure emotive rhetoric, combining the Affective Norms for English Words dictionary with word-embedding techniques to create a domain-specific dictionary. We show that emotive rhetoric is more pronounced in high-profile legislative debates, such as Prime Minister’s Questions. These findings contribute to the study of legislative speech and political representation by suggesting that emotive rhetoric is used by legislators to appeal directly to voters.
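The dictionary-expansion step can be sketched compactly: seed words from the affective lexicon are extended with their nearest neighbours in embeddings trained on the target domain. The path, seed words, and neighbour count below are illustrative assumptions, not the paper's settings.

```python
from gensim.models import KeyedVectors

# Hypothetical word embeddings trained on parliamentary speeches.
vectors = KeyedVectors.load("speech_embeddings.kv")
seed_words = ["fear", "hope", "anger", "pride"]  # ANEW-style seeds

# Expand each seed with its nearest neighbours in the domain space.
emotive = set(seed_words)
for w in seed_words:
    if w in vectors:
        emotive.update(word for word, _ in vectors.most_similar(w, topn=10))

def emotive_share(tokens):
    """Fraction of a speech's tokens found in the expanded emotive dictionary."""
    return sum(t in emotive for t in tokens) / max(len(tokens), 1)
```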


Author(s):
Arkadipta De
Dibyanayan Bandyopadhyay
Baban Gain
Asif Ekbal

Fake news classification is one of the most interesting problems attracting the attention of researchers in artificial intelligence, natural language processing, and machine learning (ML). Most current work on fake news detection targets the English language, which limits its usability outside the English-literate population. Although multilingual web content has grown, fake news classification in low-resource languages remains a challenge due to the unavailability of annotated corpora and tools. This article proposes an effective neural model based on multilingual Bidirectional Encoder Representations from Transformers (BERT) for domain-agnostic multilingual fake news classification. A wide variety of experiments, including language-specific and domain-specific settings, is conducted. The proposed model achieves high accuracy in domain-specific and domain-agnostic experiments and outperforms the current state-of-the-art models. We perform experiments in zero-shot settings to assess the effectiveness of language-agnostic feature transfer across different languages, with encouraging results. Cross-domain transfer experiments are also performed to assess the model's language-independent feature transfer. We also offer a multilingual multidomain fake news detection dataset covering five languages and seven domains that could be useful for research and development in resource-scarce scenarios.
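Fine-tuning multilingual BERT for this kind of binary classification follows the standard transformers recipe. The sketch below shows one training step on toy data; the texts and labels are placeholders, and the paper's full model may add components on top of the encoder.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # real / fake

texts = ["Ejemplo de titular dudoso", "Une dépêche vérifiée"]  # placeholder data
labels = torch.tensor([1, 0])

enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One gradient step; a real run loops over batches and epochs.
model.train()
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()
```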


Author(s):
Ngoc Tan Le
Fatiha Sadat

With the emergence of neural network-based approaches, research on information extraction has benefited from large-scale raw texts, leveraging them via pre-trained embeddings and other data augmentation techniques to deal with challenges in Natural Language Processing tasks. In this paper, we propose an approach using sequence-to-sequence neural network models for term extraction in low-resource domains. Our empirical experiments, evaluated on the multilingual ACTER dataset provided in the LREC-TermEval 2020 shared task on automatic term extraction, demonstrate the efficiency of the deep learning approach for automatic term extraction in low-data settings.
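Term extraction of this kind is commonly cast as sequence labeling: the model reads a token sequence and emits a BIO tag per token. The sketch below is a small BiLSTM tagger in that spirit, assuming integer-encoded tokens; it is an illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TermTagger(nn.Module):
    """Maps each input token to a BIO tag (B-Term / I-Term / O)."""
    def __init__(self, vocab_size, n_tags=3, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):
        hidden_states, _ = self.encoder(self.embed(token_ids))
        return self.out(hidden_states)  # (batch, seq_len, n_tags)

model = TermTagger(vocab_size=5000)
tags = model(torch.randint(0, 5000, (2, 12))).argmax(-1)  # toy batch of 2 sentences
```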


2021
Vol 25 (1)
pp. 205-223
Author(s):
Jin He
Lei Li
Yan Wang
Xindong Wu

With the prevalence of online review websites, large-scale data make focused analysis necessary. This task aims to capture the information that is highly relevant to a specific aspect. However, the broad scope of aspects across various products makes the task challenging. A commonly used solution is to augment topic models with additional information so that they capture the features of a specific aspect (referred to as a targeted aspect). However, existing topic models either perform a full analysis to capture as many features as possible or estimate similarity to capture features that are as coherent as possible; both overlook the fine-grained semantic relations between features, leaving the captured features coarse and confusing. In this paper, we propose a novel Hierarchical Features-based Topic Model (HFTM) to extract targeted aspects from online reviews and then capture aspect-specific features. Specifically, our model can capture not only the direct features posing target-to-feature semantics but also the latent features posing feature-to-feature semantics. Experiments conducted on real-world datasets demonstrate that HFTM outperforms state-of-the-art baselines in terms of both aspect extraction and document classification.
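HFTM itself is not specified here, but the baseline it improves on is easy to sketch: a plain LDA over review text surfaces candidate aspect features, which hierarchical models then refine with target-to-feature and feature-to-feature relations. The toy reviews below are invented.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized reviews; real input would be preprocessed review sentences.
reviews = [["battery", "lasts", "long"],
           ["screen", "bright", "battery", "drains"]]
dictionary = corpora.Dictionary(reviews)
bow = [dictionary.doc2bow(doc) for doc in reviews]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
for topic_id in range(2):
    print(lda.show_topic(topic_id, topn=3))  # top candidate features per aspect
```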


2021
Vol 15 (4)
pp. e0008755
Author(s):
David C. Molik
DeAndre Tomlinson
Shane Davitt
Eric L. Morgan
Matthew Sisk
...

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to topic-model, de novo, sets of metagenetic research articles on varied subjects that either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined from the model coherence score, and articles were assigned to the resulting topics via a Machine Learning approach with a Random Forest algorithm. Our analysis supports a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach (searching single-locus metagenetic data, gathering the papers connected to the datasets, determining topics and their number de novo, and assigning articles to those topics) illustrates how such a pipeline can harness large-scale datasets that are published and available but not necessarily fully analyzed, or whose metadata are not harmonized with other studies. It can be applied to a variety of systems to surface potential evidence of environmental associations.
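Two quantitative steps of this pipeline translate directly into code: selecting the number of topics by coherence, then representing each article by its topic proportions for a Random Forest. The toy tokens and labels below are placeholders for the tokenized article texts.

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel
from sklearn.ensemble import RandomForestClassifier

docs = [["soil", "fungus", "wood"], ["rna", "amplicon", "soil"]]  # toy tokens
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Pick the topic count that maximizes the coherence score.
best_k, best_score = 2, float("-inf")
for k in range(2, 6):
    lda = LdaModel(bow, num_topics=k, id2word=dictionary, random_state=0)
    score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:
        best_k, best_score = k, score

# Represent each article by its topic proportions, then classify.
lda = LdaModel(bow, num_topics=best_k, id2word=dictionary, random_state=0)
features = [[prob for _, prob in lda.get_document_topics(d, minimum_probability=0)]
            for d in bow]
clf = RandomForestClassifier(random_state=0).fit(features, [0, 1])  # toy labels
```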


AERA Open
2021
Vol 7
pp. 233285842110456
Author(s):
Joshua Littenberg-Tobias
Elizabeth Borneman
Justin Reich

Diversity, equity, and inclusion (DEI) issues are urgent in education. We developed and evaluated a massive open online course (N = 963) with embedded equity simulations that attempted to equip educators with equity teaching practices. Applying a structural topic model (STM), a type of natural language processing (NLP), we examined how participants with different equity attitudes responded in simulations. Over a sequence of four simulations, the simulation behavior of participants with less equitable beliefs converged to become more similar to that of participants with more equitable beliefs (ES [effect size] = 1.08 SD). This finding was corroborated by overall changes in equity mindsets (ES = 0.88 SD) and changes in self-reported equity-promoting practices (ES = 0.32 SD). Digital simulations combined with NLP offer a compelling approach to both teaching about DEI topics and formatively assessing learner behavior in large-scale learning environments.
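The effect sizes above are standardized mean differences reported in SD units. A minimal sketch of that convention follows (a Cohen's d-style statistic; the paper may compute it slightly differently), with toy scores standing in for participant data.

```python
import numpy as np

def effect_size_sd(group_a, group_b):
    """Mean difference divided by the pooled standard deviation (Cohen's d style)."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy post-course vs. pre-course scores.
print(effect_size_sd([3.2, 3.8, 4.1], [2.1, 2.6, 2.9]))
```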


Author(s):
Juntao Li
Ruidan He
Hai Ye
Hwee Tou Ng
Lidong Bing
...

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements on various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing, and they outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.
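The decomposition idea can be sketched in a few lines of PyTorch: two heads split a pooled cross-lingual representation into domain-invariant and domain-specific parts. The paper drives the split with mutual information estimation; the domain classifier below is a cruder stand-in, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    """Splits a pretrained representation into invariant and specific parts."""
    def __init__(self, dim=768, n_domains=2):
        super().__init__()
        self.invariant = nn.Linear(dim, dim // 2)  # domain-invariant features
        self.specific = nn.Linear(dim, dim // 2)   # domain-specific features
        self.domain_clf = nn.Linear(dim // 2, n_domains)

    def forward(self, h):
        z_inv, z_spec = self.invariant(h), self.specific(h)
        return z_inv, z_spec, self.domain_clf(z_spec)

h = torch.randn(4, 768)  # e.g., pooled outputs of a cross-lingual encoder
z_inv, z_spec, domain_logits = Decomposer()(h)
# Training would reward domain prediction from z_spec and penalize any
# domain signal (or mutual information) remaining in z_inv.
loss = nn.CrossEntropyLoss()(domain_logits, torch.tensor([0, 1, 0, 1]))
```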


2014
Vol 2 (1)
Author(s):
Justin Reich
Dustin Tingley
Jetson Leder-Luis
Margaret E. Roberts
Brandon Stewart

Dealing with the vast quantities of text that students generate in Massive Open Online Courses (MOOCs) and other large-scale online learning environments is a daunting challenge. Computational tools are needed to help instructional teams uncover themes and patterns as students write in forums, assignments, and surveys. This paper introduces to the learning analytics community the Structural Topic Model, an approach to language processing that can 1) find syntactic patterns with semantic meaning in unstructured text, 2) identify variation in those patterns across covariates, and 3) uncover archetypal texts that exemplify the documents within a topical pattern. We show examples of computationally aided discovery and reading in three MOOC settings: mapping students’ self-reported motivations, identifying themes in discussion forums, and uncovering patterns of feedback in course evaluations. 
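STM itself is implemented in R's stm package; a rough Python approximation of its covariate analysis is to fit a plain topic model and regress each topic's proportion on document covariates. The forum posts and covariate below are placeholders.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

posts = ["I joined to improve my teaching",
         "mostly here for the certificate"]          # toy forum posts
covariates = np.array([[0], [1]])                    # e.g., educator indicator

X = CountVectorizer().fit_transform(posts)
theta = LatentDirichletAllocation(n_components=2,
                                  random_state=0).fit_transform(X)

# How does each topic's prevalence vary with the covariate?
for k in range(theta.shape[1]):
    reg = LinearRegression().fit(covariates, theta[:, k])
    print(f"topic {k}: covariate effect {reg.coef_[0]:+.3f}")
```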


Author(s):
Chris Perriam
Darren Waldron

This book advances the current state of film audience research and of our knowledge of sexuality in transnational contexts by analysing how French LGBTQ films are seen in Spain and Spanish ones in France, as well as how these films are seen in the UK. It studies films from various genres, examines their reception across four languages (Spanish, French, Catalan, English), and engages with participants across a range of digital and physical audience locations. A focus on LGBTQ festivals and on issues relating to LGBTQ experience in both countries allows for the consideration of topics such as ageing, sense of community and isolation, affiliation and investment, and the representation of issues affecting trans people. The book examines films that chronicle local, national and sub-national identities while also addressing foreign audiences. It draws on a large sample of individual responses gathered through post-screening questionnaires and focus groups, as well as on the work of professional film critics and online commentators.

