A Transformer-Based Approach to Multilingual Fake News Detection in Low-Resource Languages

Author(s):  
Arkadipta De ◽  
Dibyanayan Bandyopadhyay ◽  
Baban Gain ◽  
Asif Ekbal

Fake news classification is a problem that has attracted considerable attention from researchers in artificial intelligence, natural language processing, and machine learning (ML). Most current work on fake news detection targets the English language, which limits its usability, especially outside the English-literate population. Although multilingual web content has grown, fake news classification in low-resource languages remains a challenge due to the unavailability of annotated corpora and tools. This article proposes an effective neural model based on multilingual Bidirectional Encoder Representations from Transformers (BERT) for domain-agnostic multilingual fake news classification. A wide variety of experiments, covering language-specific and domain-specific settings, are conducted. The proposed model achieves high accuracy in both domain-specific and domain-agnostic experiments and outperforms the current state-of-the-art models. We perform zero-shot experiments to assess the effectiveness of language-agnostic feature transfer across different languages, showing encouraging results. Cross-domain transfer experiments are also performed to assess the language-independent feature transfer of the model. We also offer a multilingual, multidomain fake news detection dataset covering five languages and seven domains that could be useful for research and development in resource-scarce scenarios.
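
As a concrete illustration of the kind of pipeline described above, here is a minimal sketch of fine-tuning multilingual BERT as a binary fake/real classifier with the HuggingFace Transformers library. The example text, label, and hyperparameters are placeholders, not the authors' configuration.

```python
# Minimal sketch: fine-tuning multilingual BERT for binary fake-news
# classification. The model name is real; the data and hyperparameters
# below are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # 0 = real, 1 = fake

texts = ["Example headline in any supported language ..."]  # placeholder
labels = torch.tensor([1])

enc = tokenizer(texts, padding=True, truncation=True, max_length=256,
                return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**enc, labels=labels)  # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
```

Because the encoder is shared across languages, the same fine-tuned head can be evaluated zero-shot on languages unseen during training, which is what the cross-lingual transfer experiments probe.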

Author(s):  
Zhuang Liu ◽  
Degen Huang ◽  
Kaiyu Huang ◽  
Zhuang Li ◽  
Jun Zhao

There is growing interest in financial text mining tasks. Over the past few years, deep-learning-based Natural Language Processing (NLP) has advanced rapidly and has shown promising results on financial text mining. However, because NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the scarcity of labeled data in the financial domain. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. Unlike BERT, FinBERT is trained with six pre-training tasks covering broader knowledge and is simultaneously trained on general and financial-domain corpora, enabling the model to better capture language knowledge and semantic information. The results show that FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.
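
The six pre-training tasks are specific to the paper, so the sketch below shows only the standard masked-language-modelling ingredient of domain-adaptive pretraining on a financial corpus, assuming the HuggingFace Transformers and Datasets libraries; the corpus file and hyperparameters are placeholders.

```python
# Domain-adaptive pretraining sketch using plain masked language modelling;
# FinBERT's six pre-training tasks are NOT reproduced here.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "financial_corpus.txt" is a placeholder for a large financial text dump.
corpus = load_dataset("text", data_files={"train": "financial_corpus.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finbert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```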


Author(s):  
Ramsha Saeed ◽  
Hammad Afzal ◽  
Haider Abbas ◽  
Maheen Fatima

Increased connectivity has contributed greatly to facilitating rapid access to information and reliable communication. However, uncontrolled information dissemination has also resulted in the spread of fake news. Fake news may be spread by groups or organizations to serve ulterior motives, such as political or financial gain, or to damage a country's public image. Given the importance of timely detection of fake news, the research area has intrigued researchers from all over the world. Most work on detecting fake news focuses on the English language; however, automated detection of fake news is important irrespective of the language used to spread false information. Recognizing the importance of boosting research on fake news detection for low-resource languages, this work proposes a novel, semantically enriched technique to effectively detect fake news in Urdu, a low-resource language. A model based on deep contextual semantics learned from a convolutional neural network is proposed. The features learned by the convolutional neural network are combined with other n-gram-based features and fed to a conventional majority-voting ensemble classifier fitted with three base learners: Adaptive Boosting, Gradient Boosting, and a Multi-Layer Perceptron. Experiments with different models show that enriching the traditional ensemble learner with deep contextual semantics alongside other standard features yields the best results and outperforms the state-of-the-art Urdu fake news detection model.
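
A minimal sketch of the ensemble stage, assuming scikit-learn: TF-IDF n-gram features stand in for the paper's feature set (the CNN-derived contextual features would be concatenated in the same way), feeding a hard-voting ensemble of the three named base learners. The toy texts and settings are illustrative.

```python
# Majority-voting ensemble over AdaBoost, Gradient Boosting, and an MLP,
# fitted on n-gram features; deep contextual features would be hstacked
# onto X before fitting. Data below is a toy placeholder.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["shocking miracle cure exposed", "official report published today",
         "celebrity hoax goes viral", "parliament passes new budget"]
y = np.array([1, 0, 1, 0])  # 1 = fake, 0 = real

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts).toarray()

ensemble = VotingClassifier(
    estimators=[("ada", AdaBoostClassifier()),
                ("gb", GradientBoostingClassifier()),
                ("mlp", MLPClassifier(max_iter=500))],
    voting="hard")  # simple majority vote over the three base learners
ensemble.fit(X, y)
print(ensemble.predict(X))
```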


Author(s):  
G. Purna Chandar Rao ◽  
V. B. Narasimha

Social media adoption makes it important to establish content authenticity and raise awareness about news that might be fake. A Natural Language Processing (NLP) model is therefore required to identify content properties for language-driven feature generation. The present work utilizes language-driven features covering grammatical, sentiment, syntactic, and readability properties. Features are extracted from the news content to deal with the dimensionality problem, as language-level features are quite complex. A dropout-layer-based Long Short-Term Memory (LSTM) network for sequential learning achieves better results in fake news detection. The results validate that the extracted linguistic features, when combined, achieve better classification accuracy. The proposed dropout-based LSTM model obtained an accuracy of 95.3% for fake news classification and detection, compared to a sequential neural model for fake news detection.
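
A minimal sketch of a dropout-regularized LSTM classifier of the kind described, written with Keras; the vocabulary size, layer widths, and dropout rate are illustrative assumptions rather than the paper's configuration.

```python
# Dropout-based LSTM for binary fake/real classification (schematic).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=128),
    tf.keras.layers.LSTM(64),                        # sequential learning
    tf.keras.layers.Dropout(0.5),                    # the dropout layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, 100))  # e.g. sequences padded to length 100
model.summary()
```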


Author(s):  
Michael Stewart ◽  
Wei Liu

Knowledge Graph Construction (KGC) from text unlocks information held within unstructured text and is critical to a wide range of downstream applications. General approaches to KGC from text are heavily reliant on the existence of knowledge bases, yet most domains do not even have an external knowledge base readily available. In many situations this results in information loss, as a wealth of key information is held within "non-entities". Domain-specific approaches to KGC typically adopt unsupervised pipelines, using carefully crafted linguistic and statistical patterns to extract co-occurring noun phrases as triples, essentially constructing text graphs rather than true knowledge graphs. In this research, for the first time, in the same flavour as Collobert et al.'s seminal 2011 work "Natural language processing (almost) from scratch", we propose a Seq2KG model attempting to achieve "knowledge graph construction (almost) from scratch". The end-to-end Sequence to Knowledge Graph (Seq2KG) neural model jointly learns to generate triples and to resolve entity types as a multi-label classification task through deep neural networks. In addition, a novel evaluation metric that takes both semantic and structural closeness into account is developed for measuring the performance of triple extraction. We show that our end-to-end Seq2KG model performs on par with a state-of-the-art rule-based system that outperformed other neural models and won first prize in the first Knowledge Graph Contest in 2019. A new annotation scheme and three high-quality manually annotated datasets are available to help promote this direction of research.
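
A schematic PyTorch skeleton of a Seq2KG-style architecture: a shared encoder feeds both a triple-generating decoder and a multi-label entity-type head. The dimensions and the use of a vanilla Transformer are illustrative assumptions, not the authors' exact model.

```python
# Joint triple generation and multi-label entity typing (schematic).
import torch
import torch.nn as nn

class Seq2KG(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, n_types=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.gen = nn.Linear(d_model, vocab_size)      # emits triple tokens
        self.type_head = nn.Linear(d_model, n_types)   # multi-label typing

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.embed(src_ids))
        dec = self.decoder(self.embed(tgt_ids), memory)
        triple_logits = self.gen(dec)                     # token-level logits
        type_logits = self.type_head(memory.mean(dim=1))  # one sigmoid per type
        return triple_logits, type_logits

model = Seq2KG()
src = torch.randint(0, 30000, (2, 40))   # a batch of two input documents
tgt = torch.randint(0, 30000, (2, 12))   # shifted triple sequences
triple_logits, type_logits = model(src, tgt)
```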


Author(s):  
Juntao Li ◽  
Ruidan He ◽  
Hai Ye ◽  
Hwee Tou Ng ◽  
Lidong Bing ◽  
...  

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements on a variety of cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of text, cross-lingual language models have proven effective at leveraging high-resource languages to enhance low-resource language processing, outperforming monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.
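
A schematic sketch of the decomposition idea, assuming PyTorch: two projection heads split a pretrained representation into domain-invariant and domain-specific parts, and a MINE-style critic estimates the mutual information between them so that it can be minimized alongside the task loss. The sizes and the critic form are illustrative, not the paper's.

```python
# Mutual-information-based feature decomposition (schematic).
import math
import torch
import torch.nn as nn

d = 768                              # e.g. the hidden size of mBERT/XLM-R
inv_head = nn.Linear(d, 256)         # domain-invariant projection
spec_head = nn.Linear(d, 256)        # domain-specific projection
critic = nn.Bilinear(256, 256, 1)    # scores dependence between the parts

h = torch.randn(32, d)               # pretrained sentence representations
z_inv, z_spec = inv_head(h), spec_head(h)

pos = critic(z_inv, z_spec).squeeze(-1)                      # aligned pairs
neg = critic(z_inv, z_spec[torch.randperm(32)]).squeeze(-1)  # shuffled pairs

# Donsker-Varadhan lower bound on MI (MINE-style); a training loop would
# *minimize* this term to push the two parts toward independence.
mi_lower_bound = pos.mean() - (torch.logsumexp(neg, dim=0) - math.log(32))
mi_lower_bound.backward()
```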


2021 ◽  
Vol 8 (2) ◽  
pp. 205316802110222
Author(s):  
Hannah Béchara ◽  
Alexander Herzog ◽  
Slava Jankin ◽  
Peter John

Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models require the additional step of attaching meaningful labels to estimated topics, a process that is not scalable, suffers from human bias, and is difficult to replicate. We present a transfer topic labeling method that seeks to remedy these problems, using domain-specific codebooks as the knowledge base to automatically label estimated topics. We demonstrate our approach with a large-scale topic model analysis of the complete corpus of UK House of Commons speeches from 1935 to 2014, using the coding instructions of the Comparative Agendas Project to label topics. We evaluated our results using human expert coding and compared our approach with current state-of-the-art neural methods. Our approach was simple to implement, compared favorably to expert judgments, and outperformed the neural network model for a majority of the topics we estimated.
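
A minimal sketch of transfer topic labelling, assuming scikit-learn: each estimated topic's top words are assigned the codebook category whose coding instructions are most similar under TF-IDF cosine similarity. The categories and topic words below are invented for illustration; the paper uses the Comparative Agendas Project codebook.

```python
# Label a topic by its most similar codebook entry (toy example).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

codebook = {  # hypothetical coding instructions, one string per category
    "Health": "hospitals doctors national health service patients care",
    "Defence": "armed forces military navy defence procurement troops",
}
topic_top_words = "health hospital patients waiting treatment"

vec = TfidfVectorizer().fit(list(codebook.values()) + [topic_top_words])
sims = cosine_similarity(vec.transform([topic_top_words]),
                         vec.transform(list(codebook.values())))[0]
print(list(codebook)[sims.argmax()])  # -> "Health"
```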


2020 ◽  
Vol 22 (1) ◽  
pp. 32-45
Author(s):  
Emanuela Martina ◽  
Federico Diotallevi ◽  
Tommaso Bianchelli ◽  
Matteo Paolinelli ◽  
Annamaria Offidani

Background: Chronic Spontaneous Urticaria (CSU) is a disease characterized by the onset of wheals and/or angioedema persisting for more than 6 weeks. The pathophysiology of CSU is very complex, involving mast cells and basophils together with a multitude of inflammatory mediators. For many years the treatment of CSU was based on antihistamines, steroids, and immunosuppressive agents, with inconsistent and frustrating results. The introduction of omalizumab, the only licensed biologic for antihistamine-refractory CSU, has changed the management of the disease. Objective: The aim of this article is to review the current state of the art of CSU, real-life experience with omalizumab, and the promising drugs that are under development. Methods: An electronic search was performed to identify studies, case reports, guidelines, and reviews focused on new targets for the treatment of chronic spontaneous urticaria, both approved and under investigation. The search was limited to articles published in peer-reviewed English-language journals in the PubMed database and to trials registered on Clinicaltrials.gov. Results: Since the advent of omalizumab, the search for new therapies for chronic spontaneous urticaria has gained new impetus. Anti-IgE drugs will probably remain the cornerstone of therapy, but new targets may prove effective in syndromic urticaria or refractory cases. Conclusion: Although omalizumab has been a breakthrough in the treatment of CSU, many patients do not obtain complete benefit and require more effective treatments. Novel drugs are under investigation with promising results.


Author(s):  
Mattson Ogg ◽  
L. Robert Slevc

Music and language are uniquely human forms of communication. What neural structures facilitate these abilities? This chapter conducts a review of music and language processing that follows these acoustic signals as they ascend the auditory pathway from the brainstem to auditory cortex and on to more specialized cortical regions. Acoustic, neural, and cognitive mechanisms are identified where processing demands from both domains might overlap, with an eye to examples of experience-dependent cortical plasticity, which are taken as strong evidence for common neural substrates. Following an introduction describing how understanding musical processing informs linguistic or auditory processing more generally, findings regarding the major components (and parallels) of music and language research are reviewed: pitch perception, syntax and harmonic structural processing, semantics, timbre and speaker identification, attending in auditory scenes, and rhythm. Overall, the strongest evidence that currently exists for neural overlap (and cross-domain, experience-dependent plasticity) is in the brainstem, followed by auditory cortex, with evidence and the potential for overlap becoming less apparent as the mechanisms involved in music and speech perception become more specialized and distinct at higher levels of processing.


Interpreting ◽  
2017 ◽  
Vol 19 (1) ◽  
pp. 1-20 ◽  
Author(s):  
Ena Hodzik ◽  
John N. Williams

We report a study on prediction in shadowing and simultaneous interpreting (SI), both considered as forms of real-time, ‘online’ spoken language processing. The study comprised two experiments, focusing on: (i) shadowing of German head-final sentences by 20 advanced students of German, all native speakers of English; (ii) SI of the same sentences into English head-initial sentences by 22 advanced students of German, again native English speakers, and also by 11 trainee and practising interpreters. Latency times for input and production of the target verbs were measured. Drawing on studies of prediction in English-language reading production, we examined two cues to prediction in both experiments: contextual constraints (semantic cues in the context) and transitional probability (the statistical likelihood of words occurring together in the language concerned). While context affected prediction during both shadowing and SI, transitional probability appeared to favour prediction during shadowing but not during SI. This suggests that the two cues operate on different levels of language processing in SI.
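
For readers unfamiliar with the measure, the toy computation below illustrates transitional probability as the conditional likelihood of one word following another, estimated from bigram counts; the corpus is a stand-in.

```python
# Transitional probability P(w2 | w1) from bigram counts (toy corpus).
from collections import Counter

corpus = "the interpreter renders the speech the interpreter anticipates".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def transitional_probability(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(transitional_probability("the", "interpreter"))  # 2/3 in this corpus
```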

