low resource
Recently Published Documents


TOTAL DOCUMENTS

4177
(FIVE YEARS 2762)

H-INDEX

57
(FIVE YEARS 22)

Author(s):  
Fan Xu ◽  
Yangjie Dan ◽  
Keyu Yan ◽  
Yong Ma ◽  
Mingwen Wang

Chinese dialect discrimination is a challenging natural language processing task due to scarce annotation resources. In this article, we develop a novel Chinese dialect discrimination framework with transfer learning and data augmentation (CDDTLDA) to overcome this shortage of resources. Specifically, we first use a relatively large Chinese dialect corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise perturbation) to augment the target-side low-resource Chinese dialects, and fine-tune a target ASR model initialized from the source-side ASR model. Meanwhile, the potential common semantic features between the source-side and target-side ASR models are captured by a self-attention mechanism. Finally, we extract the hidden semantic representation of the target ASR model to perform Chinese dialect discrimination. Extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
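The abstract does not give implementation details for the perturbations, but speed and noise disturbance of this kind can be sketched in plain NumPy (function names and parameter values are illustrative; production pipelines typically use toolkit front-ends such as torchaudio or Kaldi for this):

```python
import numpy as np

def add_noise(wave, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)

def change_speed(wave, factor):
    """Resample so the clip plays back `factor` times faster.
    Simple linear interpolation; this also shifts pitch, unlike a phase vocoder."""
    new_len = int(len(wave) / factor)
    new_idx = np.linspace(0, len(wave) - 1, new_len)
    return np.interp(new_idx, np.arange(len(wave)), wave)

# Toy example: a 440 Hz tone sampled at 16 kHz, augmented three ways
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
augmented = [add_noise(tone), change_speed(tone, 1.1), change_speed(tone, 0.9)]
```

Each augmented copy is then treated as an additional training utterance with the original transcript, multiplying the effective size of the low-resource corpus.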


Cytokine ◽  
2022 ◽  
Vol 150 ◽  
pp. 155775
Author(s):  
Emily R. Konrad ◽  
Jeremy Soo ◽  
Andrea L. Conroy ◽  
Sophie Namasopo ◽  
Robert O. Opoka ◽  
...  

Author(s):  
Arkadipta De ◽  
Dibyanayan Bandyopadhyay ◽  
Baban Gain ◽  
Asif Ekbal

Fake news classification is one of the most interesting problems to have attracted huge attention from researchers in artificial intelligence, natural language processing, and machine learning (ML). Most current work on fake news detection is in the English language, which has limited its widespread usability, especially outside the English-literate population. Although multilingual web content has grown, fake news classification in low-resource languages remains a challenge due to the non-availability of annotated corpora and tools. This article proposes an effective neural model based on multilingual Bidirectional Encoder Representations from Transformers (BERT) for domain-agnostic multilingual fake news classification. A large variety of experiments, including language-specific and domain-specific settings, are conducted. The proposed model achieves high accuracy in both domain-specific and domain-agnostic experiments, and it also outperforms the current state-of-the-art models. We perform experiments in zero-shot settings to assess the effectiveness of language-agnostic feature transfer across different languages, showing encouraging results. Cross-domain transfer experiments are also performed to assess the language-independent feature transfer of the model. We also offer a multilingual multidomain fake news detection dataset of five languages and seven different domains that could be useful for research and development in resource-scarce scenarios.
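The zero-shot protocol itself is simple to state: train on labelled data from some languages, then evaluate on a language whose labels were never seen during training. A dependency-free toy stand-in (hashed character n-grams plus a perceptron instead of multilingual BERT; all data and names here are invented for illustration) shows the shape of such an experiment:

```python
def featurize(text, n=3, dim=256):
    """Hash character n-grams into a fixed-size bag-of-features vector."""
    vec = [0.0] * dim
    for i in range(len(text) - n + 1):
        vec[hash(text[i:i + n]) % dim] += 1.0
    return vec

def train_perceptron(examples, epochs=10, dim=256):
    """Plain perceptron; examples are (text, label) pairs with label in {0, 1}."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for text, y in examples:
            x = featurize(text, dim=dim)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                sign = 1 if y == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                b += sign
    return w, b

def accuracy(model, examples, dim=256):
    w, b = model
    hits = sum((1 if sum(wi * xi for wi, xi in zip(w, featurize(t, dim=dim))) + b > 0
                else 0) == y for t, y in examples)
    return hits / len(examples)

# Zero-shot setup: train on "source language" examples only (1 = fake, 0 = real)...
train_src = [("breaking!!! shocking cure", 1), ("council approves budget", 0),
             ("you won't believe this trick", 1), ("quarterly report released", 0)]
# ...then evaluate on a "target language" never seen in training.
test_tgt = [("shocking!!! miracle cure", 1), ("annual budget approved", 0)]
model = train_perceptron(train_src)
score = accuracy(model, test_tgt)
```

With multilingual BERT, the shared subword vocabulary and pre-trained encoder play the role the hashed feature space plays here, which is what makes transfer to an unseen language possible at all.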


Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

Offensive content is pervasive in social media and a cause for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, in part because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions onto comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report macro F1 scores of 0.8415 for Bengali in the TRAC-2 shared task [23], 0.8532 for Danish and 0.8701 for Greek in OffensEval 2020 [58], 0.8568 for Hindi in the HASOC 2019 shared task [27], and 0.7513 for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
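The scores above are macro-averaged F1, i.e., the unweighted mean of per-class F1, which prevents the majority (non-offensive) class from dominating the metric. A minimal reference implementation, equivalent to scikit-learn's `f1_score(..., average='macro')` for the classes present:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

For example, `macro_f1([0, 1, 0, 1], [0, 1, 1, 1])` averages an F1 of 2/3 on class 0 and 0.8 on class 1, giving about 0.733.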


Author(s):  
Ramsha Saeed ◽  
Hammad Afzal ◽  
Haider Abbas ◽  
Maheen Fatima

Increased connectivity has contributed greatly to facilitating rapid access to information and reliable communication. However, uncontrolled information dissemination has also resulted in the spread of fake news. Fake news might be spread by a group of people or organizations to serve ulterior motives, such as political or financial gain, or to damage a country's public image. Given the importance of timely detection of fake news, the research area has intrigued researchers from all over the world. Most work on detecting fake news focuses on the English language. However, automated detection of fake news is important irrespective of the language used for spreading false information. Recognizing the importance of boosting research on fake news detection for low-resource languages, this work proposes a novel semantically enriched technique to effectively detect fake news in Urdu, a low-resource language. A model based on deep contextual semantics learned by a convolutional neural network is proposed. The features learned by the convolutional neural network are combined with other n-gram-based features and fed to a conventional majority-voting ensemble classifier fitted with three base learners: Adaptive Boosting, Gradient Boosting, and a Multi-Layer Perceptron. Experiments are performed with different models, and the results show that enriching the traditional ensemble learner with deep contextual semantics along with other standard features yields the best results and outperforms the state-of-the-art Urdu fake news detection model.
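Hard majority voting over three base learners reduces to taking the modal prediction per example. A dependency-free sketch of just the voting step (the prediction lists below are invented stand-ins for the AdaBoost, Gradient Boosting, and Multi-Layer Perceptron outputs the abstract describes):

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model prediction lists into one label per example by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) breaks exact ties by first-seen order among the votes
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical base learners' predictions for five articles (1 = fake, 0 = real)
ada_preds = [1, 0, 1, 1, 0]
gb_preds  = [1, 0, 0, 1, 0]
mlp_preds = [0, 0, 1, 1, 1]
final = majority_vote([ada_preds, gb_preds, mlp_preds])  # -> [1, 0, 1, 1, 0]
```

An odd number of base learners, as used here, guarantees no ties in the binary case.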


Author(s):  
Saurabh R. Sangwan ◽  
M. P. S. Bhatia

Cyberspace has been recognized as a conducive environment for the use of various hostile, direct, and indirect behavioural tactics to target individuals or groups. Denigration is one of the most frequently used cyberbullying ploys to actively damage, humiliate, and disparage the online reputation of a target by sending, posting, or publishing cruel rumours, gossip, and untrue statements. Previous pertinent studies report detecting profane, vulgar, and offensive words primarily in the English language. This research puts forward a model to detect online denigration bullying in the low-resource Hindi language using attention residual networks. The proposed model, Hindi Denigrate Comment–Attention Residual Network (HDC-ARN), intends to uncover defamatory posts (denigrate comments) written in the Hindi language that vilify a person or an entity in public. Data with 942 denigrate comments and 1,499 non-denigrate comments were scraped using certain hashtags from two recent trending events in India: the Tablighi Jamaat COVID-19 spike (April 2020, Event 1) and the death of Sushant Singh Rajput (June 2020, Event 2). Only text-based features, that is, the actual content of the post, are considered. Pre-trained fastText word embeddings for Hindi are used. The model has three ResNet blocks with an attention layer that generates a post vector for a single input, which is passed through a sigmoid activation function to obtain the final output: either denigrate (positive class) or non-denigrate (negative class). An F1 score of 0.642 is achieved on the dataset.
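The final steps described above, attention pooling of token representations into a single post vector followed by a sigmoid readout, can be sketched in NumPy; the shapes and parameter names here are illustrative, not taken from HDC-ARN:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, w):
    """Collapse token vectors H (seq_len x dim) into one post vector.
    w (dim,) scores each token; softmax turns the scores into weights."""
    scores = np.tanh(H) @ w           # one scalar score per token
    alpha = softmax(scores)           # attention weights, sum to 1
    return alpha @ H, alpha           # weighted sum of token vectors

def classify(H, w, v, b):
    """Sigmoid over a linear readout of the post vector: P(denigrate)."""
    post_vec, alpha = attention_pool(H, w)
    prob = 1.0 / (1.0 + np.exp(-(post_vec @ v + b)))
    return prob, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))   # 6 tokens, 8-dim features (e.g., ResNet-block output)
w = rng.normal(size=8)        # attention parameters
v = rng.normal(size=8)        # classifier weights
prob, alpha = classify(H, w, v, 0.0)
```

Thresholding `prob` at 0.5 yields the denigrate / non-denigrate decision; the attention weights `alpha` also indicate which tokens drove the prediction.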

