Offensive Language: Recently Published Documents

TOTAL DOCUMENTS: 215 (five years: 173)
H-INDEX: 9 (five years: 4)

Author(s): Tharindu Ranasinghe, Marcos Zampieri

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these five languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
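The core idea behind this kind of cross-lingual transfer is that a classifier trained only on English labels can be applied unchanged to another language, because both languages are mapped into one shared embedding space. A minimal sketch of that idea, using a toy hand-built "cross-lingual" embedding space and a nearest-centroid classifier in place of the XLM-R contextual embeddings and fine-tuned transformer the paper actually uses (all words, vectors, and labels below are invented for illustration):

```python
# Toy "cross-lingual" embedding space: translation pairs share a vector,
# which is the property that makes zero-shot transfer possible.
EMB = {
    "idiot":  (1.0, 0.0),  "stupid": (0.9, 0.1),   # English, offensive
    "hello":  (0.0, 1.0),  "friend": (0.1, 0.9),   # English, neutral
    "dum":    (0.9, 0.1),  "hej":    (0.0, 1.0),   # "Danish" counterparts
}

def sentence_vec(tokens):
    """Average the embeddings of known tokens into one sentence vector."""
    xs = [EMB[t] for t in tokens if t in EMB]
    n = max(len(xs), 1)
    return (sum(x for x, _ in xs) / n, sum(y for _, y in xs) / n)

def train_centroids(examples):
    """Nearest-centroid 'classifier' over embedded sentences."""
    cents = {}
    for label in {y for _, y in examples}:
        vecs = [sentence_vec(x) for x, y in examples if y == label]
        cents[label] = (sum(v[0] for v in vecs) / len(vecs),
                        sum(v[1] for v in vecs) / len(vecs))
    return cents

def predict(cents, tokens):
    """Assign the label whose centroid is closest to the sentence vector."""
    v = sentence_vec(tokens)
    return min(cents, key=lambda l: (v[0] - cents[l][0]) ** 2
                                    + (v[1] - cents[l][1]) ** 2)

# Train only on English...
english = [(["idiot"], "OFF"), (["stupid"], "OFF"),
           (["hello"], "NOT"), (["friend"], "NOT")]
model = train_centroids(english)

# ...then predict on target-language tokens never seen during training.
print(predict(model, ["dum"]))   # offensive word in the target language
print(predict(model, ["hej"]))   # neutral word in the target language
```

Because the classifier only ever sees vectors, not surface forms, any language whose words land in the same space inherits the English decision boundary; that is what the shared-task results above measure at scale.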


2022, Vol 5 (1), pp. 9
Author(s): Junjie Liu, Yong Yang, Xiaochao Fan, Ge Ren, Liang Yang, ...

The rapid identification of offensive language in social media is of great significance for preventing viral spread and reducing the dissemination of malicious information, such as cyberbullying and content related to self-harm. In existing research, the public datasets of offensive language are small, the label quality is uneven, and the performance of pre-trained models is not satisfactory. To overcome these problems, we proposed a multi-semantic fusion model based on data augmentation (MSF). Data augmentation was carried out through back translation, reducing the impact of small datasets on performance. At the same time, we used a novel fusion mechanism that combines word-level semantic features and character n-gram features. The experimental results on two datasets showed that the proposed model can effectively extract the semantic information of offensive language and achieve state-of-the-art performance on both datasets.
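The two building blocks of this abstract can be sketched concretely: back-translation augmentation (paraphrase a sentence through another language and keep the original label) and feature fusion by concatenating word-level and character n-gram views. Real back translation needs a machine-translation system; the lookup table below is a hypothetical stand-in for the round trip, and the feature scheme is a simplification of whatever the MSF model actually fuses:

```python
def char_ngrams(text, n=3):
    """Character n-grams capture obfuscations like 'st*pid' or 'stuupid'."""
    s = f"<{text}>"                      # boundary markers
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def fused_features(text):
    """Fusion by concatenating word-level and character-level features."""
    words = [f"w:{w}" for w in text.lower().split()]
    chars = [f"c:{g}" for g in char_ngrams(text.lower())]
    return words + chars

# Hypothetical back-translation table standing in for an MT round trip
# (e.g. en -> de -> en); a paraphrase inherits the original label.
BACK_TRANSLATION = {"you are stupid": "you are dumb"}

def augment(dataset):
    out = list(dataset)
    for text, label in dataset:
        para = BACK_TRANSLATION.get(text)
        if para:
            out.append((para, label))
    return out

data = [("you are stupid", "OFF")]
augmented = augment(data)
print(len(augmented))        # 2: the original plus its back-translated copy
print(fused_features("hi"))  # word feature plus two character trigrams
```

Concatenation is the simplest fusion choice; the point is that the classifier downstream sees both views at once, so word-level semantics and spelling-level evidence can compensate for each other on small, noisy datasets.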


2021, Vol 6 (2)
Author(s): Christine Riggle, Mary Samouelian

Inclusive and conscious archival description can support consistency in researching and describing marginalized groups and can serve to provide context and a counter-narrative reflecting the perspective of the documented community. It can also help to address the power imbalances between creators and subjects of records. In this article, the authors describe efforts to prepare best practice guidelines for inclusive description and for revising descriptions to remediate outdated, problematic, or offensive language and meet modern standards. They also share how the project team is working together to create meaningful and enduring changes that both provide a better experience for staff and users and support Harvard Business School’s Action Plan for Racial Equality.


2021, pp. 312-317
Author(s): Rina Zviel-Girshin, Tanara Zingano Kuhn, Ana R. Luís, Kristina Koppel, Branislava Šandrih Todorović, ...

Despite the unquestionable academic interest in corpus-based approaches to language education, the use of corpora by teachers in their everyday practice is still not very widespread. One way to promote the use of corpora in language teaching is to build pedagogically appropriate corpora, labelled with different types of problems (for instance, sensitive content, offensive language, or structural problems), so that teachers can select authentic examples according to their needs. Because manually labelling corpora is extremely time-consuming, we propose to use crowdsourcing for this task. After a first exploratory phase, we are currently developing a multimode, multilanguage game in which players first identify problematic sentences and then classify them.
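Crowdsourced labels like the ones this game collects are usually noisy, so each sentence needs several player judgments before a corpus label is assigned. A minimal sketch of one common aggregation scheme, majority vote with an agreement score (the vote threshold and category names below are invented, not taken from the paper):

```python
from collections import Counter

def aggregate(votes, min_votes=3):
    """Majority-vote aggregation of crowd labels for one sentence.

    votes: list of problem-type labels submitted by players.
    Returns (winning_label, agreement_ratio), or (None, 0.0) if the
    sentence has not yet received enough judgments.
    """
    if len(votes) < min_votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Four players flagged the same sentence with a problem type.
votes = ["offensive", "offensive", "sensitive", "offensive"]
print(aggregate(votes))   # ('offensive', 0.75)
```

Low-agreement sentences can then be routed back into the game for more judgments instead of being labelled prematurely, which is one way crowdsourcing keeps quality comparable to manual annotation.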


Author(s): Vildan Mercan, Akhtar Jamil, Alaa Ali Hameed, Irfan Ahmed Magsi, Sibghatullah Bazai, ...

Information, 2021, Vol 12 (10), pp. 418
Author(s): Daniela America da Silva, Henrique Duarte Borges Louro, Gildarcio Sousa Goncalves, Johnny Cardoso Marques, Luiz Alberto Vieira Dias, ...

In recent years, we have seen widespread use of Artificial Intelligence (AI) applications on the Internet and beyond. Natural Language Processing and Machine Learning are important sub-fields of AI that have made Chatbots and Conversational AI applications possible. These algorithms build language models from historical data; however, historical data can be intrinsically discriminatory. This article investigates whether a Conversational AI can identify offensive language, and it shows how large language models often produce unethical behavior because of bias in the historical data. Our low-level proof of concept presents the challenges of detecting offensive language in social media and discusses steps toward strong results in detecting offensive language and unethical behavior with a Conversational AI.
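The proof-of-concept pattern this abstract describes can be sketched as a guardrail in a conversational pipeline: score each user message with an offensive-language detector before the bot is allowed to reply. The tiny lexicon detector and the function names below are placeholders for illustration; a real system would plug in a trained classifier like those discussed in the articles above:

```python
# Placeholder lexicon; a production system would use a trained classifier.
OFFENSIVE_LEXICON = {"idiot", "stupid", "moron"}

def offensive_score(text):
    """Fraction of tokens flagged by the (toy) detector."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in OFFENSIVE_LEXICON)
    return hits / max(len(tokens), 1)

def respond(message, threshold=0.2):
    """Run the detector before generating any reply."""
    if offensive_score(message) >= threshold:
        return "I can't engage with that. Let's keep things respectful."
    return f"Echo: {message}"        # stand-in for the actual chatbot reply

print(respond("you are an idiot"))
print(respond("hello there"))
```

Putting the detector in front of (and, symmetrically, behind) the language model is one concrete way to intercept the biased or unethical outputs that historical training data can produce, rather than trying to scrub the model itself.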

