language identification — Recently Published Documents

TOTAL DOCUMENTS: 986 (five years: 264)
H-INDEX: 30 (five years: 5)

Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

Offensive content is pervasive in social media and a cause for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions onto comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
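The transfer idea above can be sketched at a toy scale: a classifier is trained on English examples represented in a shared embedding space and then applied directly to another language. The tiny hand-made "multilingual embedding" below is an illustrative stand-in for the real cross-lingual contextual embeddings (e.g., XLM-R) used in such systems; all words, vectors, and labels here are invented for the example.

```python
# Toy zero-shot cross-lingual transfer: train on English, predict on Spanish.
# The shared-space assumption: translation pairs map to the same vector.
import numpy as np

EMBED = {
    "insult": np.array([1.0, 0.1]), "insulto": np.array([1.0, 0.1]),  # es
    "idiot":  np.array([0.9, 0.2]), "idiota":  np.array([0.9, 0.2]),  # es
    "hello":  np.array([0.1, 1.0]), "hola":    np.array([0.1, 1.0]),  # es
    "thanks": np.array([0.2, 0.9]), "gracias": np.array([0.2, 0.9]),  # es
}

def embed(text):
    """Mean-pool word vectors; unknown words are ignored."""
    vecs = [EMBED[w] for w in text.lower().split() if w in EMBED]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# "Train" on English only: a nearest-centroid classifier over two classes
# (1 = offensive, 0 = not offensive).
train = [("insult idiot", 1), ("hello thanks", 0)]
centroids = {
    label: np.mean([embed(t) for t, y in train if y == label], axis=0)
    for label in (0, 1)
}

def predict(text):
    v = embed(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

# Zero-shot prediction on Spanish input never seen during training.
print(predict("idiota insulto"))  # offensive class
print(predict("hola gracias"))    # non-offensive class
```

In practice the shared space comes from a pretrained multilingual encoder rather than a lookup table, but the transfer mechanism (fit once in English, predict anywhere in the shared space) is the same.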


2022 ◽  
Author(s):  
Sebastião Pais ◽  
João Cordeiro ◽  
Muhammad Jamil

Abstract Nowadays, the use of language corpora for many purposes has increased significantly. General corpora exist for numerous languages, but research often needs more specialized corpora. The Web's rapid growth has significantly improved access to thousands of online documents, highly specialized texts, and comparable texts on the same subject covering several languages in electronic form. However, research has continued to concentrate on corpus annotation rather than corpus creation tools. Consequently, many researchers create their own corpora, solve problems independently, and build project-specific systems. Corpus construction serves many NLP applications, including machine translation, information retrieval, and question answering. This paper presents a new NLP corpus and cloud service called HULTIG-C. HULTIG-C covers various languages and includes distinctive annotation layers: keyword sets, sentence sets, named-entity sets, and multiword sets. Moreover, a framework incorporating the main components for license detection, language identification, boilerplate removal, and document deduplication is used to process HULTIG-C. Furthermore, this paper presents some potential issues related to constructing multilingual corpora from the Web.
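Two of the pipeline stages named above can be sketched minimally: language identification via character-trigram profiles and document deduplication via content hashing. These are generic stand-ins for illustration, not the actual HULTIG-C components, and the sample texts and languages are invented for the example.

```python
# Minimal sketches of two corpus-pipeline stages:
# (1) language identification by character-trigram profile overlap,
# (2) exact-duplicate removal by hashing whitespace-normalized content.
import hashlib
from collections import Counter

def trigram_profile(text, top=50):
    t = f"  {text.lower()}  "
    grams = Counter(t[i:i+3] for i in range(len(t) - 2))
    return {g for g, _ in grams.most_common(top)}

# Tiny reference profiles built from a sample sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "pt": "o rato roeu a roupa do rei de roma e a rainha ficou feliz",
}
PROFILES = {lang: trigram_profile(s) for lang, s in SAMPLES.items()}

def identify_language(text):
    """Pick the language whose trigram profile overlaps most with the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: len(profile & PROFILES[lang]))

def deduplicate(docs):
    """Drop exact duplicates by hashing whitespace-normalized content."""
    seen, unique = set(), []
    for d in docs:
        h = hashlib.sha1(" ".join(d.split()).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

print(identify_language("the dog and the fox"))  # en
print(deduplicate(["a b", "a  b", "c d"]))       # ['a b', 'c d']
```

Production systems typically use larger profiles (or a trained classifier) for language identification and near-duplicate hashing such as SimHash rather than exact hashes, but the pipeline shape is the same.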


Author(s):  
Mainak Biswas ◽  
Saif Rahaman ◽  
Ali Ahmadian ◽  
Kamalularifin Subari ◽  
Pawan Kumar Singh

2021 ◽  
Vol 1 (2) ◽  
pp. 31-44
Author(s):  
Gayane Gasparyan

The article focuses on the transformations that occur in Russian and Armenian translations of culture-bound constituents in W. Saroyan's fiction, with special reference to the analysis of their pragmatic value and both cross-cultural and cross-language identification. The aim of the analysis is to reveal the so-called Saroyanesque identity and the translation perspectives of his specific manner of reproducing actual reality: his personal vision of the world he lived and created in, a world combining the environment, circumstances, conditions, characters, cultures, and ethnicity of two different communities, his native Armenia and no less native America. The so-called double-sided transformations of culture-bound constituents occur in W. Saroyan's fiction at basically two levels: the cognitive level of transformations of ethnic and mental indicators, and the linguistic level of translation of culture-bound elements (words, phrases, exclamations, etc.). To preserve the Saroyanesque identity, translators should primarily transform the ideas, concepts, and ethnic mentality of the characters; the language media should then undergo certain pragmatic modification to be correctly interpreted by the target audience.


2021 ◽  
Author(s):  
P. Golda Jeyasheeli ◽  
N. Indumathi

About 1 percent of the Indian population is deaf and mute. Deaf and mute people use gestures to interact with each other, but ordinary people often fail to grasp the significance of these gestures, which makes interaction between the two groups hard. To help ordinary citizens understand the signs, an automated sign language identification system is proposed. A smart wearable hand device is designed by attaching different sensors to a glove to capture the gestures. Each gesture produces unique sensor values, and those values are collected as Excel data. The characteristics of the movements are extracted and categorized with the aid of a convolutional neural network (CNN), which then classifies gestures from the test set. The objective of this system is to bridge the interaction gap between people who are deaf or hard of hearing and the rest of society.
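The data flow of such a classifier can be sketched at the shape level: a 1-D convolution over a window of glove-sensor readings, ReLU, global max pooling, and a dense layer producing gesture-class scores. This is a minimal NumPy sketch under assumed dimensions (5 flex sensors, 20 time steps, 10 gesture classes); the weights are random and untrained, so only the architecture, not the accuracy, is illustrated.

```python
# Shape-level sketch of a 1-D CNN over glove-sensor time series.
import numpy as np

rng = np.random.default_rng(0)
N_SENSORS, T, N_CLASSES, N_FILTERS, K = 5, 20, 10, 8, 3  # assumed dimensions

def conv1d(x, w):
    """x: (sensors, time); w: (filters, sensors, kernel) -> (filters, time-K+1)."""
    F, _, k = w.shape
    out = np.zeros((F, x.shape[1] - k + 1))
    for f in range(F):
        for t in range(out.shape[1]):
            out[f, t] = np.sum(w[f] * x[:, t:t+k])  # sliding dot product
    return out

def forward(x, w_conv, w_dense):
    h = np.maximum(conv1d(x, w_conv), 0)  # ReLU non-linearity
    pooled = h.max(axis=1)                # global max pool -> (filters,)
    return pooled @ w_dense               # class scores -> (classes,)

x = rng.normal(size=(N_SENSORS, T))       # one gesture: 5 sensors x 20 steps
w_conv = rng.normal(size=(N_FILTERS, N_SENSORS, K))
w_dense = rng.normal(size=(N_FILTERS, N_CLASSES))

scores = forward(x, w_conv, w_dense)
print(scores.shape)                       # (10,): one score per gesture class
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
```

A real system would train these weights on the collected sensor data (e.g., with a deep-learning framework) and would likely add batching and more layers, but the forward pass has this structure.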


Author(s):  
Rajesh Kumar Mundotiya ◽  
Manish Kumar Singh ◽  
Rahul Kapur ◽  
Swasti Mishra ◽  
Anil Kumar Singh

Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties, such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. 
The inter-annotator agreement for these annotations, using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These (annotated) corpora have been used for developing preliminary automated tools, which include POS tagger, Chunker, and Language Identifier. We have also developed the Bilingual dictionary (Purvanchal languages to Hindi) and a Synset (that can be integrated later in the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources for facilitating further language processing research for these languages, providing some quantitative measures about them and their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is providing baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
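The n-gram-based similarity idea mentioned above can be illustrated generically: compare the character-trigram frequency distributions of two text samples with cosine similarity. This is a hedged sketch of the general technique, not the paper's exact measure, and the two romanized sample sentences below are invented for the example.

```python
# Generic n-gram-based language similarity: cosine similarity between
# character-trigram frequency distributions of two text samples.
import math
from collections import Counter

def ngram_counts(text, n=3):
    t = f" {text.lower()} "
    return Counter(t[i:i+n] for i in range(len(t) - n + 1))

def cosine_similarity(a, b):
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented romanized samples standing in for closely related varieties.
sample_a = "ham ghar ja rahe hain aur wah bhi aa raha hai"
sample_b = "ham ghar ja rahal bani aur u bhi aawat ba"

sim = cosine_similarity(ngram_counts(sample_a), ngram_counts(sample_b))
print(round(sim, 2))  # closer to 1.0 means more similar trigram profiles
```

Computed over full corpora rather than single sentences, and combined with a language-identification model's confusion behavior, such profile comparisons give a simple quantitative handle on how close two related languages are.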

