language identification — Recently Published Documents

TOTAL DOCUMENTS: 986 (five years: 264)
H-INDEX: 30 (five years: 5)

Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

Offensive content is pervasive in social media and a cause for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions onto comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
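The transfer idea above can be sketched at a toy scale: a classifier is trained on English examples represented in a shared embedding space and then applied directly to another language. The tiny hand-made "multilingual embedding" below is an illustrative stand-in for the real cross-lingual contextual embeddings (e.g., XLM-R) used in such systems; all words, vectors, and labels here are invented for the example.

```python
# Toy zero-shot cross-lingual transfer: train on English, predict on Spanish.
# The shared-space assumption: translation pairs map to the same vector.
import numpy as np

EMBED = {
    "insult": np.array([1.0, 0.1]), "insulto": np.array([1.0, 0.1]),  # es
    "idiot":  np.array([0.9, 0.2]), "idiota":  np.array([0.9, 0.2]),  # es
    "hello":  np.array([0.1, 1.0]), "hola":    np.array([0.1, 1.0]),  # es
    "thanks": np.array([0.2, 0.9]), "gracias": np.array([0.2, 0.9]),  # es
}

def embed(text):
    """Mean-pool word vectors; unknown words are ignored."""
    vecs = [EMBED[w] for w in text.lower().split() if w in EMBED]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# "Train" on English only: a nearest-centroid classifier over two classes
# (1 = offensive, 0 = not offensive).
train = [("insult idiot", 1), ("hello thanks", 0)]
centroids = {
    label: np.mean([embed(t) for t, y in train if y == label], axis=0)
    for label in (0, 1)
}

def predict(text):
    v = embed(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

# Zero-shot prediction on Spanish input never seen during training.
print(predict("idiota insulto"))  # offensive class
print(predict("hola gracias"))    # non-offensive class
```

In practice the shared space comes from a pretrained multilingual encoder rather than a lookup table, but the transfer mechanism (fit once in English, predict anywhere in the shared space) is the same.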


2022 ◽  
Author(s):  
Sebastião Pais ◽  
João Cordeiro ◽  
Muhammad Jamil

Abstract Nowadays, the use of language corpora for many purposes has increased significantly. General corpora exist for numerous languages, but research often needs more specialized corpora. The Web's rapid growth has significantly improved access to thousands of online documents, highly specialized texts, and comparable texts on the same subject covering several languages in electronic form. However, research has continued to concentrate on corpus annotation rather than corpus creation tools. Consequently, many researchers create their own corpora, solve problems independently, and build project-specific systems. Corpus construction serves many NLP applications, including machine translation, information retrieval, and question answering. This paper presents a new NLP corpus and cloud service called HULTIG-C. HULTIG-C covers various languages and includes distinctive annotation layers: keyword sets, sentence sets, named-entity sets, and multiword sets. Moreover, a framework incorporating the main components for license detection, language identification, boilerplate removal, and document deduplication is used to process HULTIG-C. Furthermore, this paper presents some potential issues related to constructing multilingual corpora from the Web.
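Two of the pipeline stages named above can be sketched minimally: language identification via character-trigram profiles and document deduplication via content hashing. These are generic stand-ins for illustration, not the actual HULTIG-C components, and the sample texts and languages are invented for the example.

```python
# Minimal sketches of two corpus-pipeline stages:
# (1) language identification by character-trigram profile overlap,
# (2) exact-duplicate removal by hashing whitespace-normalized content.
import hashlib
from collections import Counter

def trigram_profile(text, top=50):
    t = f"  {text.lower()}  "
    grams = Counter(t[i:i+3] for i in range(len(t) - 2))
    return {g for g, _ in grams.most_common(top)}

# Tiny reference profiles built from a sample sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "pt": "o rato roeu a roupa do rei de roma e a rainha ficou feliz",
}
PROFILES = {lang: trigram_profile(s) for lang, s in SAMPLES.items()}

def identify_language(text):
    """Pick the language whose trigram profile overlaps most with the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: len(profile & PROFILES[lang]))

def deduplicate(docs):
    """Drop exact duplicates by hashing whitespace-normalized content."""
    seen, unique = set(), []
    for d in docs:
        h = hashlib.sha1(" ".join(d.split()).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

print(identify_language("the dog and the fox"))  # en
print(deduplicate(["a b", "a  b", "c d"]))       # ['a b', 'c d']
```

Production systems typically use larger profiles (or a trained classifier) for language identification and near-duplicate hashing such as SimHash rather than exact hashes, but the pipeline shape is the same.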


Author(s):  
Mainak Biswas ◽  
Saif Rahaman ◽  
Ali Ahmadian ◽  
Kamalularifin Subari ◽  
Pawan Kumar Singh

2021 ◽  
Vol 1 (2) ◽  
pp. 31-44
Author(s):  
Gayane Gasparyan

The article focuses on the transformations that occur in Russian and Armenian translations of culture-bound constituents in W. Saroyan's fiction, with special reference to the analysis of their pragmatic value and both cross-cultural and cross-language identification. The aim of the analysis is to reveal the so-called Saroyanesque identity and the translation perspectives of his specific manner of reproducing actual reality: his personal vision of the world he lived and created in, a world combining the environment, circumstances, conditions, characters, cultures, and ethnicity of two different communities, his native Armenia and no less native America. The so-called double-sided transformations of culture-bound constituents occur in W. Saroyan's fiction at basically two levels: the cognitive level of transformations of ethnic and mental indicators, and the linguistic level of translation of culture-bound elements (words, phrases, exclamations, etc.). To preserve the Saroyanesque identity, translators should primarily transform the ideas, concepts, and ethnic mentality of the characters; the language media should then undergo certain pragmatic modification to be correctly interpreted by the target audience.


2021 ◽  
Author(s):  
P. Golda Jeyasheeli ◽  
N. Indumathi

About 1 percent of the Indian population is deaf and mute. Deaf and mute people use gestures to interact with each other, but ordinary people often fail to grasp the significance of these gestures, which makes interaction between the two groups hard. To help ordinary citizens understand the signs, an automated sign language identification system is proposed. A smart wearable hand device is designed by attaching different sensors to a glove to capture the gestures. Each gesture produces unique sensor values, and those values are collected as Excel data. The characteristics of the movements are extracted and categorized with the aid of a convolutional neural network (CNN), which then classifies gestures from the test set. The objective of this system is to bridge the interaction gap between people who are deaf or hard of hearing and the rest of society.
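The data flow of such a classifier can be sketched at the shape level: a 1-D convolution over a window of glove-sensor readings, ReLU, global max pooling, and a dense layer producing gesture-class scores. This is a minimal NumPy sketch under assumed dimensions (5 flex sensors, 20 time steps, 10 gesture classes); the weights are random and untrained, so only the architecture, not the accuracy, is illustrated.

```python
# Shape-level sketch of a 1-D CNN over glove-sensor time series.
import numpy as np

rng = np.random.default_rng(0)
N_SENSORS, T, N_CLASSES, N_FILTERS, K = 5, 20, 10, 8, 3  # assumed dimensions

def conv1d(x, w):
    """x: (sensors, time); w: (filters, sensors, kernel) -> (filters, time-K+1)."""
    F, _, k = w.shape
    out = np.zeros((F, x.shape[1] - k + 1))
    for f in range(F):
        for t in range(out.shape[1]):
            out[f, t] = np.sum(w[f] * x[:, t:t+k])  # sliding dot product
    return out

def forward(x, w_conv, w_dense):
    h = np.maximum(conv1d(x, w_conv), 0)  # ReLU non-linearity
    pooled = h.max(axis=1)                # global max pool -> (filters,)
    return pooled @ w_dense               # class scores -> (classes,)

x = rng.normal(size=(N_SENSORS, T))       # one gesture: 5 sensors x 20 steps
w_conv = rng.normal(size=(N_FILTERS, N_SENSORS, K))
w_dense = rng.normal(size=(N_FILTERS, N_CLASSES))

scores = forward(x, w_conv, w_dense)
print(scores.shape)                       # (10,): one score per gesture class
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
```

A real system would train these weights on the collected sensor data (e.g., with a deep-learning framework) and would likely add batching and more layers, but the forward pass has this structure.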


Author(s):  
Rajesh Kumar Mundotiya ◽  
Manish Kumar Singh ◽  
Rahul Kapur ◽  
Swasti Mishra ◽  
Anil Kumar Singh

Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties, such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. 
The inter-annotator agreement for these annotations, using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These (annotated) corpora have been used for developing preliminary automated tools, which include POS tagger, Chunker, and Language Identifier. We have also developed the Bilingual dictionary (Purvanchal languages to Hindi) and a Synset (that can be integrated later in the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources for facilitating further language processing research for these languages, providing some quantitative measures about them and their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is providing baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
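The n-gram-based similarity idea mentioned above can be illustrated generically: compare the character-trigram frequency distributions of two text samples with cosine similarity. This is a hedged sketch of the general technique, not the paper's exact measure, and the two romanized sample sentences below are invented for the example.

```python
# Generic n-gram-based language similarity: cosine similarity between
# character-trigram frequency distributions of two text samples.
import math
from collections import Counter

def ngram_counts(text, n=3):
    t = f" {text.lower()} "
    return Counter(t[i:i+n] for i in range(len(t) - n + 1))

def cosine_similarity(a, b):
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented romanized samples standing in for closely related varieties.
sample_a = "ham ghar ja rahe hain aur wah bhi aa raha hai"
sample_b = "ham ghar ja rahal bani aur u bhi aawat ba"

sim = cosine_similarity(ngram_counts(sample_a), ngram_counts(sample_b))
print(round(sim, 2))  # closer to 1.0 means more similar trigram profiles
```

Computed over full corpora rather than single sentences, and combined with a language-identification model's confusion behavior, such profile comparisons give a simple quantitative handle on how close two related languages are.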

