Asian languages
Recently Published Documents





Zulqarnain Nazir ◽  
Khurram Shahzad ◽  
Muhammad Kamran Malik ◽  
Waheed Anwar ◽  
Imran Sarwar Bajwa ◽  

Authorship attribution refers to examining the writing style of authors to determine the most likely author of a document from a given set of potential authors. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions: First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer substitute for real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) Our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) Convolutional Neural Networks are the most effective technique, as they achieved a nearly perfect F1 score of 0.989 for an existing corpus and 0.910 for our newly developed corpus.
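
The general idea of attributing a document by writing style can be illustrated with a minimal sketch. This is not the method used in the study above, which evaluates 194 stylometry features and several classifiers including CNNs. It is a toy character n-gram profile comparison, and the function names, the sample texts, and the choice of trigrams with cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, a simple stylometric feature."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(document, known_texts, n=3):
    """Return the candidate author whose profile is closest to the document."""
    doc_profile = char_ngrams(document, n)
    profiles = {
        author: char_ngrams(" ".join(texts), n)
        for author, texts in known_texts.items()
    }
    return max(profiles, key=lambda author: cosine(doc_profile, profiles[author]))


# Hypothetical toy data: two candidate authors with two known texts each.
known = {
    "author_a": ["the cat sat on the mat", "the cat ate the rat"],
    "author_b": ["stock markets rallied sharply today", "markets fell as stocks slid"],
}
predicted = attribute("the rat sat on the cat", known)
print(predicted)
```

With 94 candidate authors, as in the corpus described above, such a nearest-profile approach would be expected to degrade, which is one reason larger feature sets and learned models are evaluated instead.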

PLoS ONE ◽  
2021 ◽  
Vol 16 (12) ◽  
pp. e0261074
Daniel Vujcich ◽  
Meagan Roberts ◽  
Zhihong Gu ◽  
Shih-Chi Kao ◽  
Roanna Lobo ◽  

Background: Migrants are underrepresented in population health surveys. Offering translated survey instruments has been shown to increase migrant representation. While ‘team translation’ represents current best practice, there are relatively few published examples describing how it has been implemented. The purpose of this paper is to document the process, results and lessons from a project to translate an English-language sexual health and blood-borne virus survey into Khmer, Karen, Vietnamese and Traditional Chinese. Methods: The approach to translation was based on the TRAPD (Translation, Review, Adjudication, Pretesting, and Documentation) model. The English-language survey was sent to two accredited, independent translators. At least one bilingual person was chosen to review and compare the translations and preferred translations were selected through consensus. Agreed translations were pretested with small samples of individuals fluent in the survey language and further revisions made. Results: Of the 51 survey questions, only nine resulted in identical independent translations in at least one language. Material differences between the translations related to: (1) the translation of technical terms and medical terminology (e.g. HIV); (2) variations in dialect; and (3) differences in cultural understandings of survey concepts (e.g. committed relationships). Conclusion: Survey translation is time-consuming and costly and, as a result, deviations from TRAPD ‘best practice’ occurred. It is not possible to determine whether closer adherence to TRAPD ‘best practice’ would have improved the quality of the resulting translations. However, our study does demonstrate that even adaptations of the TRAPD method can identify issues that may not have been apparent had non-team-based or single-round translation approaches been adopted. Given the dearth of clear empirical evidence about the most accurate and feasible method of undertaking translations, we encourage future researchers to follow our example of making translation data publicly available to enhance transparency and enable critical appraisal.

2021 ◽  
pp. 173-183
Yuan Yichuan ◽  
He Yinhua ◽  
Yuan Yuan ◽  
Zhang Yi

Wolfgang Schweickard

Abstract: The article deals with language contact between Italian and South and Southeast Asian languages in the age of the Renaissance. The focus is on South/Southeast Asian lexical elements in Italian travelogues, studies on natural history and missionary reports from the late 15th to the early 17th centuries and their lexicographical treatment.

2021 ◽  
pp. 1-24
Vittrant Alice ◽  
Mouton Léa

Abstract: This article focuses on classifiers, one system of the nominal classification domain which is found in Southeast Asian languages. One of the functions associated with classifiers is the categorization of the nominal lexicon according to the semantic characteristics of the referent. Unsurprisingly, classifiers in Southeast Asia are organized around the basic semantic domains of the different systems of nominal classification. Although the system of so-called ‘numeral’ classifiers, whose primary function is to quantify referents, is the best known and most widespread in Southeast Asia, classifiers can encode various functions according to the syntactic constructions in which they appear. In some languages, these morphemes compete with class terms, a second nominal classification system. Sometimes the same form may belong to several paradigms, thus recalling a well-known characteristic of Southeast Asian languages: the polyfunctionality of forms.

PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0256675
Aleksandra Urman ◽  
Justin Chun-ting Ho ◽  
Stefan Katz

The online messaging app Telegram has grown in popularity in recent years, surpassing Twitter and Snapchat in the number of monthly active users in late 2020. The messenger has also been crucial to protest movements in several countries in 2019-2020, including Belarus, Russia and Hong Kong. Yet, to date, only a few studies have examined online activities on Telegram, and none have analyzed the platform with regard to protest mobilization. In the present study, we address this gap by examining Telegram-based activities related to the 2019 protests in Hong Kong. With this paper, we aim to provide an example of methodological tools that can be used to study protest mobilization and coordination on Telegram. We also contribute to the research on computational text analysis in Cantonese, one of the low-resource Asian languages, as well as to the scholarship on the Hong Kong protests and research on social media-based protest mobilization in general. For that, we rely on data collected through Telegram’s API and a combination of network analysis and computational text analysis. We find that the Telegram-based network was cohesive, ensuring efficient spread of protest-related information. Content spread through Telegram predominantly concerned discussions of future actions and protest-related on-site information (i.e., police presence in certain areas). We find that the Telegram network was dominated by different actors in each month of the observation period, suggesting the absence of a single leader. Further, traditional protest leaders, those prominent during the 2014 Umbrella Movement, such as media and civic organisations, were less prominent in the network than local communities. Finally, we observe a cooldown in the level of Telegram activity after the enactment of the harsh National Security Law in July 2020. Further investigation is necessary to assess the persistence of this effect in the long term.

2021 ◽  
Mary Burke

Language archives connect users such as language communities, linguists, and other researchers, to language data. As the language archiving community develops, concerns have been raised about the ethics, ownership, accessibility, and context of archival materials. While there are no simple solutions to these questions, many language archives are seeking ways to involve language community members in these conversations as they continue. This presentation describes a pilot project undertaken at the Computational Resource for South Asian Languages (CoRSAL) which explores a collaborative archiving approach to enable language community members to tell their own stories by adding contextual information to archival materials.
