Creating Paraphrase Identification Corpus for Indian Languages

Author(s):  
Anand Kumar M. ◽  
Shivkaran Singh ◽  
Praveena Ramanan ◽  
Vaithehi Sinthiya ◽  
Soman K. P.

In recent times, the paraphrase identification task has attracted the attention of the research community. A paraphrase is a phrase or sentence that conveys the same information using different words or a different syntactic structure. The Microsoft Research Paraphrase Corpus (MSRP) is a well-known, openly available paraphrase corpus for English. Until now, no such publicly available paraphrase corpus has existed for any Indian language. This chapter explains the creation of a paraphrase corpus for Hindi, Tamil, Malayalam, and Punjabi, the first publicly available corpus of its kind for any Indian language. It was used in the shared task on Detecting Paraphrases in Indian Languages (DPIL), held in conjunction with the Forum for Information Retrieval Evaluation (FIRE) 2016. The annotation was performed by a postgraduate student, followed by two-step proofreading by a linguist and a language expert.

Author(s):  
N. V. Remnev

Native Language Identification (NLI) is the task of automatically recognizing an author’s native language (L1) from a text written in a language that is not native to the author. The NLI task has been studied in detail for English, and two shared tasks were conducted in 2013 and 2017, where TOEFL English essays and essay samples were used as data. A small number of works have also addressed the NLI problem for other languages. For Russian, the problem was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of approaches that were well established in the NLI Shared Task 2013 and 2017 competitions to recognize the author’s native language, as well as the type of speaker: learners of Russian or Heritage Russian speakers. The native language identification task is also solved on the basis of the types of errors specific to different languages. This study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on which the experiments are conducted.


E-commerce businesses maintain public pages on social network sites (Facebook, Twitter, etc.) in order to improve their business strategy. They gauge the public mood about a page from its total likes, its total shares, and the sentiment of all comments on it. Similarly, celebrities maintain public pages on social network sites in order to increase their fame. We have developed an assorted model for publicly available Facebook pages. This assorted model combines a data extractor model, a language converter and cleaning model, and a sentiment analyzer model. The data extractor model extracts the comments on all posts of a public Facebook page in a short span of time. The language converter and cleaning model converts text written in different Indian languages into English, after which the English text is cleaned by the cleaning model. The language converter is built by implementing the CILTEL model, which converts comments written in Indian languages into English. The cleaning model cleans all the comments on all the posts of the Facebook page. Finally, the sentiment extraction model extracts the sentiment of all the comments on the Facebook page. To check the performance of our sentiment analysis model, we have implemented classification using three machine learning algorithms: the Naïve Bayes, perceptron, and Rocchio algorithms. Our assorted sentiment analysis model is beneficial to users such as the marketing industry, election parties, and celebrities.
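The last two stages of the pipeline described above (cleaning, then sentiment classification) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the data extractor and the CILTEL converter are not shown, the comments are assumed to be already in English, the `clean` rules and the toy training comments are hypothetical, and only the Naïve Bayes classifier (multinomial, with add-one smoothing) is covered.

```python
import math
import re
from collections import Counter, defaultdict


def clean(text):
    """Hypothetical cleaning step: lowercase, drop URLs, keep alphabetic tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z]+", text)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()              # label -> number of comments
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = clean(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = clean(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            n_tokens = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (n_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up comments used only to exercise the sketch.
TRAIN = [
    ("I love this product, great quality", "positive"),
    ("Amazing service and fast delivery", "positive"),
    ("Terrible experience, very bad support", "negative"),
    ("Worst purchase, poor quality", "negative"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
print(model.predict("great service, love it"))
print(model.predict("bad quality, poor support"))
```

In a fuller pipeline, the perceptron and Rocchio classifiers mentioned in the abstract would be trained on the same cleaned tokens so that their results could be compared.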


Author(s):  
Pooja P. Walke et al.

Translation has always helped to knit Indians together with respect to India's rich culture and literature. Ideas and concepts like ‘Indian ancient literature’, ‘Indian rich culture’, ‘Indian philosophy’ and ‘Indian knowledge systems’ would have been impossible in the absence of translations with their natural integrationist mission. Machine translation helps to translate information presented in one language into another language. Information can be present in the form of text, speech, or images; translating it supports the sharing of information and, ultimately, information gain. Translation is an extremely complex and challenging process. It requires in-depth knowledge of the grammar of both languages, i.e. the source language and the target language, to frame the rules for target language generation. Marathi is a regional Indian language with a large body of literature that could be useful if made available in English. As manual translation is a tedious task, we present a literature survey of machine translation systems that translate Indian languages into English using various machine translation approaches such as RBMT, SMT, NMT, and hybrid translation.


Author(s):  
Barbra A. Meek

This chapter is an exploration of how race and language become entangled in representations and ideas about what it means to be seen and recognized as Native American. Most conceptions of Indianness derive from scholarly European-derived representations and evaluations and from popular narrative media, the one often bootstrapping the other. In tandem, these public manifestations perpetuate the racialization of Indian languages and of Indianness, most ubiquitously in and through a discourse of “blood.” Several ideologies configure the racial logic that determines Indianness: purism (percentage of “Indian blood”), visibility (racialized—and cultural—manifestations of “blood”), continuity (maintenance of a pre-contact “bloodline”), and primitivism (expression of indigenous “blood” in and through language). I argue that this “ideological assemblage” (Kroskrity 2018) undergirds the processes of “racing Indian language(s)” and “languaging an Indian race” (H. Samy Alim 2016) that has resulted in propagating conflicts over and denials of Native American heritage.


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2018 ◽  
Vol 7 (2.21) ◽  
pp. 319
Author(s):  
Saini Jacob Soman ◽  
P Swaminathan ◽  
R Anandan ◽  
K Kalaivani

With the growing use of online media for sharing views, sentiments, and opinions about products, services, organizations, and people, microblogging and social networking sites are gaining huge popularity. Twitter, one of the biggest social media sites, is used by many people to share their life events, views, and opinions about different areas and concepts. Sentiment analysis is the computational study of reviews, opinions, attitudes, views, and people’s emotions about different products, services, firms, and topics, categorizing them as negative or positive. Sentiment analysis of tweets is a challenging task. This paper critically reviews and compares the challenges associated with sentiment analysis of tweets in English versus Indian regional languages. Five Indian languages, namely Tamil, Malayalam, Telugu, Hindi, and Bengali, are considered, and several challenges associated with the analysis of Twitter sentiment in those languages are identified through a systematic review and conceptualized in the form of a framework.


In this paper, a new stemmer, named the “Root based stemmer”, is proposed. This stemmer is strictly based on Dravidian script. Stemming can be used to improve the effectiveness of information retrieval. In the proposed root-based stemming technique, every token is compared against all the words of a dictionary of valid root words until a match is found. The matched string or substring is then extracted from the token and identified as a valid root. The present work aims to build a dictionary-based stemmer that extracts valid root words for Indian languages, especially Telugu, and to compare the results with existing stemmers.
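The dictionary lookup described above could be sketched along the following lines. This is a minimal illustration, not the authors' stemmer: the abstract does not say how competing matches are resolved, so preferring an exact match and otherwise the longest root contained in the token is an assumption made here, and the romanized entries in `ROOTS` are hypothetical toy data rather than entries from the paper's dictionary.

```python
from typing import Iterable, Optional

# Hypothetical, romanized toy dictionary; a real system would load
# valid root words in the target script.
ROOTS = ["pusta", "pustaka", "chettu"]


def root_stem(token: str, roots: Iterable[str]) -> Optional[str]:
    """Return the dictionary root matched in `token`, or None if none matches.

    Assumption: an exact match wins; otherwise the longest root that
    occurs as a substring of the token is returned.
    """
    best = None
    for root in roots:
        if not root:
            continue  # ignore empty dictionary entries
        if root == token:
            return root  # whole token is itself a valid root
        if root in token and (best is None or len(root) > len(best)):
            best = root
    return best


for tok in ["pustakalu", "chettu", "xyz"]:
    print(tok, "->", root_stem(tok, ROOTS))
```

Scanning the whole dictionary for every token is linear in the dictionary size; an index such as a trie would be a natural refinement for large dictionaries, though the abstract does not indicate which data structure the authors used.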


2020 ◽  
Vol 8 (1) ◽  
Author(s):  
Antonio Maconi ◽  
Mariateresa Dacquino ◽  
Federica Viazzi ◽  
Emanuela Bovo ◽  
Federica Grosso ◽  
...  

Objectives: The aim of this paper is to demonstrate how, while remaining within a specific field such as medicine, it is possible to use different languages depending on the target audience (doctors, professionals from other fields or patients) in order to improve its degree of health literacy. In particular, the aim is to show how even the definition of a disease, which should in principle be unambiguous, can in fact be linguistically adapted to the reader's basic knowledge. Methodology: Five definitions of mesothelioma are examined, analysed lexically, syntactically and graphically. Specifically, this comparison is made on three main levels, which in turn have different nuances: popular, including definitions from Wikipedia and the UK Mesothelioma patient portal; intermediate, corresponding to the Collins English language dictionary; and specialist, with definitions from the MeSH thesaurus and the Orphanet database. Results: At the end of the comparative analysis, it is possible to state that in linguistic and Health Literacy terms there is no single definition for this rare disease but as many definitions as there are targets. In particular, they vary in syntactic structure, graphic form and vocabulary, as they have to use technicalities typical of the medical field but have different nuances of complexity. Conclusion: A comparison of the definitions shows that the degree of readability does not always correspond to that of comprehensibility. The analysis demonstrates that it is difficult to explain complex medical concepts to practitioners and patients in a simple, clear and usable way and that this requires specific techniques of Health Literacy, related to both the linguistic and graphic aspects. The comparison of definitions is therefore a methodological premise for the creation of brochures dedicated to mesothelioma and the revision of the "Mai soli" site for mesothelioma patients.

