Creating Paraphrase Identification Corpus for Indian Languages

In recent times, the paraphrase identification task has attracted the attention of the research community. A paraphrase is a phrase or sentence that conveys the same information using different words or a different syntactic structure. The Microsoft Research Paraphrase Corpus (MSRP) is a well-known, openly available paraphrase corpus for English. At the time of writing, no such publicly available paraphrase corpus existed for any Indian language. This chapter explains the creation of a paraphrase corpus for Hindi, Tamil, Malayalam, and Punjabi, the first publicly available corpus of its kind for any Indian language. It was used in the shared task on Detecting Paraphrases in Indian Languages (DPIL), held in conjunction with the Forum for Information Retrieval & Evaluation (FIRE) 2016. The annotation was performed by a postgraduate student, followed by two-step proofreading by a linguist and a language expert.

NATIVE LANGUAGE IDENTIFICATION FOR RUSSIAN USING ERRORS TYPES

Computational Linguistics and Intellectual Technologies ◽

10.28995/2075-7182-2020-19-1123-1133 ◽

2020 ◽

Author(s):

N. V. Remnev ◽

Keyword(s):

Research Group ◽

English Language ◽

Native Language ◽

Language Identification ◽

Data Driven ◽

Shared Task ◽

Identification Task ◽

Learner Corpus ◽

Russian Speakers ◽

Russian Research

Native Language Identification (NLI) is the task of automatically recognizing an author's native language (L1) from texts written in a language that is not native to the author. The NLI task has been studied in detail for English, and two shared tasks were conducted in 2013 and 2017, where TOEFL English essays and essay samples were used as data. A small number of works also address the NLI problem for other languages. For Russian, the NLI problem was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of well-established approaches from the NLI Shared Task 2013 and 2017 competitions to recognize the author's native language, as well as to recognize the type of speaker: learners of Russian or heritage Russian speakers. The native language identification task is also addressed using the types of errors specific to different languages. The study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on which the experiments are conducted.

Assorted Sentiment Model for Publically Available Page of Facebook

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b7739.019320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 1160-1167

Keyword(s):

Social Network ◽

Sentiment Analysis ◽

Business Strategy ◽

English Language ◽

Learning Algorithm ◽

Indian Languages ◽

Analysis Model ◽

Indian Language ◽

The Social ◽

Bayes Algorithm

E-commerce companies maintain public pages on social networking sites (Facebook, Twitter, etc.) to improve their business strategy. They gauge the public mood about a page from its total likes, its total shares, and the sentiment of all comments on it. Similarly, celebrities maintain public pages on social networking sites to increase their fame. We have developed an assorted model for publicly available Facebook pages. This assorted model combines a data extractor model, a language converter and cleaning model, and a sentiment analyzer model. The data extractor model extracts the comments on all posts of a public Facebook page in a short span of time. The language converter and cleaning model converts text written in different Indian languages into English, after which the English text is cleaned by the cleaning model. The language converter is built on the CILTEL model, which converts comments written in Indian languages into English. The cleaning model cleans all the comments on all the posts of the Facebook page. Finally, the sentiment extraction model extracts the sentiment of all the comments on the Facebook page. To check the performance of our sentiment analysis model, we implemented classification using three machine learning algorithms: naïve Bayes, perceptron, and Rocchio. Our assorted sentiment analysis model is beneficial to users such as the marketing industry, political parties, and celebrities.
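
The cleaning and classification stages described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `clean_comment` and `NaiveBayes`, the cleaning rules, and the toy training comments are all assumptions made for illustration, and the data extraction and CILTEL conversion stages are not shown. The classifier is a plain multinomial naïve Bayes with add-one smoothing.

```python
import math
import re
from collections import Counter, defaultdict


def clean_comment(text):
    """Hypothetical cleaning step: lowercase, drop URLs and non-letters, tokenize."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of comments
        self.vocab = set()

    def fit(self, comments, labels):
        for comment, label in zip(comments, labels):
            tokens = clean_comment(comment)
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, comment):
        tokens = clean_comment(comment)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy comments, assumed to be already converted to English.
train_comments = [
    "great product, love it",
    "excellent service and fast delivery",
    "terrible quality, very bad",
    "worst purchase, bad service",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_comments, train_labels)
print(model.predict("love the fast delivery"))
print(model.predict("bad quality"))
```

Working in log space avoids numeric underflow on long comments, and the add-one smoothing keeps a word unseen for one label from zeroing out that label's score.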

A Survey on “Machine translation Approaches for Indian Languages”

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i3.1941 ◽

2021 ◽

Vol 12 (3) ◽

pp. 4792-4794

Author(s):

Pooja P. Walke et al.

Keyword(s):

Machine Translation ◽

English Language ◽

Indian Philosophy ◽

Literature Survey ◽

Target Language ◽

Indian Languages ◽

Indian Language ◽

Language Generation ◽

Source Language ◽

Translation Systems

Translation has always helped to knit Indians together with respect to India's rich culture and literature. Ideas and concepts like ‘Indian ancient literature’, ‘Indian rich culture’, ‘Indian philosophy’ and ‘Indian knowledge systems’ would have been impossible in the absence of translations, with their natural integrationist mission. Machine translation helps to translate information presented in one language into another language. Information can be present in the form of text, speech, or images; translating it helps in sharing information and, ultimately, in gaining information. Translation is an extremely complex and challenging process. It requires in-depth knowledge of the grammar of both languages, i.e. the source language and the target language, in order to frame the rules for target language generation. Marathi is a regional Indian language with a large body of literature that could be useful if rendered in the universal English language. As manual translation is a tedious task, we present a literature survey of machine translation systems that translate Indian languages into English using various machine translation approaches, such as rule-based (RBMT), statistical (SMT), neural (NMT), and hybrid translation.

Racing Indian Language, Languaging an Indian Race

The Oxford Handbook of Language and Race ◽

10.1093/oxfordhb/9780190845995.013.20 ◽

2020 ◽

pp. 367-397

Author(s):

Barbra A. Meek

Keyword(s):

Native American ◽

The Other ◽

Indian Languages ◽

Indian Language ◽

Popular Narrative ◽

American Heritage ◽

The One ◽

Indian Blood

This chapter is an exploration of how race and language become entangled in representations and ideas about what it means to be seen and recognized as Native American. Most conceptions of Indianness derive from scholarly European-derived representations and evaluations and from popular narrative media, the one often bootstrapping the other. In tandem, these public manifestations perpetuate the racialization of Indian languages and of Indianness, most ubiquitously in and through a discourse of “blood.” Several ideologies configure the racial logic that determines Indianness: purism (percentage of “Indian blood”), visibility (racialized, and cultural, manifestations of “blood”), continuity (maintenance of a pre-contact “bloodline”), and primitivism (expression of indigenous “blood” in and through language). I argue that this “ideological assemblage” (Kroskrity 2018) undergirds the processes of “racing Indian language(s)” and “languaging an Indian race” (H. Samy Alim 2016) that have resulted in propagating conflicts over and denials of Native American heritage.

Report on the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries at SIGIR 2019

ACM SIGIR Forum ◽

10.1145/3458553.3458554 ◽

2019 ◽

Vol 53 (2) ◽

pp. 3-10

Author(s):

Muthu Kumar Chandrasekaran ◽

Philipp Mayr

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Research And Development ◽

Language Processing ◽

Digital Libraries ◽

State Of The Art ◽

Shared Task ◽

Processing Information ◽

Joint Workshop

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.

The constituent object parser: syntactic structure matching for information retrieval

ACM SIGIR Forum ◽

10.1145/75335.75348 ◽

1989 ◽

Vol 23 (SI) ◽

pp. 117-126

Author(s):

D. P. Metzler ◽

S. W. Haas

Keyword(s):

Information Retrieval ◽

Syntactic Structure

A comparative review of the challenges encountered in sentiment analysis of Indian regional language tweets vs English language tweets

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.21.12394 ◽

2018 ◽

Vol 7 (2.21) ◽

pp. 319

Author(s):

Saini Jacob Soman ◽

P Swaminathan ◽

R Anandan ◽

K Kalaivani

Keyword(s):

Sentiment Analysis ◽

Life Events ◽

Social Networking Sites ◽

Positive Emotions ◽

English Language ◽

Indian Languages ◽

Regional Language ◽

Comparative Review ◽

Computational Research ◽

Regional Languages

With the growing use of online media for sharing views, sentiments, and opinions about products, services, organizations, and people, microblogging and social networking sites are gaining huge popularity. Twitter, one of the biggest social media sites, is used by many people to share their life events, views, and opinions about different areas and concepts. Sentiment analysis is the computational study of reviews, opinions, attitudes, views, and people's emotions about different products, services, firms, and topics, categorizing them as negative or positive. Sentiment analysis of tweets is a challenging task. This paper critically reviews and compares the challenges associated with sentiment analysis of tweets in English versus Indian regional languages. Five Indian languages, namely Tamil, Malayalam, Telugu, Hindi, and Bengali, are considered, and several challenges associated with analyzing Twitter sentiment in those languages are identified through a systematic review and conceptualized in the form of a framework.

The constituent object parser: syntactic structure matching for information retrieval

Proceedings of the 12th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '89 ◽

10.1145/75334.75348 ◽

1989 ◽

Author(s):

D. P. Metzler ◽

S. W. Haas

Keyword(s):

Information Retrieval ◽

Syntactic Structure

Root Based Stemmer for Telugu Script

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f8734.088619 ◽

2019 ◽

Vol 8 (6) ◽

pp. 2565-2568

Keyword(s):

Information Retrieval ◽

Indian Languages

This paper proposes a new stemmer, named the “Root based stemmer”. The stemmer is strictly based on Dravidian script. Stemming can be used to improve the effectiveness of information retrieval. In the proposed root-based stemming technique, each token is compared against the words of a valid root-word dictionary until a match is found. The matched string or substring is then extracted from the token and identified as a valid root. The present work aims to build a dictionary-based stemmer that extracts valid root words for Indian languages, especially Telugu, and to compare the results with existing stemmers.
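
The dictionary lookup described in this abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's algorithm: the function name `find_root`, the longest-root-first matching order, and the placeholder Latin-script dictionary are assumptions, and a real system would use a Telugu root-word dictionary.

```python
def find_root(token, root_dictionary):
    """Return the first dictionary root found inside the token, or None.

    Roots are tried longest first (an assumption of this sketch) so that a
    longer root is preferred over a shorter one it contains.
    """
    for root in sorted(root_dictionary, key=len, reverse=True):
        if root in token:  # matches the whole token or any substring of it
            return root
    return None


# Placeholder dictionary, for illustration only.
ROOTS = {"walk", "read", "write"}

for token in ["walking", "reader", "rewrite", "xyz"]:
    print(token, "->", find_root(token, ROOTS))
```

Substring matching follows the abstract's "matched string or substring" wording. Restricting matches to prefixes or suffixes would be a stricter variant.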

Mesothelioma: one disease, many definitions. Comparative linguistic analysis in a “health literacy” perspective

Working Paper of Public Health ◽

10.4081/wpph.2020.9242 ◽

2020 ◽

Vol 8 (1) ◽

Author(s):

Antonio Maconi ◽

Mariateresa Dacquino ◽

Federica Viazzi ◽

Emanuela Bovo ◽

Federica Grosso ◽

...

Keyword(s):

Comparative Analysis ◽

Health Literacy ◽

English Language ◽

Syntactic Structure ◽

Basic Knowledge ◽

Patient Portal ◽

Medical Field ◽

Definition Of ◽

The Uk ◽

Medical Concepts

Objectives: The aim of this paper is to demonstrate how, while remaining within a specific field such as medicine, it is possible to use different languages depending on the target audience (doctors, professionals from other fields or patients) in order to improve its degree of health literacy. In particular, the aim is to show how even the definition of a disease, which should in principle be unambiguous, can in fact be linguistically adapted to the reader's basic knowledge. Methodology: Five definitions of mesothelioma are examined, analysed lexically, syntactically and graphically. Specifically, this comparison is made on three main levels, which in turn have different nuances: popular, including definitions from Wikipedia and the UK Mesothelioma patient portal; intermediate, corresponding to the Collins English language dictionary; and specialist, with definitions from the MeSH thesaurus and the Orphanet database. Results: At the end of the comparative analysis, it is possible to state that in linguistic and Health Literacy terms there is no single definition for this rare disease but as many definitions as there are targets. In particular, they vary in syntactic structure, graphic form and vocabulary, as they have to use technicalities typical of the medical field but have different nuances of complexity. Conclusion: A comparison of the definitions shows that the degree of readability does not always correspond to that of comprehensibility. The analysis demonstrates that it is difficult to explain complex medical concepts to practitioners and patients in a simple, clear and usable way and that this requires specific techniques of Health Literacy, related to both the linguistic and graphic aspects. The comparison of definitions is therefore a methodological premise for the creation of brochures dedicated to mesothelioma and the revision of the "Mai soli" site for mesothelioma patients.
