Multilingual Data: Latest Research Papers

TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels

As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important for understanding the emergent situation, but such data are difficult to obtain. The widespread use of social networking sites, especially during mass convergence events such as health emergencies, provides instant access to citizen-generated data offering rich information about public opinions, sentiments, and situational updates that authorities can use to gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, a gender identification approach is proposed to segregate user gender. Furthermore, a geolocalization approach is devised to geotag tweets at country, state, county, and city granularities, enabling a myriad of data analysis tasks to understand real-world issues at national and sub-national levels. We believe this multilingual data with broader geographical and longer temporal coverage will be a cornerstone for researchers to study the impacts of the ongoing global health catastrophe and to manage adverse consequences related to people’s health, livelihood, and social well-being.

Might Covid-19 Require Revision of Language Management?

Human Language, Rights, and Security ◽ 10.22363/2713-0614-2021-1-1-5-18 ◽ 2021 ◽ Vol 1 (1) ◽ pp. 5-18

Author(s): Anastasia A. Atabekova

Keyword(s): Language Use ◽ Scientific Discipline ◽ Language Management ◽ Non Profit ◽ Management Framework ◽ Computer Based ◽ Multilingual Data ◽ Key Dimensions ◽ Rational Language ◽ National Governments

Language management refers to state administrative regulations, policies, and activities concerning language use within educational, legal, and other public domains, and to the scientific discipline that studies this phenomenon. We argue that during the COVID-19 health emergency, the concept of language management might need revision, as new topics and contexts have come to light within the discussion on language use amid the current pandemic. We explore key dimensions of how this discussion is represented in public communication, identify the language-use-related topics mentioned in it, and study its levels and major actors. Texts from the official sites of international organizations, national governments, public and non-profit social agencies, and mass media were selected. The corpus of 238 sources with a total of 193,478 words was subject to manual and computer-based thematic content coding and clustering. The results reveal language-use-related topics within the information and discussion on COVID-19, specify the levels at which these topics are discussed, and outline the actors who initiate the discussion on language use during COVID-19, take part in it, or form its target audience. The research also leads to the conclusion that the following issues are critically important: the style of addresses by international and national leadership, the production and timeliness of multilingual data on the pandemic, countermeasures against misinformation and anti-nation bias, and the development of protocols for the use of fact-based rational language. These items are considered the key components of a language management framework for policy and actions that need a coordinated interagency response within local and global contexts during COVID-19.

Event classification from the Urdu language text on social media

PeerJ Computer Science ◽ 10.7717/peerj-cs.775 ◽ 2021 ◽ Vol 7 ◽ pp. e775

Author(s): Malik Daler Ali Awan ◽ Nadeem Iqbal Kajla ◽ Amnah Firdous ◽ Mujtaba Husnain ◽ Malik Muhammad Saad Missen

Keyword(s): Machine Learning ◽ Social Media ◽ Nearest Neighbor ◽ Event Extraction ◽ K Nearest Neighbor ◽ Event Classification ◽ Machine Learning Classifiers ◽ Learning Classifiers ◽ Multilingual Data ◽ Language Text

The real-time availability of the Internet has engaged millions of users around the world. Regional languages are preferred for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening globally (e.g., sports, inflation, protests, explosions, and sexual assault) in regional (local) languages on social media. Extraction and classification of events from multilingual data have become bottlenecks because of a lack of resources. In this research paper, we present the event classification task for Urdu-language text on social media and news channels using machine learning classifiers. The dataset contains more than 0.1 million (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (tf-idf) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, the aforementioned features have not previously been applied to event extraction from text written in the Urdu language.

Information about Russian Research Organizations in Multilingual Data Sources

Russian Digital Libraries Journal ◽ 10.26907/1562-5419-2021-24-5-756-769 ◽ 2021 ◽ Vol 24 (5) ◽ pp. 756-769

Author(s): Zinaida Vladimirovna Apanovich

Keyword(s): English Language ◽ Russian Language ◽ Data Sources ◽ Knowledge Graph ◽ Intermediate Step ◽ Research Organizations ◽ Language Data ◽ International Data ◽ Multilingual Data ◽ Russian Research

International and Russian-language data sources that provide information about Russian research-related organizations are considered. It is demonstrated that Russian-language data sources contain more information about Russian research-related organizations than most international data sources, but this information remains unavailable in English-language data sources. Experiments on comparing and integrating information about Russian research organizations in international and Russian data sources are outlined. Data sources such as GRID, the Russian and English editions of Wikipedia, Wikidata, and eLIBRARY.ru are considered. The work is an intermediate step towards the creation of an open and extensible knowledge graph.

A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages

10.21437/interspeech.2021-1565 ◽ 2021

Author(s): Phat Do ◽ Matt Coler ◽ Jelske Dijkstra ◽ Esther Klabbers

Keyword(s): Systematic Review ◽ Text To Speech ◽ Low Resource ◽ Multilingual Data

An Efficient Cross-Lingual BERT Model for Text Classification and Named Entity Extraction in Multilingual Dataset

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽ 10.32628/cseit217353 ◽ 2021 ◽ pp. 280-286

Author(s): Asoke Nath ◽ Debapriya Kandar ◽ Rahul Gupta

Keyword(s): Social Media ◽ Text Classification ◽ Model Performance ◽ The Internet ◽ Entity Extraction ◽ Named Entity ◽ Named Entity Extraction ◽ Multilingual Data ◽ The Cross ◽ Cross Lingual

With the rise of the internet, people are inundated with information and data from various sources such as websites, blogs and articles, social media posts and comments, and e-news portals. Most of this data is unstructured. In this paper, the authors explore the efficiency of the cross-lingual BERT model, i.e., M-BERT, for text classification and named entity extraction on multilingual data. The authors used datasets in three languages, namely French, German, and Portuguese, to evaluate model performance.

MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER

10.18653/v1/2021.acl-long.453 ◽ 2021

Author(s): Linlin Liu ◽ Bosheng Ding ◽ Lidong Bing ◽ Shafiq Joty ◽ Luo Si ◽ ...

Keyword(s): Data Augmentation ◽ Low Resource ◽ Multilingual Data ◽ Cross Lingual

OpenFraming: Open-sourced Tool for Computational Framing Analysis of Multilingual Data

10.18653/v1/2021.emnlp-demo.28 ◽ 2021

Author(s): Vibhu Bhatia ◽ Vidya Prasad Akavoor ◽ Sejin Paik ◽ Lei Guo ◽ Mona Jalal ◽ ...

Keyword(s): Framing Analysis ◽ Multilingual Data

Matching and integration of data about Russian research organizations from multilingual data sources

10.20948/abrau-2021-13 ◽ 2021

Author(s): Zinaida Vladimirovna Apanovich

Keyword(s): Geographical Distribution ◽ English Language ◽ Data Sources ◽ Scientific Publications ◽ Data Matching ◽ Research Organizations ◽ Geographic Factor ◽ Multilingual Data ◽ The Impact ◽ Russian Research

Information about research organizations is an important attribute that enables identifying the authors of scientific publications, analyzing the geographical distribution of publications, and assessing how a geographic factor affects the citation of publications. Unfortunately, information on national research-related organizations is often incomplete or distorted in international databases. This applies, in particular, to Russian research organizations represented in English-language databases. The paper presents experiments on matching and integrating data about Russian research organizations from multilingual data sources. Data sources such as GRID, Wikipedia, Wikidata, and eLIBRARY.ru are considered.

Dataset Creation from Multilingual Data of Social Media: Challenges and Consequences

2020 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE) ◽ 10.1109/wiecon-ece52138.2020.9398002 ◽ 2020

Author(s): Mohammad Aman Ullah ◽ Norhidayah Azman ◽ Zulkifly Mohd Zaki ◽ Md. Monirul Islam

Keyword(s): Social Media ◽ Multilingual Data
