multilingual data
Recently Published Documents


Data ◽  
2022 ◽  
Vol 7 (1) ◽  
pp. 8
Author(s):  
Muhammad Imran ◽  
Umair Qazi ◽  
Ferda Ofli

As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important for understanding the emergent situation, yet such data are difficult to obtain. The widespread usage of social networking sites, especially during mass convergence events such as health emergencies, provides instant access to citizen-generated data offering rich information about public opinions, sentiments, and situational updates that authorities can use to gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, a gender identification approach is proposed to infer user gender. Furthermore, a geolocalization approach is devised to geotag tweets at country, state, county, and city granularities, enabling a myriad of data analysis tasks to understand real-world issues at national and sub-national levels. We believe this multilingual data with broader geographical and longer temporal coverage will be a cornerstone for researchers to study impacts of the ongoing global health catastrophe and to manage adverse consequences related to people’s health, livelihood, and social well-being.


2021 ◽  
Vol 1 (1) ◽  
pp. 5-18
Author(s):  
Anastasia A. Atabekova

Language management refers to state administrative regulations, policies, and activities concerning language use within educational, legal, and other public domains, and to the scientific discipline that studies this phenomenon. We argue that during the COVID-19 health emergency, the concept of language management might need revision, as new topics and contexts have come to light within the discussion on language use amid the current pandemic. We explore key dimensions of how this discussion is represented in public communication, identify the language-use-related topics mentioned in it, and study its levels and major actors. Texts were selected from the official sites of international organizations, national governments, public and non-profit social agencies, and mass media. The corpus of 238 sources, with a total of 193,478 words, was subjected to manual and computer-based thematic content coding and clustering. The results reveal language-use-related topics within the information and discussion topics during COVID-19, specify the levels at which these topics are discussed, and outline the actors who initiate, take part in, or form the target audience of the discussion on language use during COVID-19. The research also leads to the conclusion that several issues are critically important: the style of addresses by international and national leaders, the production and timeliness of multilingual data on the pandemic, countermeasures against misinformation and anti-nation bias, and the development of protocols for the use of fact-based rational language. These items are considered the key components of a language management framework for policy and actions that need a coordinated interagency response within local and global contexts during COVID-19.


2021 ◽  
Vol 7 ◽  
pp. e775
Author(s):  
Malik Daler Ali Awan ◽  
Nadeem Iqbal Kajla ◽  
Amnah Firdous ◽  
Mujtaba Husnain ◽  
Malik Muhammad Saad Missen

The real-time availability of the Internet has engaged millions of users around the world. Regional languages are increasingly preferred for effective and easy communication, which produces multilingual data on social networks and news channels. People use regional (local) languages on social media to share ideas, opinions, and events happening globally, e.g., sports, inflation, protests, explosions, and sexual assault. Extraction and classification of events from multilingual data have become bottlenecks because of a lack of resources. In this research paper, we present the event classification task for Urdu-language text on social media and news channels using machine learning classifiers. The dataset contains more than 100,000 (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (tf-idf) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, the aforementioned features have not previously been applied to event extraction from text written in the Urdu language.
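
The feature set named in the abstract above (title, its length, last four words, weighted with tf-idf) could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names `event_features` and `tf_idf` are hypothetical, length is assumed to be counted in whitespace-separated words, and the tf-idf variant (relative term frequency times the natural log of N/df, no smoothing) is one common textbook form that the abstract does not specify.

```python
import math
from collections import Counter


def event_features(title: str) -> dict:
    """Hypothetical extractor for the three features named in the abstract."""
    words = title.split()  # assumption: whitespace tokenization
    return {
        "title": title,
        "length": len(words),               # assumption: length in words
        "last_four": " ".join(words[-4:]),  # last four words (fewer if short)
    }


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Unsmoothed tf-idf: tf = count / doc length, idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append(
            {term: (c / total) * math.log(n_docs / df[term])
             for term, c in counts.items()}
        )
    return vectors
```

The resulting sparse vectors could then be passed to any classifier; the abstract names Random Forest and K-Nearest Neighbor, which are not reproduced here.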


2021 ◽  
Vol 24 (5) ◽  
pp. 756-769
Author(s):  
Зинаида Владимировна Апанович

International and Russian-language data sources that provide information about Russian research-related organizations are considered. It is demonstrated that Russian-language data sources contain more information about Russian research-related organizations than most international data sources, but this information remains unavailable for English-language data sources. Experiments on comparison and integration of information about Russian research organizations in international and Russian data sources are outlined. Data sources such as GRID, Russian and English chapters of Wikipedia, Wikidata and eLIBRARY.ru are considered. The work is an intermediate step towards the creation of an open and extensible knowledge graph.


Author(s):  
Asoke Nath ◽  
Debapriya Kandar ◽  
Rahul Gupta

In recent times, with the rise of the internet, everyone is bombarded with large volumes of information and data from various sources such as websites, blogs and articles, social media posts and comments, and e-news portals. Most of this data is unstructured. In this paper, the authors explore the efficiency of the cross-lingual BERT model, i.e., M-BERT, for text classification and named entity extraction on multilingual data. The authors use datasets in three different languages, namely French, German, and Portuguese, to evaluate the model's performance.


2021 ◽  
Author(s):  
Linlin Liu ◽  
Bosheng Ding ◽  
Lidong Bing ◽  
Shafiq Joty ◽  
Luo Si ◽  
...  

2021 ◽  
Author(s):  
Vibhu Bhatia ◽  
Vidya Prasad Akavoor ◽  
Sejin Paik ◽  
Lei Guo ◽  
Mona Jalal ◽  
...  

2021 ◽  
Author(s):  
Zinaida Vladimirovna Apanovich

Information about research organizations is an important attribute that enables identifying the authors of scientific publications, analyzing the geographical distribution of publications, and assessing how geographic factors affect the citation of publications. Unfortunately, information on national research-related organizations is often incomplete or distorted in international databases. This applies, in particular, to Russian research organizations represented in English-language databases. The paper presents experiments on matching and integrating data about Russian research organizations across multilingual data sources. Data sources such as GRID, Wikipedia, Wikidata and eLIBRARY.ru are considered.

