Building the Tatar-Russian NMT System Based on Re-translation of Multilingual Data

Author(s):  
Aidar Khusainov ◽  
Dzhavdet Suleymanov ◽  
Rinat Gilmullin ◽  
Ajrat Gatiatullin
Author(s):  
Jan Žižka ◽  
František Dařena

Gaining new clients or customers and keeping existing ones can be well supported by collecting and monitoring feedback: “Are the customers satisfied? Can we improve our services?” One possible form of feedback is to let customers freely write reviews using a simple text form. The more reviews that are available, the more knowledge can be acquired and applied to improving the service. However, the very large volume of data generated by collecting reviews has to be processed automatically, as humans usually cannot manage it within an acceptable time. The main question is “Can a computer reveal the core opinion hidden in text reviews?” It is a challenging task because the text is written in a natural language. This chapter presents a method based on the automatic extraction of expressions that are significant for specifying a review's attitude to a given topic. The significant expressions are composed using significant words revealed in the documents. The significant words are selected by a decision-tree generator based on entropy minimization. Words included in the branches represent kernels of the significant expressions. The full expressions are composed of the significant words and the words surrounding them in the original documents. The results are demonstrated using large real-world multilingual data representing customers' opinions concerning hotel accommodation booked online and Internet shopping. Knowledge discovered in the reviews may subsequently serve various marketing tasks.
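
The word-selection idea in this abstract can be illustrated with a minimal sketch. It assumes that "entropy minimization" means choosing the word whose presence or absence most reduces label entropy (information gain), and it evaluates only a single split rather than growing a full decision tree. The toy reviews, the labels, and the function names `entropy`, `information_gain`, and `expressions` are hypothetical and do not come from the chapter.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Entropy reduction from splitting the reviews on presence of `word`."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc.split()]
    without = [lab for doc, lab in zip(docs, labels) if word not in doc.split()]
    n = len(labels)
    remainder = (len(with_word) / n) * entropy(with_word) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - remainder


def expressions(docs, word, window=1):
    """Return `word` with up to `window` neighbouring words on each side."""
    found = []
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                found.append(" ".join(tokens[max(0, i - window): i + window + 1]))
    return found


# Hypothetical toy reviews (assumption: not data from the chapter).
reviews = [
    "room was very clean",
    "staff was not friendly",
    "clean and quiet",
    "room was not clean",
]
labels = ["pos", "neg", "pos", "neg"]

# Pick the single most informative word, then expand it with its context.
vocabulary = sorted({w for r in reviews for w in r.split()})
best = max(vocabulary, key=lambda w: information_gain(reviews, labels, w))
print(best, expressions(reviews, best))
```

In this toy data the word "not" separates the two classes perfectly, so it becomes the kernel, and the surrounding words turn it into expressions such as "was not clean". A real system would repeat the split recursively and tune the context window.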


2021 ◽  
Vol 1 (1) ◽  
pp. 5-18
Author(s):  
Anastasia A. Atabekova

Language management refers to state administrative regulations, policies, and activities on language use within educational, legal, and other public domains, and to the scientific discipline that studies this phenomenon. We argue that during the COVID-19 health emergency the concept of language management might need revision, as new topics and contexts have come to light in the discussion on language use amid the current pandemic. We explore key dimensions of how this discussion is represented in public communication, identify the language-use-related topics mentioned in it, and study its levels and major actors. Texts from the official sites of international organizations, national governments, public and non-profit social agencies, and mass media were selected. The corpus of 238 sources with a total of 193,478 words was subjected to manual and computer-based thematic content coding and clustering. The results reveal language-use-related topics within the information and discussion topics during COVID-19, specify the levels at which these topics are discussed, and outline the actors who initiate, take part in, or form the target audience of the discussion on language use during COVID-19. The research also leads to the conclusion that the following issues are critically important: the style of international and national leadership addresses, the production and timeliness of multilingual data on the pandemic, countermeasures against misinformation and anti-nation bias, and the development of protocols for the use of fact-based rational language. These items are considered the key components of a language management framework for policy and actions, which need a coordinated interagency response within local and global contexts during COVID-19.


2021 ◽  
Vol 7 ◽  
pp. e775
Author(s):  
Malik Daler Ali Awan ◽  
Nadeem Iqbal Kajla ◽  
Amnah Firdous ◽  
Mujtaba Husnain ◽  
Malik Muhammad Saad Missen

The real-time availability of the Internet has engaged millions of users around the world. Regional languages are preferred for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening globally (e.g., sports, inflation, protests, explosions, and sexual assault) in regional (local) languages on social media. Extraction and classification of events from multilingual data have become bottlenecks because of a lack of resources. In this research paper, we present the event classification task for Urdu-language text on social media and news channels using machine learning classifiers. The dataset contains more than 0.1 million (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (tf-idf) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, the aforementioned features have not previously been applied to event extraction from text written in the Urdu language.
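
As a rough illustration of pairing tf-idf feature vectors with a nearest-neighbour classifier, the sketch below uses only the Python standard library. It is not the authors' pipeline: the English toy headlines stand in for Urdu text, cosine similarity is an assumed distance measure, the title-length and last-four-words features are omitted, and the helper names `fit_idf`, `vectorize`, `cosine`, and `predict` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Inverse document frequency log(N / df) for every training term."""
    n = len(train_docs)
    df = Counter(term for doc in train_docs for term in set(doc.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(doc, idf):
    """Sparse tf-idf vector; terms unseen during training are ignored."""
    tokens = doc.split()
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(query, train_docs, train_labels, idf, k=1):
    """Majority label among the k training documents most similar to query."""
    q = vectorize(query, idf)
    ranked = sorted(
        zip(train_docs, train_labels),
        key=lambda pair: cosine(q, vectorize(pair[0], idf)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy headlines (assumption: English stand-ins for Urdu text).
train_docs = [
    "team wins cricket match",
    "cricket final match today",
    "prices rise as inflation grows",
    "inflation pushes food prices higher",
]
train_labels = ["sports", "sports", "inflation", "inflation"]

idf = fit_idf(train_docs)
print(predict("cricket match postponed", train_docs, train_labels, idf))
print(predict("food prices", train_docs, train_labels, idf))
```

Terms that occur in fewer training documents get a larger weight, so a rare but topical word such as "food" dominates the similarity score. On a dataset of the size reported above, a library implementation with sparse matrices would be the practical choice.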


Author(s):  
Samuel Thomas ◽  
Kartik Audhkhasi ◽  
Jia Cui ◽  
Brian Kingsbury ◽  
Bhuvana Ramabhadran

Author(s):  
Barbara E. Bullock ◽  
Almeida Jacqueline Toribio ◽  
Jacqueline Serigos ◽  
Gualberto Guzmán

Data ◽  
2022 ◽  
Vol 7 (1) ◽  
pp. 8
Author(s):  
Muhammad Imran ◽  
Umair Qazi ◽  
Ferda Ofli

As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important for understanding the emergent situation, but such data are difficult to obtain. The widespread usage of social networking sites, especially during mass convergence events such as health emergencies, provides instant access to citizen-generated data offering rich information about public opinions, sentiments, and situational updates that authorities can use to gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, a gender identification approach is proposed to distinguish users by gender. Furthermore, a geolocalization approach is devised to geotag tweets at country, state, county, and city granularities, enabling a myriad of data analysis tasks to understand real-world issues at national and sub-national levels. We believe this multilingual data, with its broader geographical and longer temporal coverage, will be a cornerstone for researchers studying the impacts of the ongoing global health catastrophe and managing adverse consequences related to people’s health, livelihood, and social well-being.


2021 ◽  
Vol 24 (5) ◽  
pp. 756-769
Author(s):  
Зинаида Владимировна Апанович

International and Russian-language data sources that provide information about Russian research-related organizations are considered. It is demonstrated that Russian-language data sources contain more information about Russian research-related organizations than most international data sources, but this information remains unavailable in English-language data sources. Experiments on comparing and integrating information about Russian research organizations in international and Russian data sources are outlined. Data sources such as GRID, the Russian and English editions of Wikipedia, Wikidata, and eLIBRARY.ru are considered. The work is an intermediate step towards the creation of an open and extensible knowledge graph.


Pragmatics ◽  
2001 ◽  
Vol 11 (3) ◽  
pp. 285-307
Author(s):  
Shi-xu ◽  
Manfred Kienpointner

Discourse and communication approaches to culture have traditionally been concerned with the role of language in (mis)representing cultures. But how text and talk reproduce and transform cultures is just beginning to be understood. Proceeding from the view that cultural creation, development, and transformation are constituted in and through situated discursive practice, this study explores the interconnections between argumentative discourse and cultural reproduction. The research is based on multinational and multilingual data of journalistic communication on Hong Kong’s historic transition. It is shown that the causes of Hong Kong’s economic success, as an important cultural feature, are used as arguments to undermine contrary claims. It is also revealed that the future development of Hong Kong is being constrained by the argumentum ad baculum. In addition, it is observed that Hong Kong’s identities are used as bases for prescribing a desired course of action. Finally, these argumentative strategies are re-examined in their broader historical and cultural context in order to show how Hong Kong’s past, present, and future are cultural realities bound up with Western desire and power.

