No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications

2018 ◽  
Vol 26 (4) ◽  
pp. 417-430 ◽  
Author(s):  
Erik de Vries ◽  
Martijn Schoonvelde ◽  
Gijs Schumacher

Automated text analysis allows researchers to analyze large quantities of text. Yet comparative researchers face a major challenge: people in different countries speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al., 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find the TDMs for both text corpora to be highly similar, with only minor differences across languages. What is more, we find considerable overlap in the sets of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, again with only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers who use bag-of-words text models.
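As a toy illustration of the document-level TDM comparison described in the abstract, the sketch below builds bag-of-words vectors for a hypothetical human ("gold standard") translation and a machine translation of the same sentence, then computes their cosine similarity and feature overlap. The sentences and whitespace tokenizer are invented for illustration; the paper's actual analysis uses the full Europarl corpus.

```python
from collections import Counter
import math

def bow(text):
    """Tokenize to lowercase word counts (a toy bag-of-words)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical sentence pair: a human ("gold standard") translation
# and a machine translation of the same source sentence.
gold = "the parliament debates the budget proposal today"
mt = "the parliament discusses the budget proposal today"

sim = cosine(bow(gold), bow(mt))
# Feature overlap: share of gold-translation terms also produced by MT.
overlap = len(bow(gold).keys() & bow(mt).keys()) / len(bow(gold))
```

Scaled up to document and corpus level, high values of `sim` and `overlap` correspond to the paper's finding that machine-translated TDMs closely track the gold-standard ones.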


2020 ◽  
Vol 39 (4) ◽  
pp. 727-742 ◽  
Author(s):  
Joachim Büschken ◽  
Greg M. Allenby

User-generated content in the form of customer reviews, blogs, and tweets is an emerging and rich source of data for marketers. Topic models have been successfully applied to such data, demonstrating that empirical text analysis benefits greatly from a latent variable approach that summarizes high-level interactions among words. We propose a new topic model that allows for serial dependency of topics in text. That is, topics may carry over from word to word in a document, violating the bag-of-words assumption in traditional topic models. In the proposed model, topic carryover is informed by sentence conjunctions and punctuation. Typically, such observed information is eliminated prior to analyzing text data (i.e., preprocessing) because words such as “and” and “but” do not differentiate topics. We find that these elements of grammar contain information relevant to topic changes. We examine the performance of our models using multiple data sets and establish boundary conditions for when our model leads to improved inference about customer evaluations. Implications and opportunities for future research are discussed.
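A minimal sketch of the observation the model builds on: conjunctions such as "and"/"but" and punctuation, normally stripped during preprocessing, mark candidate points where the topic may change. The regex and review text below are invented for illustration and are not the authors' model, which handles topic carryover probabilistically.

```python
import re

# Conjunctions and punctuation as candidate topic-change markers.
BREAKS = re.compile(r"\b(?:and|but)\b|[.,;]")

def segments(review):
    """Split a review at conjunctions/punctuation: spans within which
    a topic is assumed to carry over word to word."""
    return [s.strip() for s in BREAKS.split(review) if s.strip()]

segments("great battery but the screen is dim, overall happy")
# -> ["great battery", "the screen is dim", "overall happy"]
```

Each resulting span plausibly holds one topic (battery, screen, overall sentiment), which is the grammatical signal the paper shows is discarded too early by standard preprocessing.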


2019 ◽  
Vol 28 (3) ◽  
pp. 263-272 ◽  
Author(s):  
Tobias Hecking ◽  
Loet Leydesdorff

We replicate and analyze the topic model which was commissioned to King’s College and Digital Science for the Research Evaluation Framework (REF 2014) in the United Kingdom: 6,638 case descriptions of societal impact were submitted by 154 higher-education institutes. We compare the Latent Dirichlet Allocation (LDA) model with Principal Component Analysis (PCA) of document-term matrices using the same data. Since topic models are almost by definition applied to text corpora which are too large to read, validation of the results of these models is hardly possible; furthermore, the models are irreproducible for a number of reasons. However, removing a small fraction of the documents from the sample—a test for reliability—has on average a larger impact in terms of decay on LDA than on PCA-based models. In terms of semantic coherence, LDA models outperform PCA-based models. In our opinion, results of the topic models are statistical and should not be used for grant selection and micro decision-making about research without follow-up using domain-specific semantic maps.
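The leave-out reliability test mentioned in the abstract can be illustrated with a toy stand-in: drop a few documents, recompute, and measure how much the output changes. The sketch below uses raw term frequencies in place of fitted LDA/PCA loadings, and the documents are invented; it shows only the shape of the test, not the authors' implementation.

```python
from collections import Counter

def top_terms(docs, k=3):
    """Rank terms by corpus frequency -- a stand-in for a fitted
    topic/component loading vector in this toy sketch."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return {w for w, _ in counts.most_common(k)}

def stability(docs, k=3, drop=1):
    """Jaccard overlap of the top-k terms before and after removing
    `drop` documents -- the kind of leave-out reliability test the
    abstract applies to LDA- and PCA-based models."""
    full, reduced = top_terms(docs, k), top_terms(docs[:-drop], k)
    return len(full & reduced) / len(full | reduced)

# Invented impact-case-style snippets for illustration.
docs = [
    "impact impact research policy",
    "impact research research society society",
    "policy policy policy health",
    "health outreach",
]
```

A model whose `stability` decays quickly as `drop` grows is unreliable in the sense the paper measures; the finding is that LDA decays faster than PCA under this kind of perturbation.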


2021 ◽  
Vol 9 (2) ◽  
pp. 404-409
Author(s):  
K. Prashant Gokul, et al.

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this work, we propose to combine statistical topic modeling with pattern mining techniques to produce pattern-based topic models that enhance the semantic representations of conventional word-based topic models. Using the proposed pattern-based topic model, users' preferences can be modeled with multiple topics, each of which is represented by semantically rich patterns. A novel information filtering model is proposed here: user information needs are expressed in terms of multiple topics, where each topic is represented by patterns. The algorithm produces results comparable to the best implementations while running orders of magnitude faster.
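The core idea of a pattern-based topic representation can be sketched with a toy frequent-itemset step: represent a topic by word pairs that co-occur across documents rather than by single words. The documents and support threshold below are invented; the paper combines pattern mining with a full statistical topic model rather than this bare counting step.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(docs, min_support=2):
    """Count word pairs co-occurring within a document; keep pairs
    appearing in at least `min_support` documents."""
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {p for p, c in pair_counts.items() if c >= min_support}

# Invented mini-corpus for illustration.
docs = [
    "machine learning model",
    "learning model evaluation",
    "machine learning pipeline",
]
```

Pairs that clear the support threshold (here, word pairs recurring across documents) are semantically richer topic descriptors than isolated high-frequency words.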


2017 ◽  
Vol 233 ◽  
pp. 111-136 ◽  
Author(s):  
Kyle Jaros ◽  
Jennifer Pan

Xi Jinping's rise to power in late 2012 brought immediate political realignments in China, but the extent of these shifts has remained unclear. In this paper, we evaluate whether the perceived changes associated with Xi Jinping's ascent – increased personalization of power, centralization of authority, Party dominance and anti-Western sentiment – were reflected in the content of provincial-level official media. As past research makes clear, media in China have strong signalling functions, and media coverage patterns can reveal which actors are up and down in politics. Applying innovations in automated text analysis to nearly two million newspaper articles published between 2011 and 2014, we identify and tabulate the individuals and organizations appearing in official media coverage in order to help characterize political shifts in the early years of Xi Jinping's leadership. We find substantively mixed and regionally varied trends in the media coverage of political actors, qualifying the prevailing picture of China's “new normal.” Provincial media coverage reflects increases in the personalization and centralization of political authority, but we find a drop in the media profile of Party organizations and see uneven declines in the media profile of foreign actors. More generally, we highlight marked variation across provinces in coverage trends.
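The tabulation step the abstract describes can be sketched as a simple mention count: tally how often named political actors appear across a set of articles. The actor list and article snippets below are invented stand-ins; the paper's pipeline covers nearly two million articles and a far larger actor set.

```python
from collections import Counter

# Hypothetical actor list for illustration only.
ACTORS = ["Xi Jinping", "State Council", "Politburo"]

def actor_counts(articles):
    """Tally mentions of each actor across all articles."""
    counts = Counter()
    for text in articles:
        for actor in ACTORS:
            counts[actor] += text.count(actor)
    return counts

articles = [
    "Xi Jinping addressed the State Council on reform.",
    "The Politburo met; Xi Jinping presided.",
]
counts = actor_counts(articles)
```

Comparing such tallies over time and across provincial outlets is what lets the authors track whose media profile rises or falls.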


2018 ◽  
Vol 46 (1) ◽  

Author(s):  
Damian Trilling ◽  
Jelle Boumans

Automated analysis of Dutch language-based texts: An overview and research agenda

While automated methods of content analysis are increasingly popular in today’s communication research, these methods have hardly been adopted by communication scholars studying texts in Dutch. This essay offers an overview of the possibilities and current limitations of automated text analysis approaches in the context of the Dutch language. Particularly in dictionary-based approaches, research is far less prolific than research on the English language. We divide the most common types of content-analytical research questions into three categories: 1) research problems for which automated methods ought to be used, 2) research problems for which automated methods could be used, and 3) research problems for which automated methods (currently) cannot be used. Finally, we give suggestions for the advancement of automated text analysis approaches for Dutch texts.

Keywords: automated content analysis, Dutch, dictionaries, supervised machine learning, unsupervised machine learning
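The dictionary-based approach the essay discusses can be sketched in a few lines: score a text by the share of its tokens that match a word list. The Dutch economy terms and example sentence below are invented for illustration and are not an existing Dutch dictionary resource.

```python
# Hypothetical Dutch economy-topic dictionary for illustration only.
ECONOMY_TERMS = {"economie", "begroting", "belasting", "werkloosheid"}

def dictionary_score(text, terms=ECONOMY_TERMS):
    """Share of tokens in `text` that match the dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in terms for t in tokens) / len(tokens)

doc = "de begroting en de belasting domineren het debat"
score = dictionary_score(doc)
```

The method's quality depends entirely on the word list, which is exactly why the essay flags the scarcity of validated Dutch dictionaries as a bottleneck.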

