No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications

2018 ◽  
Vol 26 (4) ◽  
pp. 417-430 ◽  
Author(s):  
Erik de Vries ◽  
Martijn Schoonvelde ◽  
Gijs Schumacher

Automated text analysis allows researchers to analyze large quantities of text. Yet comparative researchers face a major challenge: people in different countries speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al., 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find the TDMs for both text corpora to be highly similar, with only minor differences across languages. What is more, we find considerable overlap in the sets of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, again with only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers who use bag-of-words text models.
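As a toy illustration of the document-level TDM comparison described in the abstract, the sketch below builds bag-of-words vectors for a hypothetical human ("gold standard") translation and a machine translation of the same sentence, then computes their cosine similarity and feature overlap. The sentences and whitespace tokenizer are invented for illustration; the paper's actual analysis uses the full Europarl corpus.

```python
from collections import Counter
import math

def bow(text):
    """Tokenize to lowercase word counts (a toy bag-of-words)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical sentence pair: a human ("gold standard") translation
# and a machine translation of the same source sentence.
gold = "the parliament debates the budget proposal today"
mt = "the parliament discusses the budget proposal today"

sim = cosine(bow(gold), bow(mt))
# Feature overlap: share of gold-translation terms also produced by MT.
overlap = len(bow(gold).keys() & bow(mt).keys()) / len(bow(gold))
```

Scaled up to document and corpus level, high values of `sim` and `overlap` correspond to the paper's finding that machine-translated TDMs closely track the gold-standard ones.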


2020 ◽  
Vol 39 (4) ◽  
pp. 727-742 ◽  
Author(s):  
Joachim Büschken ◽  
Greg M. Allenby

User-generated content in the form of customer reviews, blogs, and tweets is an emerging and rich source of data for marketers. Topic models have been successfully applied to such data, demonstrating that empirical text analysis benefits greatly from a latent variable approach that summarizes high-level interactions among words. We propose a new topic model that allows for serial dependency of topics in text. That is, topics may carry over from word to word in a document, violating the bag-of-words assumption in traditional topic models. In the proposed model, topic carryover is informed by sentence conjunctions and punctuation. Typically, such observed information is eliminated prior to analyzing text data (i.e., preprocessing) because words such as “and” and “but” do not differentiate topics. We find that these elements of grammar contain information relevant to topic changes. We examine the performance of our models using multiple data sets and establish boundary conditions for when our model leads to improved inference about customer evaluations. Implications and opportunities for future research are discussed.
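A minimal sketch of the observation the model builds on: conjunctions such as "and"/"but" and punctuation, normally stripped during preprocessing, mark candidate points where the topic may change. The regex and review text below are invented for illustration and are not the authors' model, which handles topic carryover probabilistically.

```python
import re

# Conjunctions and punctuation as candidate topic-change markers.
BREAKS = re.compile(r"\b(?:and|but)\b|[.,;]")

def segments(review):
    """Split a review at conjunctions/punctuation: spans within which
    a topic is assumed to carry over word to word."""
    return [s.strip() for s in BREAKS.split(review) if s.strip()]

segments("great battery but the screen is dim, overall happy")
# -> ["great battery", "the screen is dim", "overall happy"]
```

Each resulting span plausibly holds one topic (battery, screen, overall sentiment), which is the grammatical signal the paper shows is discarded too early by standard preprocessing.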


2019 ◽  
Vol 28 (3) ◽  
pp. 263-272 ◽  
Author(s):  
Tobias Hecking ◽  
Loet Leydesdorff

We replicate and analyze the topic model which was commissioned to King’s College and Digital Science for the Research Evaluation Framework (REF 2014) in the United Kingdom: 6,638 case descriptions of societal impact were submitted by 154 higher-education institutes. We compare the Latent Dirichlet Allocation (LDA) model with Principal Component Analysis (PCA) of document-term matrices using the same data. Since topic models are almost by definition applied to text corpora which are too large to read, validation of the results of these models is hardly possible; furthermore, the models are irreproducible for a number of reasons. However, removing a small fraction of the documents from the sample—a test for reliability—has on average a larger impact in terms of decay on LDA than on PCA-based models. In terms of semantic coherence, LDA models outperform PCA-based models. In our opinion, results of the topic models are statistical and should not be used for grant selection and micro decision-making about research without follow-up using domain-specific semantic maps.
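The leave-out reliability test mentioned in the abstract can be illustrated with a toy stand-in: drop a few documents, recompute, and measure how much the output changes. The sketch below uses raw term frequencies in place of fitted LDA/PCA loadings, and the documents are invented; it shows only the shape of the test, not the authors' implementation.

```python
from collections import Counter

def top_terms(docs, k=3):
    """Rank terms by corpus frequency -- a stand-in for a fitted
    topic/component loading vector in this toy sketch."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return {w for w, _ in counts.most_common(k)}

def stability(docs, k=3, drop=1):
    """Jaccard overlap of the top-k terms before and after removing
    `drop` documents -- the kind of leave-out reliability test the
    abstract applies to LDA- and PCA-based models."""
    full, reduced = top_terms(docs, k), top_terms(docs[:-drop], k)
    return len(full & reduced) / len(full | reduced)

# Invented impact-case-style snippets for illustration.
docs = [
    "impact impact research policy",
    "impact research research society society",
    "policy policy policy health",
    "health outreach",
]
```

A model whose `stability` decays quickly as `drop` grows is unreliable in the sense the paper measures; the finding is that LDA decays faster than PCA under this kind of perturbation.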


2021 ◽  
Vol 9 (2) ◽  
pp. 404-409
Author(s):  
K. Prashant Gokul, et al.

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this work, we propose to combine statistical topic modeling with pattern mining techniques to produce pattern-based topic models that enhance the semantic representations of conventional word-based topic models. Using the proposed pattern-based topic model, users' preferences can be modeled with multiple topics, each of which is represented by semantically rich patterns. A novel information filtering model is proposed here: user information needs are expressed in terms of multiple topics, where each topic is represented by patterns. The algorithm produces results comparable to the best implementations while running orders of magnitude faster.
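The core idea of a pattern-based topic representation can be sketched with a toy frequent-itemset step: represent a topic by word pairs that co-occur across documents rather than by single words. The documents and support threshold below are invented; the paper combines pattern mining with a full statistical topic model rather than this bare counting step.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(docs, min_support=2):
    """Count word pairs co-occurring within a document; keep pairs
    appearing in at least `min_support` documents."""
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {p for p, c in pair_counts.items() if c >= min_support}

# Invented mini-corpus for illustration.
docs = [
    "machine learning model",
    "learning model evaluation",
    "machine learning pipeline",
]
```

Pairs that clear the support threshold (here, word pairs recurring across documents) are semantically richer topic descriptors than isolated high-frequency words.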


2017 ◽  
Vol 233 ◽  
pp. 111-136 ◽  
Author(s):  
Kyle Jaros ◽  
Jennifer Pan

Xi Jinping's rise to power in late 2012 brought immediate political realignments in China, but the extent of these shifts has remained unclear. In this paper, we evaluate whether the perceived changes associated with Xi Jinping's ascent – increased personalization of power, centralization of authority, Party dominance and anti-Western sentiment – were reflected in the content of provincial-level official media. As past research makes clear, media in China have strong signalling functions, and media coverage patterns can reveal which actors are up and down in politics. Applying innovations in automated text analysis to nearly two million newspaper articles published between 2011 and 2014, we identify and tabulate the individuals and organizations appearing in official media coverage in order to help characterize political shifts in the early years of Xi Jinping's leadership. We find substantively mixed and regionally varied trends in the media coverage of political actors, qualifying the prevailing picture of China's “new normal.” Provincial media coverage reflects increases in the personalization and centralization of political authority, but we find a drop in the media profile of Party organizations and see uneven declines in the media profile of foreign actors. More generally, we highlight marked variation across provinces in coverage trends.
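The tabulation step the abstract describes can be sketched as a simple mention count: tally how often named political actors appear across a set of articles. The actor list and article snippets below are invented stand-ins; the paper's pipeline covers nearly two million articles and a far larger actor set.

```python
from collections import Counter

# Hypothetical actor list for illustration only.
ACTORS = ["Xi Jinping", "State Council", "Politburo"]

def actor_counts(articles):
    """Tally mentions of each actor across all articles."""
    counts = Counter()
    for text in articles:
        for actor in ACTORS:
            counts[actor] += text.count(actor)
    return counts

articles = [
    "Xi Jinping addressed the State Council on reform.",
    "The Politburo met; Xi Jinping presided.",
]
counts = actor_counts(articles)
```

Comparing such tallies over time and across provincial outlets is what lets the authors track whose media profile rises or falls.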


2018 ◽  
Vol 46 (1) ◽  

Author(s):  
Damian Trilling ◽  
Jelle Boumans

Automated analysis of Dutch language-based texts: An overview and research agenda

While automated methods of content analysis are increasingly popular in today’s communication research, these methods have hardly been adopted by communication scholars studying texts in Dutch. This essay offers an overview of the possibilities and current limitations of automated text analysis approaches in the context of the Dutch language. Particularly in dictionary-based approaches, research is far less prolific than research on the English language. We divide the most common types of content-analytical research questions into three categories: 1) research problems for which automated methods ought to be used, 2) research problems for which automated methods could be used, and 3) research problems for which automated methods (currently) cannot be used. Finally, we give suggestions for the advancement of automated text analysis approaches for Dutch texts.

Keywords: automated content analysis, Dutch, dictionaries, supervised machine learning, unsupervised machine learning
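The dictionary-based approach the essay discusses can be sketched in a few lines: score a text by the share of its tokens that match a word list. The Dutch economy terms and example sentence below are invented for illustration and are not an existing Dutch dictionary resource.

```python
# Hypothetical Dutch economy-topic dictionary for illustration only.
ECONOMY_TERMS = {"economie", "begroting", "belasting", "werkloosheid"}

def dictionary_score(text, terms=ECONOMY_TERMS):
    """Share of tokens in `text` that match the dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in terms for t in tokens) / len(tokens)

doc = "de begroting en de belasting domineren het debat"
score = dictionary_score(doc)
```

The method's quality depends entirely on the word list, which is exactly why the essay flags the scarcity of validated Dutch dictionaries as a bottleneck.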

