"The Power of Words in Big Data: Ngrams, Mega-Text Corpora, and Computer-Automated Text Analysis"

2014 ◽  
Vol 2014 (1) ◽  
pp. 13878
Author(s):  
Eric Abrahamson
2020 ◽  
Vol 34 (1) ◽  
pp. 19-42
Author(s):  
David Moats

It is often claimed that the rise of so-called ‘big data’ and computationally advanced methods may exacerbate tensions between disciplines like data science and anthropology. This paper is an attempt to reflect on these possible tensions and their resolution, empirically. It contributes to a growing body of literature which observes interdisciplinary collaborations around new methods and digital infrastructures in practice but argues that many existing arrangements for interdisciplinary collaboration enforce a separation between disciplines in which identities are not really put at risk. In order to disrupt these standard roles and routines we put on a series of workshops in which mainly self-identified qualitative or non-technical researchers were encouraged to use digital tools (scrapers, automated text analysis and data visualisations). The paper focuses on three empirical examples from the workshops in which tensions, both between disciplines and methods, flared up and how they were ultimately managed or settled. In order to characterise both these tensions and negotiating strategies I draw on Woolgar and Stengers’ use of humour and irony to describe how disciplines relate to each other's truth claims. I conclude that while there is great potential in more open-ended collaborative settings, qualitative social scientists may need to confront some of their own disciplinary baggage in order for better dialogue and more radical mixings between disciplines to occur.


2017 ◽  
Author(s):  
Erik de Vries ◽  
Martijn Schoonvelde ◽  
Gijs Schumacher

Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al., 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models – such as topic models. We use the europarl dataset and compare term-document matrices as well as topic model results from gold standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find term-document matrices for both text corpora to be highly similar, with significant but minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar with only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.


2018 ◽  
Vol 26 (4) ◽  
pp. 417-430 ◽  
Author(s):  
Erik de Vries ◽  
Martijn Schoonvelde ◽  
Gijs Schumacher

Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models—such as topic models. We use the europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
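The comparison of term-document matrices described in this abstract can be illustrated with a small sketch. It is only a toy illustration, not the authors' procedure: the abstract does not say which similarity measure was used, so cosine similarity is an assumption here, and the two sentences, the tokenizer, and the function names `term_counts` and `cosine_similarity` are all hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (one bag-of-words row)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for a gold-standard translation and a
# machine translation of the same source sentence.
human = "The committee adopted the report on fisheries policy"
machine = "The committee approved the report on fishing policy"

human_row = term_counts(human)
machine_row = term_counts(machine)

# Document-level similarity of the two bag-of-words rows.
sim = cosine_similarity(human_row, machine_row)

# Overlap in the feature sets (vocabularies) of the two versions.
shared = set(human_row) & set(machine_row)

print(f"cosine similarity: {sim:.2f}")
print(f"shared features: {sorted(shared)}")
```

In this toy case the two versions differ only in two word choices, so most features overlap. A real comparison would build full matrices over a whole corpus and repeat the calculation per document and per language.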


2017 ◽  
Vol 233 ◽  
pp. 111-136 ◽  
Author(s):  
Kyle Jaros ◽  
Jennifer Pan

Xi Jinping's rise to power in late 2012 brought immediate political realignments in China, but the extent of these shifts has remained unclear. In this paper, we evaluate whether the perceived changes associated with Xi Jinping's ascent – increased personalization of power, centralization of authority, Party dominance and anti-Western sentiment – were reflected in the content of provincial-level official media. As past research makes clear, media in China have strong signalling functions, and media coverage patterns can reveal which actors are up and down in politics. Applying innovations in automated text analysis to nearly two million newspaper articles published between 2011 and 2014, we identify and tabulate the individuals and organizations appearing in official media coverage in order to help characterize political shifts in the early years of Xi Jinping's leadership. We find substantively mixed and regionally varied trends in the media coverage of political actors, qualifying the prevailing picture of China's “new normal.” Provincial media coverage reflects increases in the personalization and centralization of political authority, but we find a drop in the media profile of Party organizations and see uneven declines in the media profile of foreign actors. More generally, we highlight marked variation across provinces in coverage trends.
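The idea of tabulating which actors appear in media coverage over time can be sketched as follows. This is a minimal illustration under strong assumptions: it uses plain substring matching against a fixed list of names, which is a stand-in and not the method used in the paper, and the actor list, the toy articles, and the function name `tabulate_mentions` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical actor list; a real study would need a far larger list
# or an entity-recognition step.
ACTORS = ["Xi Jinping", "State Council"]

# Hypothetical toy articles as (publication year, text) pairs.
articles = [
    (2011, "The State Council issued new guidance on rural investment."),
    (2013, "Xi Jinping visited the province. Xi Jinping met local officials."),
    (2013, "The State Council and Xi Jinping were both cited in the report."),
]


def tabulate_mentions(articles, actors):
    """Count exact-string mentions of each actor, grouped by year."""
    table = defaultdict(Counter)
    for year, text in articles:
        for actor in actors:
            table[year][actor] += text.count(actor)
    return table


mentions = tabulate_mentions(articles, ACTORS)

for year in sorted(mentions):
    print(year, dict(mentions[year]))
```

Grouping the same counts by province as well as by year would allow the kind of regional comparison the abstract reports.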


2018 ◽  
Vol 46 (1) ◽  

Author(s):  
Damian Trilling ◽  
Jelle Boumans

Automated analysis of Dutch language-based texts. An overview and research agenda

While automated methods of content analysis are increasingly popular in today’s communication research, these methods have hardly been adopted by communication scholars studying texts in Dutch. This essay offers an overview of the possibilities and current limitations of automated text analysis approaches in the context of the Dutch language. Particularly in dictionary-based approaches, research is far less prolific than research on the English language. We divide the most common types of content-analytical research questions into three categories: 1) research problems for which automated methods ought to be used, 2) research problems for which automated methods could be used, and 3) research problems for which automated methods (currently) cannot be used. Finally, we give suggestions for the advancement of automated text analysis approaches for Dutch texts.

Keywords: automated content analysis, Dutch, dictionaries, supervised machine learning, unsupervised machine learning


Author(s):  
A S Mukhin ◽  
I A Rytsarev ◽  
R A Paringer ◽  
A V Kupriyanov ◽  
D V Kirsh

The article addresses the definition of such groups in social networks. Data from the social network Vk were chosen as the object of study. Text data were collected, processed and analyzed. To obtain the necessary information, research was conducted on optimizing data collection from Vk. A software tool was developed that collects and then processes the necessary data from the specified resources. Existing text analysis algorithms, mainly those for large volumes of text, were investigated and applied.

