UIMA GRID: Distributed Large-scale Text Analysis

Author(s):
Michael Thomas Egner
Markus Lorch
Edd Biddle


2021
Vol 40 (3)
Author(s):
Zhiyu Wang
Jingyu Wu
Guang Yu
Zhiping Song

In traditional historical research, interpreting historical documents subjectively and manually causes problems such as one-sided understanding, selective analysis, and one-way knowledge connection. In this study, we use machine learning to automatically analyze and explore historical documents from a text analysis and visualization perspective, addressing the problem that large-scale historical data are difficult for humans to read and intuitively understand. Our data analysis samples are the historical documents of the Qing Dynasty Hetu Dangse, preserved in the Archives of Liaoning Province. China's Hetu Dangse is the world's largest Qing Dynasty thematic archive written in Manchu and Chinese characters. Through word frequency analysis, correlation analysis, co-word clustering, the word2vec model, and SVM (Support Vector Machine) algorithms, we visualize historical documents, reveal the relationships between the functions of government departments in the Shengjing area of the Qing Dynasty, achieve automatic classification of historical archives, improve the efficient use of historical materials, and build connections between historical knowledge. In this way, archivists can be given practical guidance in the management and compilation of historical materials.
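The classification step described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the tokens, 3-dimensional "embeddings," and department labels are invented, standing in for word2vec vectors trained on the Hetu Dangse corpus, and scikit-learn's `LinearSVC` stands in for whichever SVM variant the authors used.

```python
# Sketch: map each document to a vector (the mean of its word embeddings),
# then train an SVM to assign documents to department categories.
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical word vectors (real ones would come from a word2vec model).
word_vecs = {
    "tax":    np.array([0.9, 0.1, 0.0]),
    "grain":  np.array([0.8, 0.2, 0.1]),
    "troop":  np.array([0.1, 0.9, 0.0]),
    "patrol": np.array([0.0, 0.8, 0.2]),
}

def doc_vector(tokens):
    """Average the embeddings of a document's known tokens."""
    return np.mean([word_vecs[t] for t in tokens if t in word_vecs], axis=0)

train_docs = [["tax", "grain"], ["grain", "tax"], ["troop", "patrol"], ["patrol", "troop"]]
train_labels = ["revenue", "revenue", "military", "military"]

X = np.vstack([doc_vector(d) for d in train_docs])
clf = LinearSVC().fit(X, train_labels)

# Classify an unseen document dominated by military vocabulary.
print(clf.predict([doc_vector(["tax", "patrol", "troop"])])[0])
```

Averaging word vectors is the simplest way to turn variable-length documents into the fixed-length inputs an SVM requires; the study's actual features may be richer.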


2019
Vol 0 (8/2018)
pp. 17-28
Author(s):  
Maciej Jankowski

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable the use of this model in large-scale processing. One problem is that a data scientist has to choose the number of topics manually, a step that requires some prior analysis. A few methods have been proposed to automate this step, but none of them works well when LDA is used as preprocessing for further classification. In this paper, we propose an ensemble approach that allows us to use more than one model at the prediction phase, thereby reducing the need to find a single best number of topics. We also analyze a few methods of estimating the topic number.
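The ensemble idea can be sketched in a few lines: rather than committing to one topic count K, fit several LDA models with different K and concatenate their document-topic distributions into one feature vector for the downstream classifier. This is an illustrative sketch using scikit-learn; the toy corpus and candidate K values are not the paper's settings.

```python
# Sketch: an LDA ensemble that sidesteps picking a single topic number.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks market trading prices",
    "football match goal score",
    "market prices inflation stocks",
    "goal score team football",
]
X = CountVectorizer().fit_transform(docs)

# Fit one LDA per candidate topic count and stack the topic distributions.
candidate_ks = [2, 3, 4]
features = np.hstack([
    LatentDirichletAllocation(n_components=k, random_state=0).fit_transform(X)
    for k in candidate_ks
])
print(features.shape)  # one row per document, sum(candidate_ks) columns
```

The concatenated features can then feed any classifier, so no single K has to be "right" before training begins.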


Author(s):  
Ethan Fast
Binbin Chen
Michael S. Bernstein

Human language is colored by a broad range of topics, but existing text analysis tools focus on only a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated, such as neglect, government, and social media. We show that Empath's data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.
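Empath's core discovery step can be illustrated as nearest-neighbor search around the seed words in embedding space. The tiny hand-made vectors below are stand-ins for the neural embedding Empath learns from billions of words, and the vocabulary is invented for illustration; only the ranking-by-cosine-similarity idea reflects the tool.

```python
# Sketch: expand a seed set into a category by ranking candidate terms
# by cosine similarity to the centroid of the seed embeddings.
import numpy as np

emb = {
    "bleed":  np.array([0.9, 0.1]),
    "punch":  np.array([0.8, 0.3]),
    "wound":  np.array([0.85, 0.2]),
    "garden": np.array([0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

seeds = ["bleed", "punch"]
centroid = np.mean([emb[w] for w in seeds], axis=0)

candidates = [w for w in emb if w not in seeds]
ranked = sorted(candidates, key=lambda w: cosine(emb[w], centroid), reverse=True)
print(ranked[0])  # the candidate closest to the seeds
```

In Empath, the top-ranked candidates are then shown to crowd workers, who filter out terms that do not fit the intended category.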


2019
Vol IV (II)
pp. 1-6
Author(s):  
Mark Perkins

The huge proliferation of textual (and other) data in digital and organisational sources has led to new techniques of text analysis. The potential thereby unleashed may be underpinned by further theoretical developments to Discourse Stream Analysis (DSA), as presented here. These include the notion of change in the discourse stream in terms of discourse stream fronts, linguistic elements evolving in real time, and notions of time itself in terms of relative speed, subject orientation, and perception. Big data has also given rise to fake news, the manipulation of messages on a large scale. Fake news is conveyed in fake discourse streams and has led to a new field of description and analysis.


2020
Vol 9 (1)
Author(s):
Max Pellert
Simon Schweighofer
David Garcia

Understanding the temporal dynamics of affect is crucial for our understanding of human emotions in general. In this study, we empirically test a computational model of affective dynamics by analyzing a large-scale dataset of Facebook status updates using text analysis techniques. Our analyses support the central assumptions of our model: after stimulation, affective states, quantified as valence and arousal, exponentially return to an individual-specific baseline. On average, this baseline lies at a slightly positive valence value and at a moderate arousal point below the midpoint. Furthermore, affective expression, in this case posting a status update on Facebook, immediately pushes arousal and valence towards the baseline by a proportional value. These results are robust to the choice of text analysis technique and illustrate the fast timescale of affective dynamics in social media text. These outcomes are highly relevant for affective computing, the detection and modeling of collective emotions, the refinement of psychological research methodology, and the detection of abnormal, and potentially pathological, individual affect dynamics.
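The exponential-return assumption described above corresponds to a simple relaxation curve, a(t) = b + (a0 − b)·exp(−t/τ), where b is the individual baseline, a0 the state right after stimulation, and τ the decay timescale. The numbers below are illustrative, not the paper's estimates.

```python
# Sketch: exponential relaxation of an affective state (valence or arousal)
# back to an individual-specific baseline after a stimulus.
import math

def affect(t, a0=0.9, b=0.1, tau=2.0):
    """Affective state t time units after stimulation."""
    return b + (a0 - b) * math.exp(-t / tau)

print(round(affect(0.0), 3))   # just after the stimulus: the full value a0
print(round(affect(10.0), 3))  # several timescales later: near the baseline b
```

Posting a status update, in the model, additionally shifts the current state toward b by a proportional amount, on top of this passive decay.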


2021
pp. 91-109
Author(s):
Jan Schwalbach
Christian Rauh

Parliamentary speeches present one of the most consistently available sources of information about the political priorities, actor positions, and conflict structures in democratic states. Recent advances in automated text analysis offer more and more tools for tapping into this information reservoir in a systematic manner. However, collecting the high-quality text data needed to unleash the comparative potential of the various available text analysis algorithms is a costly endeavor and faces various pragmatic hurdles. In response to this challenge, this chapter offers three contributions. First, we outline best-practice guidelines and useful tools for researchers wishing to collect or extend existing legislative debate corpora. Second, we present an extended version of the ParlSpeech Corpus. Third, we highlight the difficulties of comparing text-as-data outputs across different parliaments, pointing to varying languages, varying traditions and conventions, and varying metadata availability.

