UIMA GRID: Distributed Large-scale Text Analysis

Author(s):
Michael Thomas Egner
Markus Lorch
Edd Biddle


2021
Vol 40 (3)
Author(s):
Zhiyu Wang
Jingyu Wu
Guang Yu
Zhiping Song

In traditional historical research, interpreting historical documents subjectively and manually causes problems such as one-sided understanding, selective analysis, and one-way knowledge connection. In this study, we use machine learning to automatically analyze and explore historical documents from a text analysis and visualization perspective, addressing the problem that large-scale historical data are difficult for humans to read and intuitively understand. Our data analysis samples are the historical documents of the Qing Dynasty Hetu Dangse, preserved in the Archives of Liaoning Province. China's Hetu Dangse is the world's largest Qing Dynasty thematic archive written in Manchu and Chinese characters. Through word frequency analysis, correlation analysis, co-word clustering, the word2vec model, and SVM (Support Vector Machine) algorithms, we visualize historical documents, reveal the relationships between the functions of government departments in the Shengjing area of the Qing Dynasty, achieve automatic classification of historical archives, improve the efficient use of historical materials, and build connections between historical knowledge. In this way, archivists can be given practical guidance in the management and compilation of historical materials.
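The classification step described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the tokens, 3-dimensional "embeddings," and department labels are invented, standing in for word2vec vectors trained on the Hetu Dangse corpus, and scikit-learn's `LinearSVC` stands in for whichever SVM variant the authors used.

```python
# Sketch: map each document to a vector (the mean of its word embeddings),
# then train an SVM to assign documents to department categories.
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical word vectors (real ones would come from a word2vec model).
word_vecs = {
    "tax":    np.array([0.9, 0.1, 0.0]),
    "grain":  np.array([0.8, 0.2, 0.1]),
    "troop":  np.array([0.1, 0.9, 0.0]),
    "patrol": np.array([0.0, 0.8, 0.2]),
}

def doc_vector(tokens):
    """Average the embeddings of a document's known tokens."""
    return np.mean([word_vecs[t] for t in tokens if t in word_vecs], axis=0)

train_docs = [["tax", "grain"], ["grain", "tax"], ["troop", "patrol"], ["patrol", "troop"]]
train_labels = ["revenue", "revenue", "military", "military"]

X = np.vstack([doc_vector(d) for d in train_docs])
clf = LinearSVC().fit(X, train_labels)

# Classify an unseen document dominated by military vocabulary.
print(clf.predict([doc_vector(["tax", "patrol", "troop"])])[0])
```

Averaging word vectors is the simplest way to turn variable-length documents into the fixed-length inputs an SVM requires; the study's actual features may be richer.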


2019
Vol 0 (8/2018)
pp. 17-28
Author(s):  
Maciej Jankowski

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable the use of this model in large-scale processing. One problem is that a data scientist has to choose the number of topics manually, a step that requires some prior analysis. A few methods have been proposed to automate this step, but none of them works well when LDA is used as preprocessing for further classification. In this paper, we propose an ensemble approach that allows us to use more than one model at the prediction phase, thereby reducing the need to find a single best number of topics. We also analyze a few methods of estimating the topic number.
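The ensemble idea can be sketched in a few lines: rather than committing to one topic count K, fit several LDA models with different K and concatenate their document-topic distributions into one feature vector for the downstream classifier. This is an illustrative sketch using scikit-learn; the toy corpus and candidate K values are not the paper's settings.

```python
# Sketch: an LDA ensemble that sidesteps picking a single topic number.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks market trading prices",
    "football match goal score",
    "market prices inflation stocks",
    "goal score team football",
]
X = CountVectorizer().fit_transform(docs)

# Fit one LDA per candidate topic count and stack the topic distributions.
candidate_ks = [2, 3, 4]
features = np.hstack([
    LatentDirichletAllocation(n_components=k, random_state=0).fit_transform(X)
    for k in candidate_ks
])
print(features.shape)  # one row per document, sum(candidate_ks) columns
```

The concatenated features can then feed any classifier, so no single K has to be "right" before training begins.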


Author(s):  
Ethan Fast
Binbin Chen
Michael S. Bernstein

Human language is colored by a broad range of topics, but existing text analysis tools focus on only a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated, such as neglect, government, and social media. We show that Empath's data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.
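Empath's core discovery step can be illustrated as nearest-neighbor search around the seed words in embedding space. The tiny hand-made vectors below are stand-ins for the neural embedding Empath learns from billions of words, and the vocabulary is invented for illustration; only the ranking-by-cosine-similarity idea reflects the tool.

```python
# Sketch: expand a seed set into a category by ranking candidate terms
# by cosine similarity to the centroid of the seed embeddings.
import numpy as np

emb = {
    "bleed":  np.array([0.9, 0.1]),
    "punch":  np.array([0.8, 0.3]),
    "wound":  np.array([0.85, 0.2]),
    "garden": np.array([0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

seeds = ["bleed", "punch"]
centroid = np.mean([emb[w] for w in seeds], axis=0)

candidates = [w for w in emb if w not in seeds]
ranked = sorted(candidates, key=lambda w: cosine(emb[w], centroid), reverse=True)
print(ranked[0])  # the candidate closest to the seeds
```

In Empath, the top-ranked candidates are then shown to crowd workers, who filter out terms that do not fit the intended category.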


2019
Vol IV (II)
pp. 1-6
Author(s):  
Mark Perkins

The huge proliferation of textual (and other) data in digital and organisational sources has led to new techniques of text analysis. The potential thereby unleashed may be underpinned by further theoretical developments to Discourse Stream Analysis (DSA), as presented here. These include the notion of change in the discourse stream in terms of discourse stream fronts, linguistic elements evolving in real time, and notions of time itself in terms of relative speed, subject orientation, and perception. Big data has also given rise to fake news, the manipulation of messages on a large scale. Fake news is conveyed in fake discourse streams and has led to a new field of description and analysis.


2020
Vol 9 (1)
Author(s):
Max Pellert
Simon Schweighofer
David Garcia

Understanding the temporal dynamics of affect is crucial for our understanding of human emotions in general. In this study, we empirically test a computational model of affective dynamics by analyzing a large-scale dataset of Facebook status updates using text analysis techniques. Our analyses support the central assumptions of our model: after stimulation, affective states, quantified as valence and arousal, exponentially return to an individual-specific baseline. On average, this baseline lies at a slightly positive valence value and at a moderate arousal point below the midpoint. Furthermore, affective expression, in this case posting a status update on Facebook, immediately pushes arousal and valence towards the baseline by a proportional value. These results are robust to the choice of text analysis technique and illustrate the fast timescale of affective dynamics in social media text. These outcomes are highly relevant for affective computing, the detection and modeling of collective emotions, the refinement of psychological research methodology, and the detection of abnormal, and potentially pathological, individual affect dynamics.
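The exponential-return assumption described above corresponds to a simple relaxation curve, a(t) = b + (a0 − b)·exp(−t/τ), where b is the individual baseline, a0 the state right after stimulation, and τ the decay timescale. The numbers below are illustrative, not the paper's estimates.

```python
# Sketch: exponential relaxation of an affective state (valence or arousal)
# back to an individual-specific baseline after a stimulus.
import math

def affect(t, a0=0.9, b=0.1, tau=2.0):
    """Affective state t time units after stimulation."""
    return b + (a0 - b) * math.exp(-t / tau)

print(round(affect(0.0), 3))   # just after the stimulus: the full value a0
print(round(affect(10.0), 3))  # several timescales later: near the baseline b
```

Posting a status update, in the model, additionally shifts the current state toward b by a proportional amount, on top of this passive decay.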


2021
pp. 91-109
Author(s):
Jan Schwalbach
Christian Rauh

Parliamentary speeches present one of the most consistently available sources of information about the political priorities, actor positions, and conflict structures in democratic states. Recent advances in automated text analysis offer more and more tools for tapping into this information reservoir in a systematic manner. However, collecting the high-quality text data needed to unleash the comparative potential of the various available text analysis algorithms is a costly endeavor and faces various pragmatic hurdles. In response to this challenge, this chapter offers three contributions. First, we outline best-practice guidelines and useful tools for researchers wishing to collect or extend existing legislative debate corpora. Second, we present an extended version of the ParlSpeech Corpus. Third, we highlight the difficulties of comparing text-as-data outputs across different parliaments, pointing to varying languages, varying traditions and conventions, and varying metadata availability.

