Interpretable Topic Extraction and Word Embedding Learning Using Non-Negative Tensor DEDICOM

2021 ◽  
Vol 3 (1) ◽  
pp. 123-167
Author(s):  
Lars Hillebrand ◽  
David Biesner ◽  
Christian Bauckhage ◽  
Rafet Sifa

Unsupervised topic extraction is a vital step in automatically distilling concise content information from large text corpora. Existing topic extraction methods lack the ability to link relations between topics, which would further aid text understanding. We therefore propose utilizing the Decomposition into Directional Components (DEDICOM) algorithm, which provides a uniquely interpretable matrix factorization for symmetric and asymmetric square matrices and tensors. We constrain DEDICOM to row-stochasticity and non-negativity in order to factorize pointwise mutual information matrices and tensors of text corpora. We identify latent topic clusters and their relations within the vocabulary and simultaneously learn interpretable word embeddings. Further, we introduce multiple methods based on alternating gradient descent to efficiently train constrained DEDICOM variants. We evaluate the qualitative topic modeling and word embedding performance of our proposed methods on several datasets, including a novel New York Times news dataset, and demonstrate how the DEDICOM algorithm provides deeper text analysis than competing matrix factorization approaches.
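The constrained factorization described above lends itself to a compact illustration. Below is a minimal sketch, assuming a precomputed pointwise mutual information matrix S; the learning rate, iteration count, and the simple clip-and-renormalize projection are illustrative choices, not the authors' settings. It factorizes S as A R Aᵀ with a non-negative, row-stochastic loading matrix A via alternating gradient descent.

```python
import numpy as np

def nonneg_dedicom(S, k, iters=500, lr=1e-3, seed=0):
    """Sketch: factorize a square PMI matrix S (n x n) as A @ R @ A.T,
    with A non-negative and row-stochastic, via alternating gradient descent.
    Hyperparameters here are illustrative, not the authors' settings."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    A = rng.random((n, k))
    A /= A.sum(axis=1, keepdims=True)        # start row-stochastic
    R = rng.random((k, k))

    for _ in range(iters):
        E = A @ R @ A.T - S                  # reconstruction error
        # gradient step on A for the Frobenius loss ||A R A^T - S||^2
        grad_A = 2 * (E @ A @ R.T + E.T @ A @ R)
        A = np.clip(A - lr * grad_A, 1e-12, None)
        A /= A.sum(axis=1, keepdims=True)    # simple projection back to row-stochastic
        # gradient step on R (left unconstrained in this sketch)
        grad_R = 2 * (A.T @ E @ A)
        R -= lr * grad_R
    return A, R
```

Under this reading, the rows of A act as word-over-topic loadings (the interpretable word embeddings) and R as the learned topic-topic relation matrix.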

Information ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 528
Author(s):  
Xiaoye Ouyang ◽  
Shudong Chen ◽  
Rong Wang

Distantly supervised relation extraction methods can automatically extract the relation between entity pairs, which is essential for the construction of a knowledge graph. However, the automatically constructed datasets contain large amounts of low-quality sentences and noisy words, and current distantly supervised methods ignore these noisy data, resulting in unacceptable accuracy. To mitigate this problem, we present a novel distantly supervised approach, SEGRE (Semantic Enhanced Graph attention networks Relation Extraction), for improved relation extraction. Our model first uses word position and entity type information to provide abundant local features and background knowledge. It then builds dependency trees to remove noisy words that are irrelevant to relations and employs Graph Attention Networks (GATs) to encode syntactic information, which also captures the important semantic features of relational words in each instance. Furthermore, to make our model more robust against noisy words, an intra-bag attention module is used to weight the bag representation and mitigate noise within the bag. Through extensive experiments on the Riedel New York Times (NYT) and Google IISc Distantly Supervised (GIDS) datasets, we demonstrate SEGRE’s effectiveness.
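As one concrete piece of such a pipeline, the intra-bag attention step can be sketched in a few lines. The function below is illustrative only; the name, the relation-query vector, and the plain dot-product scoring are assumptions, not SEGRE's actual implementation. The idea is simply to down-weight noisy sentences when forming the bag representation.

```python
import numpy as np

def intra_bag_attention(sent_reprs, rel_query):
    """Sketch of intra-bag attention: weight each sentence representation in a bag
    by its similarity to a relation query vector, so noisy sentences contribute
    less to the final bag representation."""
    scores = sent_reprs @ rel_query              # (num_sentences,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the bag
    return weights @ sent_reprs                  # weighted bag representation
```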


2021 ◽  
Author(s):  
Leonardo H. Rocha ◽  
Daniel Welter ◽  
Denio Duarte

Probabilistic topic approaches are tools for discovering and exploring thematic structures hidden in text collections. Given a collection of documents, the topic extraction task consists of building a vocabulary from the collection and computing the probability of each word belonging to a document of the collection. Then, based on the desired number of topics, the probability of each word being associated with a given topic is computed. A topic is thus a set of words ordered by their probability of being associated with that topic. Several approaches for building topic models are found in the literature, e.g., Hierarchical Dirichlet Process (HDP), Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Dirichlet-multinomial Regression (DMR). This work seeks to assess the quality of the topics built by the four approaches above. Quality is measured with coherence metrics, and all approaches receive the same document collection as input: news articles from the websites of Breitbart, Business Insider, The Atlantic, CNN, and the New York Times, totaling 50,000 articles. The results show that DMR and LDA are the best models for extracting topics from the collection used.
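For reference, the coherence score of one of the compared models can be computed with off-the-shelf tooling. The sketch below uses gensim's LDA implementation and c_v coherence; the tokenization, number of topics, and random seed are illustrative assumptions, not the settings used in the study.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def lda_coherence(texts, num_topics=20):
    """Sketch: train LDA on tokenized documents (a list of token lists) and
    score the resulting topics with c_v coherence."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()
```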


Author(s):  
Shangwen Lv ◽  
Wanhui Qian ◽  
Longtao Huang ◽  
Jizhong Han ◽  
Songlin Hu

Scripts represent knowledge of event sequences that can help text understanding. Script event prediction requires measuring the relation between an existing chain and a subsequent event. The dominant approaches focus either on the effects of individual events or on the influence of the chain sequence. However, considering only individual events loses many semantic relations within the event chain, and considering only the sequence of the chain introduces much noise. Based on our observations, both the individual events and the event segments within the chain can facilitate the prediction of the subsequent event. This paper develops a self-attention mechanism to focus on diverse event segments within the chain, so that the event chain is represented as a set of event segments. We utilize event-level attention to model the relations between subsequent events and individual events. Then, we propose chain-level attention to model the relations between subsequent events and event segments within the chain. Finally, we integrate event-level and chain-level attention to interact with the chain to predict what happens next. Comprehensive experimental results on the widely used New York Times corpus demonstrate that our model achieves better results than other state-of-the-art baselines under the Multi-Choice Narrative Cloze evaluation task.
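The two attention views can be sketched compactly. The following is an illustrative simplification, not the paper's exact formulation or parameterization: it scores a candidate event against the chain by combining event-level attention over individual events with chain-level attention over self-attended event segments.

```python
import numpy as np

def _softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def score_candidate(chain, candidate):
    """Sketch: chain is (num_events, dim), candidate is (dim,).
    Combines event-level attention (candidate vs. individual events) with
    chain-level attention (candidate vs. self-attended event segments)."""
    # event-level attention: relatedness of the candidate to each single event
    ev_weights = _softmax(chain @ candidate)
    event_score = float(ev_weights @ (chain @ candidate))

    # chain-level view: self-attention over the chain yields segment representations
    segments = _softmax(chain @ chain.T, axis=-1) @ chain
    seg_weights = _softmax(segments @ candidate)
    chain_score = float(seg_weights @ (segments @ candidate))

    return 0.5 * event_score + 0.5 * chain_score   # simple average of both views
```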


2016 ◽  
Vol 113 (49) ◽  
pp. E7871-E7879 ◽  
Author(s):  
Rumen Iliev ◽  
Joe Hoover ◽  
Morteza Dehghani ◽  
Robert Axelrod

People use more positive words than negative words. Referred to as “linguistic positivity bias” (LPB), this effect has been found across cultures and languages, prompting the conclusion that it is a panhuman tendency. However, although multiple competing explanations of LPB have been proposed, there is still no consensus on what mechanism(s) generate LPB, or even on whether it is driven primarily by universal cognitive features or by environmental factors. In this work we propose that LPB has remained unresolved because previous research has neglected an essential dimension of language: time. In four studies conducted with two independent, time-stamped text corpora (Google Books Ngrams and the New York Times), we found that LPB in American English has decreased during the last two centuries. We also observed dynamic fluctuations in LPB that were predicted by changes in the objective environment, i.e., war and economic hardship, and by changes in national subjective happiness. In addition to providing evidence that LPB is a dynamic phenomenon, these results suggest that cognitive mechanisms alone cannot account for the observed fluctuations in LPB. At the least, LPB likely arises from multiple interacting mechanisms involving subjective, objective, and societal factors. In addition to having theoretical significance, our results demonstrate the value of newly available data sources in addressing long-standing scientific questions.
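One common way to operationalize LPB in a time-stamped corpus is the share of positive tokens among all valenced tokens per time slice. The helper below is a minimal sketch under that assumption; the valence word lists are an assumed input (e.g. a sentiment lexicon), not the lexica used in the study.

```python
from collections import Counter

def positivity_ratio(tokens, positive_words, negative_words):
    """Sketch: fraction of positive tokens among all valenced tokens in one
    time slice of a corpus (returns NaN if no valenced tokens are present)."""
    counts = Counter(tokens)
    pos = sum(counts[w] for w in positive_words)
    neg = sum(counts[w] for w in negative_words)
    return pos / (pos + neg) if (pos + neg) else float("nan")
```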


2003 ◽  
Vol 15 (3) ◽  
pp. 98-105 ◽  
Author(s):  
Mark Galliker ◽  
Jan Herman
Keyword(s):  
New York ◽  

Summary. Using the representation of men and women in the Times and the New York Times as an example, a content-analysis procedure is presented that is particularly well suited to the study of electronically stored print media. Co-occurrence analysis is understood as the systematic examination of verbal combinations per counting unit. The problem of selecting the semantic units taken into account in the evaluation and presentation of the results is discussed.
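A co-occurrence analysis in this sense can be sketched as counting, per counting unit (e.g. per article), how often pairs of terms appear together. The function below is a minimal illustration of that idea, not the procedure used in the study.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(units):
    """Sketch: given an iterable of counting units (each a list of tokens),
    count how often each unordered pair of distinct terms co-occurs in a unit."""
    pairs = Counter()
    for unit in units:
        for a, b in combinations(sorted(set(unit)), 2):
            pairs[(a, b)] += 1
    return pairs
```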

