17. Text as Data: Finding Stories in Text Collections

2021 ◽ pp. 116-123
Author(s): Barbara Maseda
Author(s): Peter Organisciak ◽ Grace Therrell ◽ Maggie Ryan ◽ Benjamin MacDonald Schmidt

Author(s): Luke Gallagher ◽ Antonio Mallia ◽ J. Shane Culpepper ◽ Torsten Suel ◽ B. Barla Cambazoglu

2004 ◽ Vol 30 (1) ◽ pp. 75-93
Author(s): Haodi Feng ◽ Kang Chen ◽ Xiaotie Deng ◽ Weimin Zheng

We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, 'percent' and 'more and more' are not recognized as traditional Chinese words by some people; in our work, however, they are words because they are widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that appear directly before a string (predecessors) and directly after it (successors) to be important factors for determining the independence of the string, and we call such characters the accessors of the string. We count the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts as a measure of the string's context independence from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction; it is comparable to other iterative methods, and for long words it outperforms them.
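As an illustration, here is a minimal Python sketch of the accessor idea described above: for every candidate substring, collect the distinct characters that appear immediately before and after it in the corpus, and keep multi-character strings whose accessor counts are high. Scoring by the minimum of the two counts and the cutoff threshold are assumptions for illustration, not the paper's exact parameters.

```python
from collections import defaultdict

def accessor_variety(corpus, max_len=4, threshold=3):
    # For each candidate substring, collect the distinct characters that
    # appear immediately before it (predecessors) and after it (successors).
    predecessors = defaultdict(set)
    successors = defaultdict(set)
    for sentence in corpus:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                cand = sentence[i:j]
                if i > 0:
                    predecessors[cand].add(sentence[i - 1])
                if j < n:
                    successors[cand].add(sentence[j])
    # Score a candidate by the smaller of its two accessor counts
    # (an assumed scoring rule: high variety on both sides suggests the
    # string is used independently of its context).
    words = {}
    for cand in predecessors.keys() & successors.keys():
        score = min(len(predecessors[cand]), len(successors[cand]))
        if len(cand) > 1 and score >= threshold:
            words[cand] = score
    return words

# Toy usage on a tiny corpus; a real run would use a large collection
# such as the TREC 5 and TREC 6 documents mentioned above.
corpus = ["我们对中文文本集合中的词提取问题感兴趣",
          "词提取是中文文本处理的基础问题"]
print(accessor_variety(corpus, threshold=2))
```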


Author(s): Yazhong Zhang ◽ Hanbing Zhang ◽ Zhenying He ◽ Yinan Jing ◽ Kai Zhang ◽ ...

2008 ◽ Vol 1 (1) ◽ pp. 945-957
Author(s): Sanjay Agrawal ◽ Kaushik Chakrabarti ◽ Surajit Chaudhuri ◽ Venkatesh Ganti

2022 ◽ Vol 3 (1) ◽ pp. 1-16
Author(s): Haoran Ding ◽ Xiao Luo

Searching, reading, and finding information in massive medical text collections is challenging. With a typical biomedical search engine, it is not feasible to navigate each article to find critical information or keyphrases, and few tools visualize the phrases relevant to a query. Yet extracting keyphrases from each document is needed for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks, and their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether self-attention can be used to extract keyphrases from a document in an unsupervised manner and to identify relevancy between phrases, constructing a query relevancy phrase graph that visualizes the phrases of the search corpus by their relevancy and importance. A comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset; the unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset, demonstrating that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
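As a rough illustration of the idea, the Python sketch below scores candidate phrases by the self-attention they receive in BERT, using the Hugging Face transformers library. Averaging attention over all layers and heads and ranking bigrams by mean received attention are simplifying assumptions made here; this is not the paper's actual extraction pipeline or phrase-graph construction.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load BERT with attention outputs enabled.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = ("The built-in self-attention mechanism can capture the "
        "associations between words and phrases in a sentence.")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor
# per layer. Stack them and average over layers and heads -> (seq, seq).
attn = torch.stack(outputs.attentions).mean(dim=(0, 2)).squeeze(0)

# The attention a token *receives* is the column sum of the averaged matrix.
received = attn.sum(dim=0)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Score candidate bigrams (skipping the [CLS] and [SEP] special tokens)
# by the mean attention their tokens receive; higher = more salient.
candidates = {}
for i in range(1, len(tokens) - 2):
    phrase = " ".join(tokens[i:i + 2])
    candidates[phrase] = received[i:i + 2].mean().item()

for phrase, score in sorted(candidates.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{score:.4f}  {phrase}")
```

A fuller pipeline would restrict candidates to noun phrases and merge WordPiece subtokens before scoring; the bigram shortcut here only keeps the example compact.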

