17. Text as Data: Finding Stories in Text Collections

2021 ◽ pp. 116-123
Author(s): Barbara Maseda
Author(s): Peter Organisciak ◽ Grace Therrell ◽ Maggie Ryan ◽ Benjamin MacDonald Schmidt

Author(s): Luke Gallagher ◽ Antonio Mallia ◽ J. Shane Culpepper ◽ Torsten Suel ◽ B. Barla Cambazoglu

2004 ◽ Vol 30 (1) ◽ pp. 75-93
Author(s): Haodi Feng ◽ Kang Chen ◽ Xiaotie Deng ◽ Weimin Zheng

We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, 'percent' and 'more and more' are not recognized as traditional Chinese words by some people; in our work, however, they are words because they are widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that appear directly before a string (predecessors) and directly after it (successors) to be important factors for determining the independence of the string, and we call such characters the accessors of the string. We count the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts as a measure of the string's context independence from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction; it is comparable to other iterative methods, and for long words it outperforms them.
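As an illustration, here is a minimal Python sketch of the accessor idea described above: for every candidate substring, collect the distinct characters that appear immediately before and after it in the corpus, and keep multi-character strings whose accessor counts are high. Scoring by the minimum of the two counts and the cutoff threshold are assumptions for illustration, not the paper's exact parameters.

```python
from collections import defaultdict

def accessor_variety(corpus, max_len=4, threshold=3):
    # For each candidate substring, collect the distinct characters that
    # appear immediately before it (predecessors) and after it (successors).
    predecessors = defaultdict(set)
    successors = defaultdict(set)
    for sentence in corpus:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                cand = sentence[i:j]
                if i > 0:
                    predecessors[cand].add(sentence[i - 1])
                if j < n:
                    successors[cand].add(sentence[j])
    # Score a candidate by the smaller of its two accessor counts
    # (an assumed scoring rule: high variety on both sides suggests the
    # string is used independently of its context).
    words = {}
    for cand in predecessors.keys() & successors.keys():
        score = min(len(predecessors[cand]), len(successors[cand]))
        if len(cand) > 1 and score >= threshold:
            words[cand] = score
    return words

# Toy usage on a tiny corpus; a real run would use a large collection
# such as the TREC 5 and TREC 6 documents mentioned above.
corpus = ["我们对中文文本集合中的词提取问题感兴趣",
          "词提取是中文文本处理的基础问题"]
print(accessor_variety(corpus, threshold=2))
```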


Author(s): Yazhong Zhang ◽ Hanbing Zhang ◽ Zhenying He ◽ Yinan Jing ◽ Kai Zhang ◽ ...

2008 ◽ Vol 1 (1) ◽ pp. 945-957
Author(s): Sanjay Agrawal ◽ Kaushik Chakrabarti ◽ Surajit Chaudhuri ◽ Venkatesh Ganti

2022 ◽ Vol 3 (1) ◽ pp. 1-16
Author(s): Haoran Ding ◽ Xiao Luo

Searching, reading, and finding information in massive medical text collections is challenging. With a typical biomedical search engine, it is not feasible to navigate each article to find critical information or keyphrases, and few tools visualize the phrases relevant to a query. Yet extracting keyphrases from each document is needed for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks, and their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether self-attention can be used to extract keyphrases from a document in an unsupervised manner and to identify relevancy between phrases, constructing a query relevancy phrase graph that visualizes the phrases of the search corpus by their relevancy and importance. A comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset; the unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset, demonstrating that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
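As a rough illustration of the idea, the Python sketch below scores candidate phrases by the self-attention they receive in BERT, using the Hugging Face transformers library. Averaging attention over all layers and heads and ranking bigrams by mean received attention are simplifying assumptions made here; this is not the paper's actual extraction pipeline or phrase-graph construction.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load BERT with attention outputs enabled.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = ("The built-in self-attention mechanism can capture the "
        "associations between words and phrases in a sentence.")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor
# per layer. Stack them and average over layers and heads -> (seq, seq).
attn = torch.stack(outputs.attentions).mean(dim=(0, 2)).squeeze(0)

# The attention a token *receives* is the column sum of the averaged matrix.
received = attn.sum(dim=0)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Score candidate bigrams (skipping the [CLS] and [SEP] special tokens)
# by the mean attention their tokens receive; higher = more salient.
candidates = {}
for i in range(1, len(tokens) - 2):
    phrase = " ".join(tokens[i:i + 2])
    candidates[phrase] = received[i:i + 2].mean().item()

for phrase, score in sorted(candidates.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{score:.4f}  {phrase}")
```

A fuller pipeline would restrict candidates to noun phrases and merge WordPiece subtokens before scoring; the bigram shortcut here only keeps the example compact.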

