A Comparison of Search Functionalities in Several Tools Used for Searching within Digital Text Collections

2021 ◽  
Vol 58 (1) ◽  
pp. 679-681
Author(s):  
Liezl H. Ball ◽  
Theo J.D. Bothma
Author(s):  
Truus Kruyt

This paper discusses the advantages of encoded digital text over printed text, from a researcher's perspective. The traditional notion of text corpus as a well-considered collection of texts is related to the huge amounts of digital texts that are currently available on the web. After examples of useful digitalization initiatives and available digital resources, information is given about the users and uses of the text corpora stored at the Institute for Dutch Lexicology. Attention is paid to some obstacles in building or using text collections. The conclusion is that up till now the digital medium primarily facilitates research rather than evokes new linguistic research questions.


2008 ◽  
Vol 57 (1) ◽  
pp. 52-71 ◽  
Author(s):  
Timm Lehmberg ◽  
Dr. Georg Rehm ◽  
Dr. Andreas Witt ◽  
Felix Zimmermann

2016 ◽  
Vol 27 (2) ◽  
pp. 156
Author(s):  
Prihantoro Prihantoro

This research addresses two problems: 1) what role lexicogrammar plays in determining the polarity of the F-word, and 2) how to formalize it for corpus processing. The data are obtained from the Corpus of Contemporary American English (COCA). In this corpus, the F-word is shown to be highest in frequency as compared to its distribution across corpora. Corpus methodology is applied by sending queries to the COCA interface to retrieve F-words. The combinations of tokens surrounding F-words yield the phrase and clause units that accompany them, which are significant cues for determining F-word polarity. The polarity is later shown to be not necessarily negative. I also designed a computational resource that allows F-words to be retrieved offline, so that users can apply it to any digital text collection.


2020 ◽  
Author(s):  
Liezl Ball ◽  
Theo Bothma

Introduction. With the increase in the availability of digital text collections for humanities researchers, tools to enable enhanced retrieval are required. If words with very specific properties could be retrieved from a text collection, more accurate linguistic and other analyses could be made. There is a range of properties and metadata that could be specified for retrieval, from morphological data up to bibliographic data. Furthermore, the bibliographic data should not only be recorded at item level but also extended to the text level. For example, in an anthology each section could be encoded with the author of that section. Such extended metadata will enable fine-grained retrieval. Method. In this study, current tools were evaluated to determine to what extent they allow users to retrieve words with specific properties from a text collection. Analysis. The analysis is limited to the following criteria: interface design, metadata, search options, filtering and search results. Results. Currently, it is not possible for a user to retrieve words with specific properties from a text collection. Conclusion. An extended set of metadata should be used to encode text to enable retrieval of words on a fine-grained level.
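
The idea of text-level metadata can be illustrated with a minimal sketch. It assumes each word is stored as a record that carries both linguistic properties and the bibliographic data of the section it occurs in. The `Token` fields and the `search` helper are hypothetical names chosen for illustration; they are not taken from the study or from any of the evaluated tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word occurrence with word-level and text-level metadata (hypothetical schema)."""
    form: str            # the word as it appears in the text
    lemma: str           # dictionary form
    pos: str             # part-of-speech tag
    section_author: str  # author of the section, not only of the whole item
    item_title: str      # title of the containing item, e.g. an anthology


def search(tokens, **criteria):
    """Return the tokens whose fields equal every given criterion."""
    return [
        t for t in tokens
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]
```

With records of this kind, a call such as `search(tokens, lemma="run", section_author="Author A")` would return only the occurrences of that lemma in sections written by that author, which is the kind of fine-grained retrieval the abstract argues for.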


Author(s):  
Peter Organisciak ◽  
Grace Therrell ◽  
Maggie Ryan ◽  
Benjamin MacDonald Schmidt

Author(s):  
Luke Gallagher ◽  
Antonio Mallia ◽  
J. Shane Culpepper ◽  
Torsten Suel ◽  
B. Barla Cambazoglu

2004 ◽  
Vol 30 (1) ◽  
pp. 75-93 ◽  
Author(s):  
Haodi Feng ◽  
Kang Chen ◽  
Xiaotie Deng ◽  
Weimin Zheng

We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, ‘percent’ and ‘more and more’ are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independence of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
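
The counting of distinct predecessors and successors can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_len` limit, the sentence-boundary markers, and the rule of keeping a candidate when the smaller of its two counts reaches a threshold are all assumptions made here. Latin letters stand in for Chinese characters.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Map each candidate string to (distinct predecessors, distinct successors)."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            # Candidates are substrings of 2..max_len characters.
            for j in range(i + 2, min(i + max_len, n) + 1):
                cand = sent[i:j]
                # Assumption: a sentence boundary counts as one accessor symbol.
                preds[cand].add(sent[i - 1] if i > 0 else "<s>")
                succs[cand].add(sent[j] if j < n else "</s>")
    return {c: (len(preds[c]), len(succs[c])) for c in preds}


def extract_words(sentences, threshold=2, max_len=4):
    """Keep candidates whose smaller accessor count reaches the threshold (assumed rule)."""
    av = accessor_variety(sentences, max_len)
    return sorted(c for c, (p, s) in av.items() if min(p, s) >= threshold)
```

For the toy input `["abcd", "xbcy", "zbcw"]`, the string `bc` has three distinct predecessors and three distinct successors, while every other candidate has only one of each, so `extract_words(..., threshold=3)` keeps only `bc`.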

