From Frequencies to Vectors

2020 ◽  
Author(s):  
Bernhard Rieder

This chapter investigates early attempts in information retrieval to tackle the full text of document collections. Underpinning a large number of contemporary applications, from search to sentiment analysis, the concepts and techniques pioneered by Hans Peter Luhn, Gerard Salton, Karen Spärck Jones, and others involve particular framings of language, meaning, and knowledge. They also introduce some of the fundamental mathematical formalisms and methods running through information ordering, preparing the extension to digital objects other than text documents. The chapter discusses the considerable technical expressivity that comes out of the sprawling landscape of research and experimentation that characterizes the early decades of information retrieval. This includes the emergence of the conceptual construct and intermediate data structure that is fundamental to most algorithmic information ordering: the feature vector.
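As a rough illustration of the feature vector mentioned above, the following minimal sketch turns term frequencies into vectors and compares them with cosine similarity. The tokenizer, the toy documents, and all function names are assumptions made for this example and are not taken from the chapter.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(docs):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def to_vector(text, vocabulary):
    """Represent a text as raw term frequencies, one dimension per vocabulary term."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy collection, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "vectors order documents",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Once documents are vectors over a shared vocabulary, ordering them by similarity to a query reduces to simple arithmetic on those vectors.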

2009 ◽  
pp. 931-939
Author(s):  
László Kovács ◽  
Domonkos Tikk

Current databases are able to store several terabytes of free-text documents. From the user’s viewpoint, the main purpose of a database is efficient information retrieval. In the case of textual data, information retrieval mostly concerns the selection and ranking of documents. We present here Oracle’s particular solution: to make full-text querying more efficient, a special engine was developed that prepares full-text queries and provides a set of language-specific and semantics-specific query operators.


2016 ◽  
Vol 3 (4) ◽  
pp. 64-82 ◽  
Author(s):  
Shweta Gupta ◽  
Sunita Yadav ◽  
Rajesh Prasad

Document retrieval plays a crucial role in retrieving relevant documents. Relevancy depends upon the occurrences of query keywords in a document. Many documents share similar key terms, and hence they need to be indexed. Most indexing techniques are based on either an inverted index or a full-text index. Inverted indexes create lists and support word-based pattern queries, while full-text indexes handle queries comprising any sequence of characters rather than just words. Problems arise when text cannot be separated into words in some western languages. There are also difficulties with the space used by compressed versions of full-text indexes. Recently, a data structure called the wavelet tree has become popular in text compression and indexing. It indexes the words or characters of text documents and helps retrieve top-ranked documents more efficiently. This paper presents a review of the most recent efficient indexing techniques used in document retrieval.
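The word-based side of this contrast can be sketched in a few lines. This is a minimal inverted index, assuming a simple alphanumeric tokenizer and a toy collection. It is not the implementation reviewed in the paper, and it does not cover full-text indexes or wavelet trees.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids (list positions) that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy collection, for illustration only.
docs = [
    "wavelet trees index text",
    "inverted index lists words",
    "full-text index handles characters",
]
index = build_inverted_index(docs)
```

Because lookups go through whole tokens, a query for an arbitrary character sequence such as `"avele"` finds nothing here, which is the limitation that full-text indexes address.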


2019 ◽  
Vol 8 (3) ◽  
pp. 6634-6643 ◽  

Opinion mining and sentiment analysis are valuable for extracting useful subjective information from text documents. Predicting customers’ opinions of Amazon products has several benefits, such as reducing customer churn, agent monitoring, handling multiple customers, tracking overall customer satisfaction, quick escalations, and upselling opportunities. However, identifying users’ sentiments in large datasets is a challenging task for researchers because the data are unstructured and contain slang, misspellings, and abbreviations. To address this problem, a new system is proposed in this research study. The proposed system comprises four major phases: data collection, pre-processing, keyword extraction, and classification. Initially, the input data were collected from the Amazon customer review dataset. After the data were collected, pre-processing was carried out to enhance their quality. The pre-processing phase comprises three steps: lemmatization, review spam detection, and removal of stop-words and URLs. Then, a topic modelling approach, Latent Dirichlet Allocation (LDA), was applied along with modified Possibilistic Fuzzy C-Means (PFCM) to extract the keywords and to help identify the topics concerned. The extracted keywords were classified into three classes (positive, negative, and neutral) by applying a machine learning classifier, a Convolutional Neural Network (CNN). The experimental outcome showed that the proposed system improved the accuracy of sentiment analysis by 6-20% relative to the existing systems.


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
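The first step described above, building a document vector as a linear weighting of word vectors, might look roughly like the sketch below. The abstract does not name its two weighting methods, so the uniform average and the caller-supplied per-token weights used here are assumptions, as are the function name and the toy two-dimensional vectors.

```python
def document_vector(tokens, word_vectors, weights=None):
    """Weighted average of the word vectors of the tokens in a document.

    With weights=None every known token counts equally (uniform averaging).
    Otherwise weights maps a token to a non-negative weight, for example an
    IDF-style score; both schemes are assumptions, not the paper's methods.
    Tokens without a word vector are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for i, x in enumerate(word_vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0:
        return total  # no known or weighted tokens: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical toy "pre-trained" vectors, for illustration only.
word_vectors = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}
uniform = document_vector(["good", "movie", "unknown"], word_vectors)
weighted = document_vector(
    ["good", "movie"], word_vectors, weights={"good": 3.0, "movie": 1.0}
)
```

The resulting fixed-length vectors are what a downstream classifier would consume; how the sentiment signal is propagated back into the word vectors is specific to the EWE model and is not sketched here.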


Author(s):  
Ida Stadig ◽  
Therese Svanberg

Objectives: This article aims to provide a brief review of information retrieval and hospital-based health technology assessment (HB-HTA) and describe library experiences and working methods at a regional HB-HTA center from the center's inception to the present day. Methods: For this brief literature review, searches in PubMed and LISTA were conducted to identify studies reporting on HB-HTA and information retrieval. The description of the library's involvement in the HTA center and its working methods is based on the authors’ experience and internal and/or unpublished documents. Results: Region Västra Götaland is the second largest healthcare region in Sweden and has had a regional HB-HTA center since 2007 (HTA-centrum). Assessments are performed by clinicians supported by HTA methodologists. The medical library at Sahlgrenska University Hospital works closely with HTA-centrum, with one HTA librarian responsible for coordinating the work. Conclusion: In the literature on HB-HTA, we found limited descriptions of the role librarians and information specialists play in different units. The librarians at HTA-centrum play an important role, not only in literature searching but also in abstract and full-text screening.


2019 ◽  
Vol 27 (3) ◽  
pp. 449-456
Author(s):  
James R Rogers ◽  
Hollis Mills ◽  
Lisa V Grossman ◽  
Andrew Goldstein ◽  
Chunhua Weng

Scientific commentaries are expected to play an important role in evidence appraisal, but it is unknown whether this expectation has been fulfilled. This study aims to better understand the role of scientific commentary in evidence appraisal. We queried PubMed for all clinical research articles with accompanying comments and extracted corresponding metadata. Five percent of clinical research studies (N = 130 629) received postpublication comments (N = 171 556), resulting in 178 882 comment–article pairings, with 90% published in the same journal. We obtained 5197 full-text comments for topic modeling and exploratory sentiment analysis. Topics were generally disease specific, with only a few topics relevant to the appraisal of studies, which were highly prevalent in letters. Of a random sample of 518 full-text comments, 67% had a supportive tone. Based on our results, published commentary, with the exception of letters, most often highlights or endorses previous publications rather than serving as a prominent mechanism for critical appraisal.


Author(s):  
I. P. Komenda

The publication deals with the initial stages of adding to the electronic catalogue the bibliographic records of electronic periodicals from the eLIBRARY.RU platform and of electronic serials to which the Central Science Library of the NAS of Belarus subscribes. The addition of full-text documents and tables of contents of periodicals to bibliographic records is also considered.

