Measuring the Extent of the Synonym Problem in Full-Text Searching

2008 ◽  
Vol 3 (4) ◽  
pp. 18
Author(s):  
Jeffrey Beall ◽  
Karen Kafadar

Objective – This article measures the extent of the synonym problem in full-text searching. The synonym problem occurs when a search misses documents because it was phrased with a synonym rather than a more familiar term. Methods – We took a sample of 90 single-word synonym pairs and searched for each word in the pair, both singly and jointly, in the Yahoo! database. We determined the number of web sites that were missed when one term of a pair, but not the other, appeared in the search field. Results – Depending on how common the synonym is, the percentage of missed web sites varied from almost 0% to almost 100%. A search using a very uncommon synonym ("diaconate") missed a very high percentage of web pages (95%), whereas a search using the more common term ("deacons") missed only 9%. When both terms in a pair were nearly equal in usage ("cooks" and "chefs"), a search on one term but not the other missed almost half the relevant web pages. Conclusion – Our results indicate that search engines would benefit greatly from automatic synonym expansion, not only for user-specified terms but also for high-usage synonyms. They also demonstrate the value of information retrieval systems that use controlled vocabularies and cross references to generate search results.
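The missed-page percentages above follow directly from raw hit counts for single and joint (OR) searches. A minimal sketch of that calculation; the hit counts below are hypothetical illustrations, not the paper's data:

```python
def missed_fraction(hits_single: int, hits_joint: int) -> float:
    """Fraction of relevant pages missed when searching one term only.

    hits_single: pages matching just one term of the synonym pair
    hits_joint:  pages matching either term (term1 OR term2)
    """
    return 1 - hits_single / hits_joint

# Hypothetical counts for one synonym pair searched singly and jointly.
hits_common, hits_rare, hits_either = 910_000, 50_000, 1_000_000

print(round(missed_fraction(hits_rare, hits_either), 2))    # rare synonym misses most pages
print(round(missed_fraction(hits_common, hits_either), 2))  # common term misses few
```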

Author(s):  
Bahare Hashemzahde ◽  
Majid Abdolrazzagh-Nezhad

The accuracy of keyword extraction is a leading factor in information retrieval systems and in marketing. In the real world, text is produced in a variety of languages, and the ability to extract keywords using information from several languages improves extraction accuracy. In this paper, the information available in all languages is used to improve a traditional keyword extraction algorithm on multilingual text. The proposed keyword extraction procedure is an unsupervised algorithm designed to select a word as a keyword of a given text only if, in addition to ranking highly in that language, it also ranks highly on the keyword criteria in the other languages. To this end, the average TF-IDF of each candidate word was calculated over its own language and the other languages, and the words with the highest average TF-IDF were chosen as the extracted keywords. The results obtained indicate that the accuracies of the term frequency-inverse document frequency (TF-IDF) algorithm, the graph-based algorithm, and the improved proposed algorithm on multilingual texts are 80%, 60.65%, and 91.3%, respectively.
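The averaging step can be sketched as follows. The per-language TF-IDF scores are hypothetical, and the alignment of a word with its translations under one canonical key is assumed to be given (the paper does not publish its exact data structures):

```python
def multilingual_keywords(scores_by_lang, top_k=2):
    """Rank candidate words by their TF-IDF averaged over all languages.

    scores_by_lang maps language -> {word: tfidf}; each word's translations
    are assumed to be aligned under one canonical key.
    """
    words = set().union(*(s.keys() for s in scores_by_lang.values()))
    avg = {w: sum(s.get(w, 0.0) for s in scores_by_lang.values()) / len(scores_by_lang)
           for w in words}
    return sorted(avg, key=avg.get, reverse=True)[:top_k]

# Hypothetical TF-IDF scores for the same candidates in two languages.
scores = {
    "en": {"retrieval": 0.42, "keyword": 0.31, "text": 0.05},
    "fa": {"retrieval": 0.38, "keyword": 0.35, "text": 0.02},
}
print(multilingual_keywords(scores))  # → ['retrieval', 'keyword']
```

A word strong in only one language ("text" here) is pulled down by its low scores elsewhere, which is the intended filtering effect.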


Author(s):  
Antonio Picariello

Information retrieval can benefit greatly from users' feedback. The user dimension is therefore a relevant component that must be taken into account when planning and implementing real information retrieval systems. In this chapter, we first describe several concepts related to relevance feedback methods, and then propose a novel information retrieval technique that uses relevance feedback to improve accuracy in an ontology-based system. In particular, we combine semantic information from a general knowledge base with statistical information using relevance feedback. Several experiments and results are presented using a test set of Web pages.
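The statistical side of relevance feedback is classically realized by Rocchio's query update; a minimal vector-space sketch with made-up weights (the chapter's method additionally uses ontology semantics, which is not shown here):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query vector toward the centroid of
    relevant documents and away from non-relevant ones.
    All vectors are equal-length lists of term weights."""
    n_r, n_nr = len(relevant), len(nonrelevant)
    updated = []
    for i, q in enumerate(query):
        centroid_r = sum(d[i] for d in relevant) / n_r if n_r else 0.0
        centroid_nr = sum(d[i] for d in nonrelevant) / n_nr if n_nr else 0.0
        updated.append(alpha * q + beta * centroid_r - gamma * centroid_nr)
    return updated

# Toy 3-term vocabulary; one relevant and one non-relevant judged document.
q = [1.0, 0.0, 0.0]
print(rocchio(q, relevant=[[0.0, 1.0, 0.0]], nonrelevant=[[0.0, 0.0, 1.0]]))
```

The updated query gains weight on terms from judged-relevant pages, so a re-run retrieves more documents like them.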


Author(s):  
Antonio Picariello ◽  
Antonio M. Rinaldi

The user dimension is a crucial component of the information retrieval process, and for this reason it must be taken into account when planning and implementing information retrieval systems. In this paper we present a technique based on relevance feedback to improve accuracy in an ontology-based information retrieval system. Our method combines semantic information from a general knowledge base with statistical information using relevance feedback. Several experiments and results are presented using a test set of Web pages.


2014 ◽  
Vol 602-605 ◽  
pp. 3706-3711
Author(s):  
Hao Chen ◽  
Qin Qun Chen ◽  
Shao Xia YE

In recent years, much research has been devoted to the analysis of 128-bit architectures; on the other hand, few have evaluated the construction of wide-area networks. In fact, few cyberneticists would disagree with the understanding of IPv6. This is an important point to understand. We describe an autonomous tool for developing compilers, which we call ADZ.


Author(s):  
Maria Indrawan ◽  
Seng Loke

The debate on the effectiveness of ontology in solving semantic problems has intensified recently in many domains of information technology. One side of the debate accepts ontology as a suitable solution. The other side argues that ontology is far from an ideal solution to the semantic problem. This article explores this debate in the area of information retrieval. Several past approaches were examined, and a new approach was investigated to test the effectiveness of a generic ontology such as WordNet in improving the performance of information retrieval systems. The tests and the analysis of the experiments suggest that WordNet is far from an ideal solution to semantic problems in information retrieval. However, several observations are made and reported in this article that allow research on ontology for information retrieval to move in the right direction.
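The generic-ontology approach under test amounts to expanding queries with terms drawn from WordNet-style synonym sets. A minimal sketch using a hand-made synonym table in place of a real WordNet lookup; the table entries are illustrative assumptions:

```python
# Hypothetical miniature synonym table standing in for WordNet synsets.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "doctor": {"physician"},
}

def expand_query(terms):
    """Add every known synonym of each query term. This expansion is
    sense-blind: it cannot tell which meaning the user intended, which is
    one reason generic ontologies can hurt retrieval precision."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

print(sorted(expand_query(["car", "crash"])))  # → ['auto', 'automobile', 'car', 'crash']
```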


Author(s):  
José Antonio García-Díaz ◽  
Rafael Valencia-García

Satirical content on social media is hard to distinguish from real news, misinformation, hoaxes or propaganda when there are no clues as to the medium in which the news was originally written. It is important, therefore, to provide information retrieval systems with mechanisms to identify which results are legitimate and which are misleading. Our contribution to satire identification is twofold. On the one hand, we release the Spanish SatiCorpus 2021, a balanced dataset that contains satirical and non-satirical documents. On the other hand, we conduct an extensive evaluation of this dataset with linguistic features and embedding-based features. All feature sets are evaluated separately and combined using different strategies. Our best result, an accuracy of 97.405%, is achieved with a combination of the linguistic features and BERT. In addition, we compare our proposal with existing Spanish datasets on satire and irony.
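Two common ways to combine feature sets of this kind are early fusion (concatenating feature vectors before classification) and late fusion (averaging the scores of separately trained models). A minimal sketch; the vectors and scores below are made up, and BERT itself is not invoked — the paper does not specify which fusion strategy produced its best result:

```python
def early_fusion(linguistic, embedding):
    """Concatenate linguistic features with embedding features so that a
    single downstream classifier sees both views of the document."""
    return linguistic + embedding

def late_fusion(scores):
    """Average the satire scores predicted by separately trained models."""
    return sum(scores) / len(scores)

ling = [0.2, 0.7]            # hypothetical linguistic feature vector
emb = [0.1, 0.4, 0.9]        # hypothetical slice of an embedding vector
combined = early_fusion(ling, emb)   # 5-dimensional combined vector
print(combined)
print(late_fusion([0.92, 0.88]))     # combined satire score from two models
```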


2020 ◽  
Vol 40 (02) ◽  
pp. 437-444
Author(s):  
Padmavathi T

Current methods of searching and information retrieval are imprecise, often yielding results spread over tens of thousands of web pages, and extracting the information actually needed often requires extensive manual browsing of the retrieved documents. To address these drawbacks, this paper introduces an implementation of an ontology-based information retrieval system in the field of food science and compares it with conventional information systems. The ontology of the Food Semantic Web Knowledge Base (FSWKB) was built with the Protégé framework, which supports two main ontology models through the editors Protégé-Frames and Protégé-OWL. The FSWKB is composed of two heterogeneous ontologies, which are merged and processed on a separate server application using Apache Jena Fuseki, a SPARQL server offering a SPARQL endpoint. The experimental results indicated that ontology-based information systems are more effective in their retrieval capability than conventional information retrieval systems. Retrieval effectiveness was measured in terms of precision and recall: traditional search achieved average precision and recall of 0.92 and 0.18, while the ontology-based test achieved average precision and recall of 0.96 and 0.97.
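The effectiveness figures rest on the standard definitions of precision and recall over retrieved and relevant document sets. A minimal sketch with hypothetical result sets (not the paper's data):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical result sets for a single query.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # precision 0.5, recall ≈ 0.67
```

A system like the traditional baseline above scores high precision but very low recall when it returns few of the many relevant documents.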


2016 ◽  
pp. 044-050
Author(s):  
A.M. Glybovets

Methods of data extraction and analysis form a relatively new and promising branch of computer science that has found application in information retrieval systems. We present an algorithm for searching for relationships and dependencies in collections of Web pages. The algorithm does not itself retrieve relevant resources; this function is performed by the search engine, which also carries out cleaning, integration, and data selection. Distinctive features of the algorithm are its use of an existing data store (a search engine or data storage), language independence, and ease of implementation.


2020 ◽  
Vol 10 (3) ◽  
pp. 57-73
Author(s):  
Prem Sagar Sharma ◽  
Divakar Yadav

Web-based information retrieval systems called search engines have made things easy for information seekers, but they still provide no guarantees about the relevance of the information returned to users. Information retrieval systems provide information to the user based on certain retrieval criteria. Because of the large size of the WWW, it is very common for a large number of documents to be identified as related to a particular domain. Therefore, to help users find the best matching documents, search engines employ a ranking mechanism. In this article, an improved architecture for an information retrieval system is proposed. The proposed system keeps a query log for each user query and stores the results returned to the user for that query. The system also provides relevant results by analyzing the content of the pages retrieved for the user query.
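The query log described above can be sketched as a per-query store of previously served results that later lookups consult. The class name, fields, and the boost-by-history reranking rule here are illustrative assumptions, not the paper's exact design:

```python
from collections import defaultdict

class QueryLog:
    """Store the result list served for each user query and reuse it to
    boost previously served pages when the same query recurs."""
    def __init__(self):
        self.served = defaultdict(list)   # query -> list of served result lists

    def record(self, query, results):
        self.served[query].append(results)

    def rerank(self, query, candidates):
        # Stable sort: candidates seen in earlier result lists for this
        # query come first, otherwise original order is preserved.
        seen = {url for results in self.served[query] for url in results}
        return sorted(candidates, key=lambda url: url not in seen)

log = QueryLog()
log.record("ontology ir", ["a.com", "b.com"])
print(log.rerank("ontology ir", ["c.com", "b.com", "a.com"]))  # → ['b.com', 'a.com', 'c.com']
```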

