An Optimum Approach for Preprocessing of Web User Query

Author(s):  
Sunny Sharma ◽  
Sunita Sunita ◽  
Arjun Kumar ◽  
Vijay Rana

The emergence of Web technology generated a massive amount of raw data by enabling Internet users to post their opinions, comments, and reviews on the web. Extracting useful information from this raw data can be a very challenging task, and search engines play a critical role in these circumstances. User queries remain a central problem for search engines, so a preprocessing step is essential. In this paper, we present a framework for natural language preprocessing for efficient data retrieval, covering operations required for effective retrieval such as elongated word handling, stop word removal, and stemming. This manuscript starts by building a manually annotated dataset and then takes the reader through the detailed steps of the process. Experiments are conducted for individual stages of this process to examine the accuracy of the system.
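A minimal sketch of such a preprocessing pipeline is shown below. It is an illustration only, assuming a toy stop-word list and a naive suffix-stripping stemmer in place of the paper's actual resources (the manually annotated dataset and the exact rules are not reproduced here):

```python
import re

# Illustrative resources: a toy stop-word list and naive stemmer stand in
# for the paper's actual stop-word list and stemming rules.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "and", "on"}

def collapse_elongation(token):
    """Handle elongated words (e.g. 'soooo' -> 'so') by collapsing runs
    of three or more identical letters down to a single letter."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(query):
    tokens = re.findall(r"[a-z]+", query.lower())        # tokenization
    tokens = [collapse_elongation(t) for t in tokens]    # elongated words
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]               # stemming

print(preprocess("The moooovie was amaaazing and exciting"))
# -> ['movie', 'amaz', 'excit'] (naive stemming is intentionally crude)
```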

2019 ◽  
Vol 9 (3) ◽  
pp. 23-47
Author(s):  
Sumita Gupta ◽  
Neelam Duhan ◽  
Poonam Bansal

With the rapid growth of digital information and user needs, it becomes imperative to retrieve relevant, domain- or topic-specific documents for a user query quickly. A focused crawler plays a vital role in digital libraries by crawling the web so that researchers can easily explore domain-specific search result lists and find the desired content for a query. In this article, a focused crawler is proposed for online digital library search engines, which considers the meta-data of the query in order to retrieve the corresponding document or other relevant but missing information (e.g., paid publications from ACM, IEEE, etc.) for the user query. Different query strategies are formed from the meta-data and submitted to different search engines with the aim of finding relevant information that is otherwise missing. The results returned by these search engines are filtered and then used for further crawling of the Web.
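As a rough sketch of meta-data-driven query strategies, consider the following; the metadata field names and the combination scheme are assumptions for illustration, not the authors' exact design:

```python
from itertools import combinations

def build_query_strategies(meta):
    """Form alternative query strings from publication meta-data
    (field names here are illustrative, not the authors' schema)."""
    parts = [meta.get("title", ""),
             " ".join(meta.get("authors", [])),
             meta.get("venue", ""),
             str(meta.get("year", ""))]
    parts = [p for p in parts if p]
    queries = []
    for r in range(len(parts), 0, -1):          # most specific first
        for combo in combinations(parts, r):
            queries.append(" ".join(combo))
    return queries

meta = {"title": "Focused crawler for digital libraries",
        "authors": ["S. Gupta", "N. Duhan"], "year": 2019}
# Each strategy would be submitted to a different search engine and the
# returned results filtered before further crawling.
for q in build_query_strategies(meta)[:3]:
    print(q)
```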


Author(s):  
Cláudio Elízio Calazans Campelo ◽  
Cláudio de Souza Baptista ◽  
Ricardo Madeira Fernandes

It is well known that documents available on the Web are extremely heterogeneous in several aspects, such as the use of various idioms and different formats to represent the contents, besides external factors like source reputation, refresh frequency, and so forth (Page & Brin, 1998). Altogether, these factors increase the complexity of Web information retrieval systems. Superficially, the traditional search engines available on the Web today retrieve documents that contain the keywords supplied by users. Nevertheless, among the variety of search possibilities, it is evident that users need a process involving more sophisticated analysis; for example, temporal or spatial contextualization might be considered.

In these keyword-based search engines, for instance, a Web page containing the phrase “…due to the company arrival in London, a thousand java programming jobs will be open…” would not be found by the query “jobs programming England” unless the word “England” appeared in another phrase of the page. The explanation for this is that the term “London” is treated merely as another word, without regard to its geographical meaning. In a spatial search engine, the expected behavior would be to return the page described above, since the system should have information indicating that the term “London” refers to a city located in a country referred to by the term “England.” This result would only be feasible in a traditional search engine if the user repeatedly submitted searches for all possible sub-regions of England (e.g., its cities). As the example suggests, for many user searches the most interesting results are those related to certain geographical regions.

A variety of feature extraction and automatic document classification techniques have been proposed; however, acquiring the geographical features of Web pages involves some peculiar complexities, such as ambiguity (e.g., many places with the same name, various names for a single place, things with place names, etc.). Moreover, a Web page can refer to a place that contains or is contained by the one given in the user query, which implies knowing the different region topologies used by the system.

Many features related to geographical context can be added to the process of elaborating a relevance ranking for returned documents. For example, a document can be more relevant than another if its content refers to a place closer to the user's location. Nonetheless, in spatial search engines there are more complex issues to consider, because the spatial dimension bears on ranking elaboration. Jones, Alani, and Tudhope (2001) propose combining the Euclidean distance between place centroids with hierarchical distances to generate a hybrid spatial distance that may be used in the relevance ranking of returned documents.

Further important issues are the indexing mechanisms and query processing. In general, solutions try to combine well-known textual indexing techniques (e.g., inverted files) with spatial indexing mechanisms. Regarding the user interface, spatial search engines are more complex because users need to specify regions of interest and possible spatial relationships in addition to keywords. To visualize the results, it is helpful to use digital map resources alongside textual information.
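The hybrid spatial distance mentioned above can be illustrated with a short sketch. This is a minimal illustration in the spirit of Jones, Alani, and Tudhope (2001), assuming a toy gazetteer with made-up centroids and containment hierarchy; the place data, weighting scheme, and function names are illustrative:

```python
import math

# Toy gazetteer: place -> ((lon, lat) centroid, parent region).
# Coordinates and hierarchy are illustrative, not real gazetteer data.
GAZETTEER = {
    "London":  ((-0.13, 51.51), "England"),
    "Leeds":   ((-1.55, 53.80), "England"),
    "England": ((-1.17, 52.36), "UK"),
    "UK":      ((-2.00, 54.00), None),
}

def ancestors(place):
    chain = []
    while place is not None:
        chain.append(place)
        place = GAZETTEER[place][1]
    return chain

def hierarchical_distance(a, b):
    """Hops from a and b up to their nearest common ancestor."""
    ca, cb = ancestors(a), ancestors(b)
    common = next(p for p in ca if p in cb)
    return ca.index(common) + cb.index(common)

def hybrid_distance(a, b, w=0.5):
    """Blend of centroid (Euclidean) and hierarchical distance;
    the weight w is an assumed parameter."""
    (x1, y1), (x2, y2) = GAZETTEER[a][0], GAZETTEER[b][0]
    euclid = math.hypot(x1 - x2, y1 - y2)
    return w * euclid + (1 - w) * hierarchical_distance(a, b)

print(hybrid_distance("London", "Leeds"))
```

In practice the two components live on different scales (map units versus hierarchy hops), so they would need normalization before blending.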


Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

In the present chapter we report on some extensions of the work presented in the first edition of the Encyclopedia of Data Mining. In Caramia and Felici (2005) we described a method based on clustering and a heuristic search method, based on a genetic algorithm, to extract pages with relevant information for a specific user query in a thematic search engine. Starting from these results, we have extended the research work to address some issues related to the semantic aspects of the search, focusing on the keywords that are used to establish the similarity among the pages that result from the query. Complete details on this method, omitted here for brevity, can be found in Caramia and Felici (2006).

Search engine technologies remain a strong research topic, as new problems and new demands from the market and the users arise. The process of switching from quantity (maintaining and indexing large databases of web pages and quickly selecting pages matching some criterion) to quality (identifying pages of high quality for the user), already highlighted in Caramia and Felici (2005), has not been interrupted but has gained further momentum, motivated by the natural evolution of internet users, who are more selective in their choice of search tool and willing to pay the price of providing extra feedback to the system and waiting longer to have their queries better matched. In this framework, several researchers have considered the use of data mining and optimization techniques, often referred to as web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, and Faloutsos, 2003 and Zaïane, Srivastava, Spiliopoulou, and Masand, 2002).

The work described in this chapter is based on clustering techniques that identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then, a number of small and potentially good subsets of pages is constructed by extracting from each cluster the pages with higher scores. Operating on these subsets with a genetic algorithm, a subset with a good overall score and high internal dissimilarity is identified.

A related problem is then considered: the selection of a subset of pages that comply with the search keywords but are also characterized by sharing a large set of words different from the search keywords. This characteristic represents a sort of semantic connection among these pages that may be of use to spot particular aspects of the information they contain. Such a task is accomplished by the construction of a special graph, whose maximum-weight clique and k-densest subgraph should represent the page subsets with the desired properties. In the following we summarize the main background topics and provide a synthetic description of the methods. Interested readers may find additional information in Caramia and Felici (2004), Caramia and Felici (2005), and Caramia and Felici (2006).
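The objective the genetic algorithm optimizes, a good overall score combined with high internal dissimilarity, can be sketched as a fitness function. This is a minimal illustration, assuming Jaccard distance over page keyword sets and an assumed weight alpha; it is not the chapter's actual vectorization or scoring:

```python
from itertools import combinations

def dissimilarity(p, q):
    """Jaccard distance between page keyword sets (an illustrative
    stand-in for the chapter's context/profile vectorization)."""
    a, b = set(p["words"]), set(q["words"])
    return 1 - len(a & b) / len(a | b)

def fitness(subset, alpha=0.5):
    """Fitness of a candidate page subset: reward a high total relevance
    score and high internal dissimilarity, the two properties the genetic
    algorithm seeks. The weight alpha is an assumed parameter."""
    score = sum(p["score"] for p in subset)
    pairs = list(combinations(subset, 2))
    diversity = sum(dissimilarity(p, q) for p, q in pairs) / max(len(pairs), 1)
    return alpha * score + (1 - alpha) * diversity

pages = [{"score": 0.9, "words": ["java", "jobs", "london"]},
         {"score": 0.8, "words": ["java", "training"]},
         {"score": 0.7, "words": ["python", "rome"]}]
print(fitness(pages))
```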


Author(s):  
Rajeev Gupta ◽  
Virender Singh

Purpose: With the popularity and remarkable usage of digital images in various domains, existing image retrieval techniques need to be enhanced. Content-based image retrieval (CBIR) plays a vital role in retrieving requested data from databases available in cyberspace, and CBIR from cyberspace is a popular and interesting research area today. Accurately searching and downloading requested images from cyberspace based on meta-data using CBIR techniques is a challenging task. The purpose of this study is to explore various image retrieval techniques for retrieving data available in cyberspace.

Methodology: Whenever a user wishes to retrieve an image from the web using present search engines, a bunch of images is retrieved based on the user query, but most of the resulting images are unrelated to it. Here, the user submits a text-based query to a web search engine, and we record the related images and the retrieval time.

Main Findings: This study compares the accuracy and retrieval time for the requested images. After detailed analysis, the main finding is that none of the evaluated web search engines, viz. Flickr, Pixabay, Shutterstock, Bing, and Everypixel, retrieved accurately related images based on the entered query.

Implications: This study discusses and performs a comparative analysis of various content-based image retrieval techniques for cyberspace.

Novelty of Study: The research community has been making efforts towards efficient retrieval of useful images from the web, but this problem has not been solved and still prevails as an open research challenge. This study makes efforts to resolve this challenge and performs a comparative analysis of the outcomes of various web search engines.
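The per-engine measurement loop can be sketched as follows. This is a minimal illustration with a stub engine client, since the study's actual engine interfaces, result formats, and relevance judgments are not specified here:

```python
import time

def timed_search(search_fn, query):
    """Run one engine's search call and record its retrieval time.
    search_fn is a stub standing in for a real engine client."""
    start = time.perf_counter()
    results = search_fn(query)
    return results, time.perf_counter() - start

def accuracy(results, relevant):
    """Fraction of returned images judged relevant to the query;
    manual relevance judgments stand in for ground truth."""
    return sum(1 for r in results if r in relevant) / len(results) if results else 0.0

fake_engine = lambda q: ["img1", "img2", "img3"]   # stub client
results, elapsed = timed_search(fake_engine, "sunset beach")
print(accuracy(results, {"img1", "img3"}), f"{elapsed:.4f}s")
```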


2013 ◽  
Vol 10 (9) ◽  
pp. 1969-1976
Author(s):  
Sathya Bama ◽  
M.S.Irfan Ahmed ◽  
A. Saravanan

The internet continues to grow, which increases the need to improve the quality of its services. Web mining is a research area that applies data mining techniques to address this need. With billions of pages on the web, it is a very intricate task for search engines to provide relevant information to users. Web structure mining plays a vital role by ranking web pages for a user query, which is the most essential task of web search engines. PageRank, Weighted PageRank, and HITS are the commonly used algorithms in web structure mining for ranking web pages, but all of these algorithms treat all links equally when distributing initial rank scores. In this paper, an improved PageRank algorithm is introduced. The results show that the algorithm performs better than the PageRank algorithm.
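For reference, the baseline PageRank iteration that the paper improves upon can be sketched as follows. This is the standard textbook formulation, not the authors' modified algorithm; the damping factor and iteration count are conventional defaults:

```python
def pagerank(links, d=0.85, iterations=50):
    """Standard iterative PageRank. links maps each page to the pages
    it links to. Note: this baseline splits a page's rank equally among
    its out-links, which is exactly the behavior the paper's improved
    algorithm changes; the improvement itself is not reproduced here."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:                       # dangling page: spread rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(web))
```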


2018 ◽  
Vol 2 (1) ◽  
pp. 805-813
Author(s):  
Jerzy Stachowicz

The use of search engines such as Google is an activity that transforms communication practices related to the use of digital devices (especially smartphones) and has a significant impact on Internet users' linguistic practices. One of these practices is conversation: not Internet chat, but “ordinary” face-to-face dialogue. People often search the web during conversations. This practice transforms a simple conversation into a digitally assisted one: a dynamic combination of speaking, typing, and reading on the screen. In this paper, I present some consequences of this change, such as the way searching during conversations “forces” interlocutors to take a different look at their statements, and why reaching for a smartphone and using a search engine can be perceived, regardless of the results displayed on the screen, as a significant rhetorical gesture of negation (usually considered rude). Proficiency in searching and using a smartphone with broadband Internet is considered socially attractive today, just as erudition and literacy once were; indeed, it is currently considered an extension of erudition.


Classical Web search engines focus on satisfying the information needs of users by retrieving relevant Web documents for the user query. A Web document contains information on different Web objects such as authors, automobiles, political parties, etc. A user may access a Web document to procure information about one specific Web object, in which case the remaining information in the document [2-6] becomes redundant for that user. If a Web document is significantly large and the required information is a small fraction of it, the user has to invest effort in locating that information inside the document. It would be much more convenient if the user were provided with only the required Web object information located inside the Web documents. Web object search engines provide Web search through vertical search over Web objects. In this paper, the main goal is to extract the object information present in different documents and integrate it into an object repository over which the Web object search facility is built.
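A minimal sketch of the extraction-and-integration step is given below; the object schema, attribute names, and merge policy are illustrative assumptions, since the paper's repository design is not detailed here:

```python
from collections import defaultdict

def integrate(extractions):
    """Merge per-document object extractions into one object repository,
    keyed by object identity, so that search can run over objects rather
    than whole documents. Schema and merge policy are illustrative."""
    repo = defaultdict(dict)
    for doc_id, objects in extractions.items():
        for obj in objects:
            key = (obj["type"], obj["name"].lower())
            record = repo[key]
            record.setdefault("sources", []).append(doc_id)
            for attr, value in obj.items():
                record.setdefault(attr, value)   # naive policy: first value wins
    return dict(repo)

extractions = {
    "doc1": [{"type": "author", "name": "C. Baptista", "affiliation": "UFCG"}],
    "doc2": [{"type": "author", "name": "c. baptista", "homepage": "https://example.org"}],
}
repo = integrate(extractions)
print(repo[("author", "c. baptista")])
```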


Big Data ◽  
2016 ◽  
pp. 1970-1986
Author(s):  
Nawaf A. Abdulla ◽  
Nizar A. Ahmed ◽  
Mohammed A. Shehab ◽  
Mahmoud Al-Ayyoub ◽  
Mohammed N. Al-Kabi ◽  
...  

The emergence of the Web 2.0 technology generated a massive amount of raw data by enabling Internet users to post their opinions on the web. Processing this raw data to extract useful information can be a very challenging task. An example of important information that can be automatically extracted from users' posts is their opinions on different issues. This problem of Sentiment Analysis (SA) has been studied extensively for the English language, and two main approaches have been devised: corpus-based and lexicon-based. This work focuses on the latter approach due to its various challenges and high potential. The discussions in this paper take the reader through the detailed steps of building the two main components of the lexicon-based SA approach: the lexicon and the SA tool. The experiments show that significant efforts are still needed to reach a satisfactory level of accuracy for lexicon-based Arabic SA. Nonetheless, they provide an interesting guide for researchers in their ongoing efforts to improve lexicon-based SA.
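A minimal sketch of the lexicon-based approach is shown below. It uses illustrative English lexicon entries and a naive negation rule; the paper's actual Arabic lexicon and SA tool are far more elaborate:

```python
# Illustrative English lexicon entries and a naive negation rule; the
# paper's actual Arabic lexicon and SA tool are far more elaborate.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}
NEGATORS = {"not", "no", "never"}

def sentiment(text):
    score, negate = 0, False
    for token in text.lower().split():
        if token in NEGATORS:
            negate = True                 # flip the next sentiment word
            continue
        if token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    if score > 0:
        return "positive"
    return "negative" if score < 0 else "neutral"

print(sentiment("the service was not good but the food was great"))
# -> positive (-1 from negated 'good', +2 from 'great')
```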



2019 ◽  
Vol 8 (3) ◽  
pp. 6371-6375

The innovation of the web produced a huge amount of information by empowering Internet users to post their assessments, remarks, and audits online. Preprocessing helps an Information Retrieval (IR) system understand a user query. IR acts as the container for representing, seeking, and accessing information that relates to a user search string. The information is expressed in natural language; it is not in a structured format, and its words are often ambiguous. One of the major challenges in current web search is the vocabulary mismatch problem during preprocessing. A drawback in web search is that the relationships between the query expressions and the expanded terms are limited: the query expressions relate to the search terms used to fetch information from the IR system, while the expanded terms are added because they are most similar to the words of the search string. In this manuscript, we mainly focus on the intent behind a user's search string on the web. We identify the best features within this context for term selection in a supervised learning-based model. The proposed system focuses on preprocessing techniques such as tokenization, stemming, spell checking, finding dissimilar words, and discovering the keywords in the user query, because these provide better results for the user.
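As an illustration of supervised term selection for query expansion, a candidate expansion term might be represented by features like the following; the feature set and the corpus-statistics layout are assumptions for illustration, not the paper's exact design:

```python
def term_features(term, query_tokens, corpus_stats):
    """Feature vector for a candidate expansion term, of the kind a
    supervised term-selection model could be trained on. The feature
    names and corpus_stats layout are assumptions, not the paper's."""
    cooc = corpus_stats["cooccurrence"].get(term, {})
    return {
        "max_cooc_with_query": max((cooc.get(q, 0) for q in query_tokens), default=0),
        "document_frequency": corpus_stats["df"].get(term, 0),
        "term_length": len(term),
    }

stats = {"cooccurrence": {"notebook": {"laptop": 42}},
         "df": {"notebook": 310}}
print(term_features("notebook", ["laptop", "cheap"], stats))
# A classifier trained on labeled (term, query) pairs would then decide
# which candidate terms to add to the expanded query.
```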

