Indexing temporal information for web pages

2011 ◽  
Vol 8 (3) ◽  
pp. 711-737 ◽  
Author(s):  
Peiquan Jin ◽  
Hong Chen ◽  
Xujian Zhao ◽  
Xiaowen Li ◽  
Lihua Yue

Temporal information plays an important role in Web search, as Web pages intrinsically involve a crawl time and most Web pages contain time keywords in their content. How to integrate temporal information into Web search engines has been a research focus in recent years, and key issues such as temporal-textual indexing and temporal information extraction must be studied first. In this paper, we first present a framework for a temporal-textual Web search engine. We then concentrate on designing a new hybrid index structure for the temporal and textual information of Web pages. In particular, we propose to integrate a B+-tree, an inverted file, and a typical temporal index called the MAP21-tree to handle temporal-textual queries. We study five mechanisms for implementing a hybrid index structure for temporal-textual queries, which organize the inverted file, B+-tree, and MAP21-tree in different ways. After a theoretical analysis of the performance of these five index structures, we conduct experiments on both simulated and real data sets to compare their performance. The experimental results show that among all the index schemes, the first-inverted-file-then-MAP21-tree index structure has the best query performance and is thus an acceptable choice as the temporal-textual index for future time-aware search engines.
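
As an illustration of the winning scheme, here is a minimal Python sketch of a first-inverted-file-then-temporal-index lookup; a plain list of intervals stands in for the MAP21-tree, and all names are invented for illustration rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical sketch: an inverted file whose postings are filtered by a
# per-term temporal structure. A plain interval list stands in for the
# MAP21-tree described in the paper.
class TemporalTextIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(start, end, doc_id)]

    def add(self, term, start, end, doc_id):
        self.postings[term].append((start, end, doc_id))

    def query(self, term, q_start, q_end):
        # The inverted file first narrows candidates to one term's postings;
        # the temporal structure then keeps intervals overlapping the query.
        return [doc for (s, e, doc) in self.postings[term]
                if s <= q_end and e >= q_start]

idx = TemporalTextIndex()
idx.add("olympics", 2008, 2008, "doc1")
idx.add("olympics", 2012, 2012, "doc2")
print(idx.query("olympics", 2010, 2013))  # -> ['doc2']
```

The point of this ordering is that the textual lookup reduces the candidate set to a single term's postings, so the temporal filter only has to scan a small list.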

2016 ◽  
Vol 6 (2) ◽  
pp. 1-23 ◽  
Author(s):  
Surbhi Bhatia ◽  
Manisha Sharma ◽  
Komal Kumar Bhatia

Due to the explosive growth of web technologies, a huge quantity of user-generated content is available online. People's experiences and opinions play an important role in decision making. Although facts make it easy to search for information on a topic, retrieving opinions is still a difficult task, and opinion mining studies must be undertaken to extract constructive opinionated information from reviews efficiently. The present work focuses on the design and implementation of an Opinion Crawler, which downloads opinions from various sites while ignoring the rest of the web. It also detects web pages that are updated frequently, computing a timestamp for revisiting each page in order to extract fresh, relevant opinions. The performance of the Opinion Crawler is evaluated on real data sets and proves to be accurate in terms of the precision and recall quality attributes.
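
The revisit policy is only described at a high level in the abstract; the following hedged Python sketch shows one plausible reading, in which the revisit interval shrinks for pages that changed since the last visit and grows for pages that did not. The halving/doubling rule and all names are assumptions, not the authors' exact formula.

```python
import time
import hashlib

# Illustrative revisit scheduler: frequently changing pages get
# earlier revisit timestamps.
class RevisitScheduler:
    def __init__(self, base_interval=3600.0):
        self.base = base_interval
        self.state = {}  # url -> (content_hash, interval, next_visit)

    def visit(self, url, content):
        digest = hashlib.sha256(content.encode()).hexdigest()
        now = time.time()
        if url in self.state:
            old_digest, interval, _ = self.state[url]
            # Shrink the interval if the page changed, grow it otherwise.
            interval = interval / 2 if digest != old_digest else interval * 2
        else:
            interval = self.base
        self.state[url] = (digest, interval, now + interval)
        return now + interval  # timestamp for the next revisit

s = RevisitScheduler()
s.visit("http://reviews.example/page1", "great product, would buy again")
```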


The Dark Web ◽  
2018 ◽  
pp. 359-374
Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

ICT plays a vital role in human development through information extraction, and includes computer networks and telecommunication networks. One of the important modules of ICT is computer networks, which are the backbone of the World Wide Web (WWW). Search engines are computer programs that browse and extract information from the WWW in a systematic and automatic manner. This paper examines the three main components of a search engine: the Extractor, a web crawler that starts from a URL; the Analyzer, an indexer that processes the words on each web page and stores the resulting index in a database; and the Interface Generator, a query handler that understands the needs and preferences of the user. The paper covers both the information available on the surface web through general web pages and the hidden information behind query interfaces, called the deep web. It emphasizes the extraction of relevant information so as to present the content the user prefers as the first result of his or her search query, and discusses aspects of the deep web along with an analysis of a few existing deep web search engines.
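
As a toy illustration of the three components named above, the following Python sketch wires together a stubbed Extractor, an Analyzer that builds an inverted index, and an Interface Generator that answers conjunctive keyword queries. Real HTTP fetching and link following are deliberately stubbed out; all names are invented.

```python
from collections import defaultdict

def extractor(seed_pages):
    # Stand-in for a crawler: seed_pages maps url -> fetched text.
    for url, text in seed_pages.items():
        yield url, text

def analyzer(pages):
    # Indexer: map each word to the set of pages containing it.
    index = defaultdict(set)
    for url, text in pages:
        for word in text.lower().split():
            index[word].add(url)
    return index

def interface_generator(index, query):
    # Query handler: return pages containing every query word.
    hits = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*hits) if hits else set()

pages = {"http://a.example": "deep web search",
         "http://b.example": "surface web pages"}
idx = analyzer(extractor(pages))
print(interface_generator(idx, "web search"))  # -> {'http://a.example'}
```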


Author(s):  
Daniel Crabtree

Web search engines help users find relevant web pages by returning a result set containing the pages that best match the user's query. When the identified pages have low relevance, the query must be refined to capture the search goal more effectively. However, finding appropriate refinement terms is difficult and time consuming for users, so researchers have developed query expansion approaches to identify refinement terms automatically. There are two broad approaches to query expansion: automatic query expansion (AQE) and interactive query expansion (IQE) (Ruthven et al., 2003). AQE involves no user interaction, which is simpler for the user but limits its performance. IQE involves the user, which is more complex for the user but means it can tackle harder problems, such as ambiguous queries. Searches fail by finding too many irrelevant pages (low precision) or too few relevant pages (low recall). AQE has a long history in the field of information retrieval, where the focus has been on improving recall (Velez et al., 1997). Unfortunately, AQE often decreased precision, as the terms used to expand a query tended to change the query's meaning (Croft and Harper (1979) identified this effect and named it query drift). The problem is that users typically consider just the first few results (Jansen et al., 2005), which makes precision vital to web search performance. In contrast, IQE has historically balanced precision and recall, leading to an earlier uptake within web search. However, like AQE, the precision of IQE approaches needs improvement. Most recently, approaches have started to improve precision by incorporating semantic knowledge.
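
A minimal sketch of one classic AQE technique, pseudo-relevance feedback, follows; the article surveys the family of approaches rather than this exact recipe, and the stopword list and term statistics here are purely illustrative.

```python
from collections import Counter

# Pseudo-relevance feedback: assume the top-ranked results are relevant
# and append their most frequent terms to the query. Query drift occurs
# when these added terms shift the query's meaning.
def expand_query(query, top_results, n_terms=3):
    stopwords = set(query.lower().split()) | {"the", "a", "of", "and"}
    counts = Counter(
        w for doc in top_results
        for w in doc.lower().split()
        if w not in stopwords
    )
    expansion = [w for w, _ in counts.most_common(n_terms)]
    return query + " " + " ".join(expansion)

results = ["jaguar cars for sale", "used jaguar cars dealer",
           "jaguar cars price list"]
print(expand_query("jaguar", results))  # e.g. 'jaguar cars for sale'
```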


Author(s):  
Wen-Chen Hu ◽  
Hung-Jen Yang ◽  
Jyh-haw Yeh ◽  
Chung-wei Lee

The World Wide Web now holds more than six billion pages covering almost all daily issues. The Web's fast-growing size and lack of structural style present a new challenge for information retrieval (Lawrence & Giles, 1999a). Traditional search techniques are based on users typing in search keywords, which the search services then use to locate the desired Web pages. However, this approach normally retrieves too many documents, of which only a small fraction are relevant to the user's needs. Furthermore, the most relevant documents do not necessarily appear at the top of the query output list. Numerous search technologies have been applied to Web search engines; however, the dominant search methods have yet to be identified. This article provides an overview of the existing technologies for Web search engines and classifies them into six categories: i) hyperlink exploration, ii) information retrieval, iii) metasearches, iv) SQL approaches, v) content-based multimedia searches, and vi) others. At the end of the article, a comparative study of major commercial and experimental search engines is presented, and some future research directions for Web search engines are suggested. Related reviews of Web search technology can be found in Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan (2001) and Lawrence and Giles (1999b).
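
As a small illustration of the first category, hyperlink exploration, here is a hedged Python sketch of the well-known PageRank power iteration on a toy three-page graph; the damping factor and the graph itself are invented values, not drawn from the article.

```python
# PageRank by power iteration: a page's score is spread evenly over
# its outgoing links, with a damping factor modeling random jumps.
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            inbound = sum(rank[q] / len(links[q])
                          for q in pages if p in links[q])
            new[p] = (1 - damping) / len(pages) + damping * inbound
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))  # 'c', with the most inbound weight, scores highest
```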


2017 ◽  
Author(s):  
Xi Zhu ◽  
Xiangmiao Qiu ◽  
Dingwang Wu ◽  
Shidong Chen ◽  
Jiwen Xiong ◽  
...  

BACKGROUND: Electronic health practices such as apps and software depend on web search engines because of their convenience for retrieving information, so the success of electronic health is linked to the success of web search engines in the health domain. Yet the reliability of the information in search engine results remains to be evaluated, and a detailed analysis can reveal shortcomings and provide guidance. OBJECTIVE: To assess the reliability of information related to epilepsy in women returned by the main search engines in China. METHODS: Six physicians conducted searches every week. The search keywords were the name of one antiepileptic drug (AED: valproate/oxcarbazepine/levetiracetam/lamotrigine) plus "huaiyun" or "renshen", both of which mean pregnancy in Chinese. The searches were conducted on different devices (computer/cellphone) and different engines (Baidu/Sogou/360). The top ten results of every search result page were included. Two physicians classified every result into one of nine categories according to its content and also evaluated its reliability. RESULTS: A total of 16,411 search results were included. 85.1% of the web pages carried advertisements, and 55% were categorized as question-and-answer pages according to their content. Only 9% of the results were reliable, 50.7% were partly reliable, and 40.3% were unreliable. The higher a result was ranked, the more likely it was to be an advertisement and to be unreliable. All content from hospital websites was unreliable, whereas all content from academic publishers was reliable. CONCLUSIONS: Several principles must be emphasized to further the use of web search engines in healthcare. First, identifying registered physicians and developing an efficient system to guide patients to physicians would guarantee the quality of the information provided. Second, the responsible authorities should restrict excessive advertising sales in the healthcare area through specific regulations to avoid a negative impact on patients. Third, information from hospital websites should be judged carefully before being embraced wholeheartedly.


Author(s):  
Konstantinos Kotis

Current keyword-based Web search engines (e.g. Google) provide access for thousands of people to billions of indexed Web pages. Although the amount of irrelevant results returned due to the linguistic phenomena of polysemy (one word with several meanings) and synonymy (several words with one meaning) tends to be reduced (e.g. by narrowing the search using human-directed topic hierarchies, as in Yahoo!), the uncontrolled publication of Web pages still requires an alternative to the way Web information is authored and retrieved today. This alternative can be found in the technologies of the new era of the Semantic Web. The Semantic Web, which currently uses the OWL language to describe content, is at once an extension of and an alternative to the traditional Web. A Semantic Web Document (SWD) describes its content with semantics, i.e. domain-specific tags related to a specific conceptualization of a domain, adding meaning to the document's (annotated) content. Ontologies play a key role in providing such descriptions, since they offer a standard way to express explicit and formal conceptualizations of domains. Because traditional Web search engines cannot easily take advantage of documents' semantics, e.g. they cannot find documents that describe similar concepts rather than just similar words, semantic search engines (e.g. SWOOGLE, OntoSearch) and several other semantic search technologies have been proposed, such as Semantic Portals (Zhang et al., 2005), Semantic Wikis (Völkel et al., 2006), multi-agent P2P ontology-based semantic query-routing systems (Tamma et al., 2004), and ontology-mapping-based query/answering systems (Lopez et al., 2006; Kotis & Vouros, 2006; Bouquet et al., 2004). Within these technologies, queries can be posed as formally described (or annotated) content, and a semantic matching algorithm can provide exact matches with the SWDs whose semantics match the semantics of the query.

Although Semantic Web technology contributes much to the retrieval of Web information, some open issues remain. First of all, unstructured (traditional Web) documents must be semantically annotated with domain-specific tags (ontology-based annotation) in order to be used by semantic search technologies. This is not an easy task: it requires specific domain ontologies to be developed to provide such semantics (tags), and a fully automatic annotation process is still an open issue. On the other hand, SWDs can be retrieved semantically only through formal queries. Constructing a formal query is also a difficult and time-consuming task, since a formal language must be learned. Techniques for automating the transformation of a natural language query into a formal (structured) one are currently being investigated. Nevertheless, more sophisticated technologies, such as mapping several schemas to a formal query constructed in the form of an ontology, must also be investigated; this approach is suited to retrieving heterogeneous and distributed SWDs, since their structure cannot be known a priori (in open environments like the Semantic Web).

This article aims to provide an insight into the technologies currently used in Semantic Web search, focusing on two issues: a) the automatic construction of a formal query (query ontology) and b) the querying of a collection of knowledge sources whose structure is not known a priori (distributed and semantically heterogeneous documents).
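
To make the contrast with keyword matching concrete, here is a hedged Python sketch of concept-based retrieval using the rdflib library (an assumption; the article names no toolkit). The tiny Turtle vocabulary is invented: the formal SPARQL query finds pages about neurological disorders even though that word appears on neither page.

```python
from rdflib import Graph

# Invented toy vocabulary: two annotated documents and a small
# concept hierarchy, serialized as Turtle.
doc = """
@prefix ex: <http://example.org/> .
ex:page1 ex:describes ex:Epilepsy .
ex:page2 ex:describes ex:Migraine .
ex:Epilepsy ex:subClassOf ex:NeurologicalDisorder .
ex:Migraine ex:subClassOf ex:NeurologicalDisorder .
"""

g = Graph()
g.parse(data=doc, format="turtle")

# Formal query: match on concepts, not keywords. A keyword search for
# 'neurological' would return nothing; this returns both pages.
q = """
PREFIX ex: <http://example.org/>
SELECT ?page WHERE {
  ?page ex:describes ?topic .
  ?topic ex:subClassOf ex:NeurologicalDisorder .
}
"""
for row in g.query(q):
    print(row.page)  # prints ex:page1 and ex:page2
```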


2011 ◽  
Vol 2 (1) ◽  
pp. 46-63
Author(s):  
José Antonio Robles-Flores ◽  
Gregory Schymik ◽  
Julie Smith-David ◽  
Robert St. Louis

Web search engines typically retrieve a large number of web pages and overload business analysts with irrelevant information. One approach that has been proposed for overcoming some of these problems is automated Question Answering (QA). This paper describes a case study designed to determine the efficacy of QA systems at generating answers to original, fusion, list questions (questions that have not previously been asked and answered, whose answer cannot be found on a single web site, and whose answer is a list of items). The results indicate that QA algorithms are not very good at producing complete answer lists and that searchers are not very good at constructing answer lists from snippets. These findings indicate a need for QA research to focus on crowdsourcing answer lists and improving output formats.
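
A hedged sketch of the fusion step the study found difficult: merging one answer list from candidates extracted out of several snippets. The frequency-based voting rule and the support threshold are illustrative assumptions, not the authors' method.

```python
from collections import Counter

# Keep a candidate answer only if it appears in enough independent
# snippets; more support first.
def fuse_answer_list(snippet_candidates, min_support=2):
    counts = Counter(c for cands in snippet_candidates for c in set(cands))
    return [c for c, n in counts.most_common() if n >= min_support]

# Candidates already extracted from three snippets for the fusion list
# question "Which countries border Peru?"
snippets = [
    ["brazil", "chile", "ecuador"],
    ["brazil", "colombia", "bolivia"],
    ["chile", "brazil", "bolivia"],
]
print(fuse_answer_list(snippets))  # -> ['brazil', 'chile', 'bolivia']
```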


2012 ◽  
Vol 41 (3) ◽  
pp. 15-23 ◽  
Author(s):  
Edimar Manica ◽  
Carina F. Dorneles ◽  
Renata Galante
