Learning Information Extraction Rules for Web Data Mining

Author(s):  
Chia-Hui Chang ◽  
Chun-Nan Hsu

The explosive growth and popularity of the World Wide Web have resulted in a huge number of information sources on the Internet. However, because Web information sources are heterogeneous and lack structure, access to this huge collection of information has been limited to browsing and keyword searching. Sophisticated Web-mining applications, such as comparison shopping, incur high maintenance costs to deal with different data formats. The problem of translating the contents of input documents into structured data is called information extraction (IE). Unlike information retrieval (IR), which concerns how to identify relevant documents in a document collection, IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and search tools.
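To make the IE task concrete, here is a minimal sketch of one hand-written extraction rule turning semistructured HTML into structured records. The markup, the field names, and the `extract` helper are all invented for illustration; the abstract does not describe the authors' rule format, and a learned rule would be induced from examples rather than written by hand.

```python
import re

# Hypothetical product listing; markup and field names are assumptions.
HTML = """
<li><b>USB cable</b> <span class="price">$4.99</span></li>
<li><b>Wireless mouse</b> <span class="price">$18.50</span></li>
"""

# A hand-written extraction rule: the delimiters that surround each field.
RULE = re.compile(
    r'<li><b>(?P<name>.*?)</b>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span></li>'
)

def extract(html):
    """Apply the rule to every match and return structured records."""
    return [
        {"name": m.group("name"), "price": float(m.group("price"))}
        for m in RULE.finditer(html)
    ]

records = extract(HTML)
```

Each record is a plain dictionary, so the output can be passed directly to post-processing such as price comparison. A rule like this breaks as soon as the page layout changes, which is the maintenance cost the abstract refers to.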

2003 ◽  
Vol 92 (3_suppl) ◽  
pp. 1091-1096 ◽  
Author(s):  
Nobuhiko Fujihara ◽  
Asako Miura

The influence of task type on searching the World Wide Web with search engines, without limitation of search domain, was investigated. Nine graduate and undergraduate students studying psychology (1 woman and 8 men, M age = 25.0 yr., SD = 2.1) participated. Their performance in using the search engines on a closed task with only one answer was compared with their performance on an open task with several possible answers. Analysis showed that the number of actions was larger for the closed task (M = 91) than for the open task (M = 46.1). Behaviors such as selection of keywords (on average 7.9% of all actions for the closed task and 16.7% for the open task) and pressing of the browser's back button (on average 40.3% of all actions for the closed task and 29.6% for the open task) also differed. On the other hand, behaviors such as selection of hyperlinks, pressing of the home button, and number of browsed pages were similar for both tasks. Search behaviors were influenced by task type when the students searched for information without limitation placed on the information sources.


2011 ◽  
pp. 203-212
Author(s):  
Luis V. Casaló ◽  
Carlos Flavián ◽  
Miguel Guinalíu

Individuals are increasingly turning to computer-mediated communication to get information on which to base their decisions. For instance, many consumers are using newsgroups, chat rooms, forums, e-mail list servers, and other online formats to share ideas, build communities, and contact other consumers who are seen as more objective information sources (Kozinets, 2002). These social groups have traditionally been called virtual communities. The virtual community concept is almost as old as the concept of the Internet. However, the exponential development of these structures occurred during the nineties (Flavián & Guinalíu, 2004) due to the appearance of the World Wide Web and the spread of other Internet tools such as e-mail or chats. The justification for this expansion is found in the advantages that virtual communities generate for both their members and the organizations that create them.


1996 ◽  
Vol 14 (7) ◽  
pp. 2181-2186 ◽  
Author(s):  
L M Glodé

PURPOSE The Internet, and in particular the World Wide Web (www), has a rapidly increasing potential to provide information for oncologists and their patients about cancer biology and treatment. A brief overview of this environment is given, along with examples of how easily the information is accessed, as a means of introducing the web page of the American Society of Clinical Oncology (ASCO), ASCO OnLine. METHODS Oncology information sources on the www were accessed from the author's home using a 14.4 kbs modem and Netscape browser (Netscape Communications Corp, Mountain View, CA), and the locations were recorded for tabulation and discussion. RESULTS Overwhelming amounts of oncology-related information are now available via the Internet. CONCLUSION Oncology as a subspecialty is ideally suited to apply the newest information technology to traditional needs in areas of education, research, and patient care. Oncologists will increasingly act as information guides rather than information resources for patients with cancer and their families.


2013 ◽  
Vol 60 (1) ◽  
pp. 42-53 ◽  
Author(s):  
Alexandru Napoleon Sireteanu

In the beginning, the World Wide Web was syntactic and the content itself was readable only by humans. The modern web combines existing web technologies with knowledge representation formalisms. In this sense, the Semantic Web proposes the mark-up of content on the web using formal ontologies that structure essential data for the purpose of comprehensive machine understanding. On the syntactical level, standardization is an important topic. Many standards that can be used to integrate different information sources have evolved. Besides classical database interfaces like ODBC, web-oriented standard languages like HTML, XML, RDF and OWL are increasing in importance. As the World Wide Web offers the greatest potential for sharing information, we will base our paper on these evolving standards.


Author(s):  
Mu-Chun Su ◽  
Shao-Jui Wang ◽  
Chen-Ko Huang ◽  
Pa-Chun Wang ◽  
...  

Most of the dramatically increased amount of information available on the World Wide Web is provided via HTML and formatted for human browsing rather than for software programs. This situation calls for a tool that automatically extracts information from semistructured Web information sources, increasing the usefulness of value-added Web services. We present a signal-representation-based parser (SIRAP) that breaks Web pages up into logically coherent groups, for example groups of information related to a single entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm, and records on a Web page are then detected efficiently by matching them against the generated templates. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.
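The abstract does not define HBCC, so the following is only an illustrative sketch of the general idea its name suggests: represent each candidate record by a histogram of its HTML tags, compare histograms with a Pearson correlation coefficient, and grow a list of templates incrementally. The function names, the regular expression, and the 0.8 threshold are all assumptions, not the authors' algorithm.

```python
import re
from collections import Counter
from math import sqrt

# Matches opening tag names only; closing tags ("</td>") are skipped.
TAG = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")

def tag_histogram(fragment):
    """Count how often each opening tag occurs in an HTML fragment."""
    return Counter(t.lower() for t in TAG.findall(fragment))

def correlation(h1, h2):
    """Pearson correlation of two tag histograms over their shared tag set."""
    tags = sorted(set(h1) | set(h2))
    x = [h1[t] for t in tags]
    y = [h2[t] for t in tags]
    if x == y:
        return 1.0  # identical structure, including flat histograms
    n = len(tags)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation undefined; treat as no similarity
    return cov / sqrt(vx * vy)

def group_records(fragments, threshold=0.8):
    """Assign each fragment to the first sufficiently similar template,
    creating a new template when none matches (threshold is an assumption)."""
    templates, labels = [], []
    for fragment in fragments:
        hist = tag_histogram(fragment)
        for i, template in enumerate(templates):
            if correlation(hist, template) >= threshold:
                labels.append(i)
                break
        else:
            templates.append(hist)
            labels.append(len(templates) - 1)
    return labels

row_a = "<tr><td><b>A</b></td><td>1</td></tr>"
row_b = "<tr><td><b>B</b></td><td>2</td></tr>"
other = "<div><p>text</p><p>more</p><p>x</p></div>"
labels = group_records([row_a, row_b, other])
```

Here the two table rows share a tag structure and fall under the same template, while the `div` block starts a new one. A histogram ignores tag order, so this sketch would not separate records that use the same tags in a different arrangement.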


1999 ◽  
Vol 9 (3) ◽  
pp. 451-454
Author(s):  
M.P. Garber ◽  
K. Bondari

Results of a national survey indicated that the top four sources of information used by garden writers for new or appropriate plants were nursery catalogs, botanical and public gardens, seed company catalogs, and gardening magazines. More than 50% of the participating garden writers reported using these four sources a lot. The most frequently used books and magazines were Horticulture Magazine (34.6%), Manual of Woody Landscape Plants (24.1%), and Fine Gardening (23.7%). About 29% of the garden writers used the World Wide Web to source information, and the two most widely used types of sites were those of universities and of botanical gardens and arboreta. A high percentage of garden writers desire greater or more frequent communication with botanical gardens and arboreta (90.4%), university personnel (87.4%), and plant producers (86.3%).

