Automatic generation of agents for collecting hidden Web pages for data extraction

2004 ◽  
Vol 49 (2) ◽  
pp. 177-196 ◽  
Author(s):  
Juliano Palmieri Lage ◽  
Altigran S. da Silva ◽  
Paulo B. Golgher ◽  
Alberto H.F. Laender


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web grows significantly. Information on the web comes in several forms: structured, semi-structured, and unstructured. The majority of this information is presented in web pages, and such information is semi-structured. However, the information required for a given context is scattered across different web documents. It is difficult to analyze the large volumes of semi-structured information presented in web pages and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to perform an effective analysis of the extracted data for effective decision making. The proposed framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few example applications. The framework has been implemented and tested for the effectiveness of the proposed system, and the results are promising.
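As a rough illustration of the kind of pipeline the framework describes, the following is a minimal sketch in Python (standard library only). The URL, the paragraph-tag extraction, and the word-frequency "analysis" step are illustrative placeholders, not components of the proposed framework.

```python
# Minimal sketch of a crawl -> extract -> analyze pipeline (illustrative only;
# the framework described above integrates full crawling, IE, and data mining).
import urllib.request
from html.parser import HTMLParser
from collections import Counter


class TextExtractor(HTMLParser):
    """Collects the text content of paragraph tags from a fetched page."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def crawl_and_extract(url):
    """Fetch one page and return its paragraph texts (a single-page 'crawl')."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.paragraphs


def analyze(paragraphs):
    """Toy 'data mining' step: report the most frequent longer terms."""
    words = [w.lower().strip(".,;:") for p in paragraphs for w in p.split()]
    return Counter(w for w in words if len(w) > 4).most_common(10)


if __name__ == "__main__":
    # Hypothetical source; any domain-specific page list could be plugged in.
    paragraphs = crawl_and_extract("https://example.com/")
    print(analyze(paragraphs))
```

In a real deployment the analysis stage would consolidate records extracted from many pages before reporting, but the three-stage shape (crawl, extract, analyze) is the point of the sketch.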


1999 ◽  
Vol 31 (3) ◽  
pp. 227-251 ◽  
Author(s):  
D.W. Embley ◽  
D.M. Campbell ◽  
Y.S. Jiang ◽  
S.W. Liddle ◽  
D.W. Lonsdale ◽  
...  

Author(s):  
Shalin Hai-Jew

Understanding Web network structures may offer insights into various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent with a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of http network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new top-level domains appearing (such as .ventures, .guru, .education, .company, and others), the ability to map widely will offer a broad competitive advantage to those who exploit this approach to enhance knowledge.
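The tools above are GUI applications, but the underlying idea of mapping link structure can be sketched in a few lines of Python (standard library only). The seed URL is a placeholder, and this one-hop outbound-host count is only a toy stand-in for the much richer network enrichment those tools perform.

```python
# Illustrative sketch: extract outbound links from a seed page and summarize
# which external hosts it points to (one hop of a link-structure map).
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse, urljoin
from collections import Counter


class LinkCollector(HTMLParser):
    """Gathers href values from anchor tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outbound_hosts(seed_url):
    """Return a count of external hosts linked from the seed page."""
    with urllib.request.urlopen(seed_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    seed_host = urlparse(seed_url).netloc
    hosts = (urlparse(urljoin(seed_url, link)).netloc for link in collector.links)
    return Counter(h for h in hosts if h and h != seed_host)


if __name__ == "__main__":
    # Placeholder seed; a real mapping would crawl several hops and many seeds.
    print(outbound_hosts("https://www.w3.org/").most_common(10))
```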


Author(s):  
Xiaoying Gao ◽  
Leon Sterling

The World Wide Web is known as the “universe of network-accessible information, the embodiment of human knowledge” (W3C, 1999). Internet-based knowledge management aims to use the Internet as the world wide environment for knowledge publishing, searching, sharing, reusing, and integration, and to support collaboration and decision making. However, knowledge on the Internet is buried in documents. Most of the documents are written in languages for human readers. The knowledge contained therein cannot be easily accessed by computer programs such as knowledge management systems. In order to make the Internet “machine readable,” information extraction from Web pages becomes a crucial research problem.
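A minimal sketch of what "machine readable" extraction can look like in practice, assuming Python with the standard library only; the fields pulled here (page title and named meta tags) are illustrative, not a prescribed schema.

```python
# Turn a human-readable page into a small machine-readable record
# (illustrative fields only: page title and <meta name=... content=...> pairs).
import urllib.request
from html.parser import HTMLParser
import json


class MetadataExtractor(HTMLParser):
    """Pulls the page title and named meta tags into a dictionary."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.record = {"title": "", "meta": {}}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.record["meta"][attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.record["title"] += data


def extract_record(url):
    """Fetch a page and return its extracted metadata as a structured record."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    extractor = MetadataExtractor()
    extractor.feed(html)
    return extractor.record


if __name__ == "__main__":
    # Hypothetical target page; any document URL could be substituted.
    print(json.dumps(extract_record("https://example.com/"), indent=2))
```

Once pages are reduced to records like this, a knowledge management system can index, query, and integrate them in the same way it would handle rows from a conventional database.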


2018 ◽  
Vol 34 (6) ◽  
pp. 537-546 ◽  
Author(s):  
Miriam Luhnen ◽  
Barbara Prediger ◽  
Edmund A.M. Neugebauer ◽  
Tim Mathes

Objectives: When making decisions in health care, it is essential to consider economic evidence about an intervention. The objective of this study was to analyze the methods applied for systematic reviews of health economic evaluations (SR-HEs) in HTA and to identify common challenges. Methods: We manually searched the Web pages of HTA organizations and included HTA reports published since 2015. Prerequisites for inclusion were the conduct of an SR-HE in at least one electronic database and the use of the English, German, French, or Spanish language. Methodological features were extracted in standardized tables. We computed descriptive statistics (e.g., median, range) to describe the applied methods. Data were synthesized in a structured narrative way. Results: Eighty-three reports were included in the analysis. We identified inexplicable heterogeneity, particularly concerning literature search strategy, data extraction, assessment of quality, and applicability. Furthermore, process steps were often missing or reported in a nontransparent way. The use of a standardized data extraction form was indicated in one-third of reports (32 percent). Fifty-four percent of authors systematically appraised included studies. In 10 percent of reports, the applicability of included studies was assessed. Involvement of two reviewers was rarely reported for study selection (43 percent), data extraction (28 percent), and quality assessment (39 percent). Conclusions: The methods applied for SR-HEs in HTA and their reporting quality are very heterogeneous. Efforts toward detailed, standardized guidance for the preparation of SR-HEs definitely seem necessary. A general harmonization and improvement of the applied methodology would increase the value of SR-HEs for decision makers.
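A hedged sketch, in Python, of the kind of descriptive synthesis described above: it tallies methodological features across included reports and computes shares and a median. The feature names and toy records are hypothetical and are not the study's extracted data.

```python
# Tally methodological features across included reports and summarize them.
# The records below are made-up examples, not data from the review.
from statistics import median


def summarize(reports):
    """Return the share (percent) of reports with each feature and the median year."""
    n = len(reports)
    features = ["standardized_extraction_form", "two_reviewers_selection",
                "quality_appraisal", "applicability_assessed"]
    shares = {f: sum(r[f] for r in reports) / n * 100 for f in features}
    return shares, median(r["year"] for r in reports)


if __name__ == "__main__":
    toy_reports = [
        {"year": 2015, "standardized_extraction_form": True,
         "two_reviewers_selection": False, "quality_appraisal": True,
         "applicability_assessed": False},
        {"year": 2017, "standardized_extraction_form": False,
         "two_reviewers_selection": True, "quality_appraisal": True,
         "applicability_assessed": False},
    ]
    shares, median_year = summarize(toy_reports)
    print(shares, median_year)
```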


Author(s):  
MOHAMMAD SHAFKAT AMIN ◽  
HASAN JAMIL

In the last few years, several works in the literature have addressed the problem of data extraction from web pages. The importance of this problem derives from the fact that, once extracted, data can be handled in a way similar to instances of a traditional database, which in turn facilitates web data integration and various other domain-specific applications. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system can automatically discover table structure by relevant pattern mining from web pages in an efficient way, and can generate regular expressions for the extraction process. Moreover, the proposed system can assign intuitive column names to the columns of the extracted table by leveraging the Wikipedia knowledge base for the purpose of table annotation. To improve the accuracy of the assignment, we exploit the structural homogeneity of the column values and their co-location information to weed out less likely candidates. This approach requires no human intervention, and experimental results have shown its accuracy to be promising. Moreover, the wrapper generation algorithm works in linear time.
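As a hedged illustration of regular-expression-based row extraction (not the authors' wrapper generation algorithm, which mines the repeating pattern and annotates columns via Wikipedia automatically), the following Python sketch applies a hand-written row pattern to a made-up HTML fragment.

```python
# Illustrative regex-based extraction of rows from a generated HTML table.
# A wrapper generator would infer the row pattern and column count from the
# page structure itself; here the pattern and fragment are written by hand.
import re

HTML_FRAGMENT = """
<table>
  <tr><td>Widget A</td><td>12.50</td></tr>
  <tr><td>Widget B</td><td>7.99</td></tr>
</table>
"""

# One table row with two cells; each group captures one column value.
ROW_PATTERN = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")


def extract_rows(html):
    """Return the extracted table as a list of (column_1, column_2) tuples."""
    return ROW_PATTERN.findall(html)


if __name__ == "__main__":
    for name, price in extract_rows(HTML_FRAGMENT):
        print(name, price)
```

Column annotation would then map the anonymous captured groups to intuitive names (e.g., a product name and a price) using external knowledge, which is the step the paper delegates to Wikipedia.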

