Analysis and Improvement of Data Extraction Technology on the Web

Author(s):  
Bi Li

2010 ◽  
Vol 20-23 ◽  
pp. 178-183
Author(s):  
Jun Hua Gu ◽  
Jie Song ◽  
Na Zhang ◽  
Yan Liu Liu

With the increasing speed of the Internet and the growing amount of data it contains, users find it more and more difficult to obtain useful information from the Web; how to extract accurate information from the Web efficiently has become an urgent problem, and Web information extraction technology has emerged to solve it. A method of automatic Web information extraction based on XML is designed by standardizing HTML documents with a data translation algorithm, forming an extraction rule base by learning the XPath expressions of sample pages, and applying the rule base to automatically extract pages of the same kind. The results show that this approach yields a higher recall ratio and precision ratio, and that the extracted result is self-describing, which makes it convenient to build data extraction systems for individual domains.
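The rule-base step this abstract describes can be sketched minimally: learn (here, hard-code) one XPath expression per field, then apply the same rule base to every page of the same kind. The page structure, field names, and rules below are illustrative stand-ins, not the paper's actual rule base, and Python's standard `xml.etree.ElementTree` stands in for the XML tooling applied after HTML standardization.

```python
import xml.etree.ElementTree as ET

# Two "same-kind" pages, assumed already normalized from HTML to XML
# (the paper's data-translation step). Structure is illustrative.
PAGES = [
    "<page><item><name>Vase</name><price>12.50</price></item></page>",
    "<page><item><name>Bowl</name><price>8.00</price></item></page>",
]

# Extraction rule base learned from sample pages: field -> XPath expression.
RULES = {"name": ".//item/name", "price": ".//item/price"}

def extract(xml_doc, rules):
    """Apply each XPath rule to one normalized page."""
    root = ET.fromstring(xml_doc)
    return {field: root.findtext(xpath) for field, xpath in rules.items()}

records = [extract(page, RULES) for page in PAGES]
print(records)
```

Once the rule base exists, every structurally similar page is processed without further human input, which is what makes the extraction "automatic" in the paper's sense.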


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the Web is growing significantly. Web information comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, where it is semi-structured, yet the information required for a given context is scattered across different web documents. It is difficult to analyze such large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various Web sources and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few example applications. The framework was implemented and tested for effectiveness, and the results are promising.
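The crawl–extract–mine pipeline this abstract outlines can be sketched end to end; here the crawl step is stubbed with static strings standing in for fetched pages, the extraction step uses simple patterns, and the mining step is a toy aggregation. All page content, class names, and fields are invented for illustration.

```python
import re
from collections import Counter

# Stub "crawl" results standing in for pages fetched by an HTTP crawler.
CRAWLED = [
    "<html><span class='product'>kettle</span><span class='price'>20</span></html>",
    "<html><span class='product'>kettle</span><span class='price'>25</span></html>",
    "<html><span class='product'>teapot</span><span class='price'>15</span></html>",
]

def extract_fields(page):
    """Information-extraction step: pull product name and price."""
    product = re.search(r"class='product'>([^<]+)<", page).group(1)
    price = int(re.search(r"class='price'>(\d+)<", page).group(1))
    return product, price

def analyze(records):
    """Toy mining step: mention counts and average price per product."""
    counts = Counter(p for p, _ in records)
    avg = {p: sum(v for q, v in records if q == p) / counts[p] for p in counts}
    return counts, avg

records = [extract_fields(page) for page in CRAWLED]
counts, avg_price = analyze(records)
print(counts, avg_price)
```

The point of the framework is exactly this chaining: each stage consumes the previous stage's output, so consolidated reports fall out of the pipeline rather than from manual collation.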


Author(s):  
Shalin Hai-Jew

Understanding Web network structures may offer insights on various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent from a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of http network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new top-level domains appearing (like .ventures, .guru, .education, .company, and others), the ability to map widely will offer a broad competitive advantage to those who would exploit this approach to enhance knowledge.


2014 ◽  
Vol 989-994 ◽  
pp. 4322-4325
Author(s):  
Mu Qing Zhan ◽  
Rong Hua Lu

As a means of getting information from the Internet, Web information extraction technology, which can obtain more precise and finer-grained information, differs from a search engine. On the basis of an analysis of the state of Web information extraction technology at home and abroad, this article presents a technical route for extracting Web information about ceramic products, defines the extraction rules, develops an extraction system, and acquires the relevant ceramic products' information.


Author(s):  
Arshid Yousefi Avarvand ◽  
Mehrdad Halaji ◽  
Donya Zare ◽  
Meysam Hasannejad-Bibalan ◽  
Hadi Sedigh Ebrahim-Saraie

Background: Streptococcus pneumoniae is an important pathogen of children, mostly in developing countries. We aimed to investigate the prevalence of invasive S. pneumoniae among Iranian children using a systematic review and meta-analysis. Methods: A systematic search was carried out to identify papers published by Iranian authors in the Web of Science, PubMed, Scopus, and Google Scholar electronic databases from January 2010 to December 2017. Seven publications that met our inclusion criteria were selected for data extraction and analysis. Results: In total, one study was multicenter and six were single-center studies; all of the included studies were performed among hospitalized patients. Seven studies reported the prevalence of invasive S. pneumoniae isolated from children; from these, the pooled prevalence of S. pneumoniae was 2.5% (95% CI: 0.7%-9.1%). Conclusion: The overall prevalence of invasive S. pneumoniae infections among Iranian children is low (2.5%). However, further clinical studies are required to elucidate the burden of infections among Iranian children, especially in eastern regions.
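The abstract does not state which meta-analytic model produced the pooled prevalence; a common choice for pooling proportions is inverse-variance weighting on the logit scale, sketched below under that assumption. The (events, sample size) pairs are invented for illustration and are not the data of the seven included studies.

```python
import math

# Illustrative (events, sample-size) pairs -- NOT the reviewed studies' data.
STUDIES = [(2, 120), (5, 300), (1, 80)]

def pooled_prevalence(studies):
    """Fixed-effect inverse-variance pooling of proportions on the logit scale."""
    num = den = 0.0
    for events, n in studies:
        p = events / n
        logit = math.log(p / (1 - p))
        var = 1 / events + 1 / (n - events)  # approx. variance of the logit
        weight = 1 / var
        num += weight * logit
        den += weight
    pooled_logit = num / den
    se = math.sqrt(1 / den)

    def back(x):  # back-transform logit -> proportion
        return 1 / (1 + math.exp(-x))

    return back(pooled_logit), back(pooled_logit - 1.96 * se), back(pooled_logit + 1.96 * se)

est, lo, hi = pooled_prevalence(STUDIES)
print(f"pooled prevalence {est:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

A random-effects model (e.g. DerSimonian-Laird) would add a between-study variance term to the weights; the asymmetric CI reported in the abstract (0.7%-9.1% around 2.5%) is consistent with back-transforming a symmetric interval from such a transformed scale.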


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting the information a user requires from a web page. The information consists of semi-structured data not in a structured format, and the extraction involves web documents in HTML. Nowadays, most people use web data extractors because the extraction involves large amounts of information, which makes manual information extraction time-consuming and complicated. We present in this paper the WEIDJ approach to extracting images from the Web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes as input both a web address and the extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and the whole web site. Extensive experiments on several biodiversity web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ for a single web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
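The core of the image-harvesting step can be sketched as a DOM walk that emits each image as a JSON-like record. This is a minimal stand-in, not WEIDJ itself: the real system partitions the DOM into visual blocks and works across multiple pages, while this sketch only shows the traversal and JSON-style output using Python's standard `html.parser`; the page markup is invented.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Walk the DOM stream and collect each <img> as a JSON-style record."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append({"src": a.get("src"), "alt": a.get("alt", "")})

# Illustrative template-based page fragment (e.g. a biodiversity gallery).
PAGE = """<html><body>
<div class="gallery">
  <img src="fern.jpg" alt="Fern">
  <img src="moss.jpg" alt="Moss">
</div>
</body></html>"""

collector = ImageCollector()
collector.feed(PAGE)
print(collector.images)
```

Because the output is already a list of dictionaries, serializing it with `json.dumps` yields the kind of self-describing extraction result the approach targets.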


Author(s):  
Fatimah Sidi ◽  
Iskandar Ishak ◽  
Marzanah A. Jabar

An enormous number of unstructured documents written in the Malay language are available on the web and on intranets. However, unstructured documents cannot be queried in simple ways, so the knowledge contained in such documents can neither be used by automatic systems nor be understood easily and clearly by humans. This paper proposes a new approach that transforms knowledge extracted from unstructured Malay documents using an ontology, by identifying, organizing, and structuring the documents into an interrogative structured form. A Malay knowledge base, the MalayIK corpus, is developed and used to test MalayIK-Ontology against Ontos, an existing data extraction engine. The experimental results from MalayIK-Ontology show a significant improvement in knowledge extraction over the Ontos implementation. This shows that clear knowledge organization and structuring of concepts can increase understanding, which leads to a potential increase in the sharing and reuse of concepts within the community.

