A Method of Web Information Automatic Extraction Based on XML

2010 ◽  
Vol 20-23 ◽  
pp. 178-183
Author(s):  
Jun Hua Gu ◽  
Jie Song ◽  
Na Zhang ◽  
Yan Liu Liu

As the Internet grows ever faster and the amount of data it contains keeps increasing, users find it more and more difficult to obtain useful information from the web. How to extract accurate information from the Web efficiently has become an urgent problem, and Web information extraction technology has emerged to address it. The proposed method of automatic Web information extraction based on XML works by standardizing the HTML document with a data translation algorithm, forming an extraction rule base by learning the XPath expressions of sample pages, and then applying that rule base to automatically extract pages of the same kind. The results show that this approach leads to higher recall and precision, and that the extracted result is self-describing, which makes it convenient for building domain-specific data extraction systems.
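The paper's rule base is learned from sample pages; as a minimal sketch of the idea, assuming a hand-written rule base and hypothetical field names and markup, XPath-driven extraction over a normalized HTML tree might look like this:

```python
# Illustrative sketch only: the rule base here is hand-written, whereas the
# paper learns its XPath expressions from sample pages. Field names and the
# example markup are hypothetical.
from lxml import html

# A tiny "extraction rule base": one XPath expression per target field.
RULE_BASE = {
    "title": "//div[@class='product']/h1/text()",
    "price": "//span[@class='price']/text()",
}

def extract(page_source: str) -> dict:
    # lxml normalizes malformed HTML into a well-formed tree, which stands
    # in for the paper's HTML-standardization step.
    tree = html.fromstring(page_source)
    return {field: tree.xpath(xp) for field, xp in RULE_BASE.items()}

sample = "<div class='product'><h1>Vase</h1><span class='price'>12.50</span></div>"
print(extract(sample))  # {'title': ['Vase'], 'price': ['12.50']}
```

Because every same-kind page is queried with the same stored expressions, adapting the extraction only requires editing the rule base, not the code.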

2014 ◽  
Vol 989-994 ◽  
pp. 4322-4325
Author(s):  
Mu Qing Zhan ◽  
Rong Hua Lu

As a means of obtaining information from the Internet, Web information extraction technology differs from search engines in that it can retrieve more precise and more granular information. Based on an analysis of the state of Web information extraction technology at home and abroad, this article presents a technical route for extracting information about ceramic products from the Web, formulates the extraction rules, develops an extraction system, and acquires the relevant ceramic product information.
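The article's extraction rules are not published; as one invented example of what a single rule might look like, a regular expression capturing a ceramic product's price from a known markup pattern:

```python
# Hypothetical rule for illustration only: the field name, pattern, and
# markup are invented, not taken from the article's rule set.
import re

PRICE_RULE = re.compile(r'<span class="price">\s*¥?([\d.]+)\s*</span>')

snippet = '<span class="price">¥128.00</span>'
match = PRICE_RULE.search(snippet)
if match:
    print(match.group(1))  # 128.00
```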


2014 ◽  
Vol 614 ◽  
pp. 503-506
Author(s):  
Qi Shen ◽  
Qing Ming Song ◽  
Bo Chen

With the development of web technology, dynamic web pages and personalized page content have become increasingly common. Page content now changes constantly and the structures of different pages vary widely, so traditional approaches to web information extraction have difficulty adapting to the situation. Based on an analysis of the structural features of tourism-themed web pages, this paper proposes a web information extraction method built on an extended XPath policy. The algorithm avoids the defects of traditional web information extraction technology: it is simple and practical, cleans data efficiently and accurately, and reduces system overhead.
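The extended XPath policy itself is not spelled out in the abstract; one common way to tolerate structural variation across pages, sketched below under that assumption, is to attach an ordered list of candidate XPath expressions to each field and keep the first match (field names and expressions here are hypothetical):

```python
# Sketch of fallback-chain extraction, not the paper's actual policy:
# each field carries several candidate XPath rules, tried in order.
from lxml import html

CANDIDATE_RULES = {
    "hotel_name": [
        "//h1[@class='hotel-title']/text()",
        "//div[@id='hotel']//h1/text()",   # fallback for an older layout
    ],
}

def extract_field(tree, candidates):
    for xp in candidates:
        result = tree.xpath(xp)
        if result:                          # first matching rule wins
            return result[0].strip()
    return None

page = html.fromstring("<div id='hotel'><h1>Lakeview Inn</h1></div>")
print(extract_field(page, CANDIDATE_RULES["hotel_name"]))  # Lakeview Inn
```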


2021 ◽  
Vol 33 (3) ◽  
pp. 87-100
Author(s):  
Denis Eyzenakh ◽  
Anton Rameykov ◽  
Igor Nikiforov

Over the past decade, the Internet has become a gigantic and rich source of data. This data is used for knowledge extraction through machine learning analysis. To perform data mining on web information, the data must be extracted from the source and loaded into analytical storage; this is the ETL process. Different web sources offer different ways to access their data: either an API over the HTTP protocol or parsing of the HTML source code. The article is devoted to an approach for high-performance data extraction from sources that do not provide an API to access the data. Distinctive features of the proposed approach are load balancing, two levels of data storage, and separation of the file-downloading process from the scraping process. The approach is implemented in a solution built with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.
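As a minimal sketch of the separation between scraping and downloading, assuming a Redis list as the hand-off queue (the queue name, selectors, and addresses are hypothetical; the article's actual pipeline is more elaborate):

```python
# Sketch: the spider scrapes metadata but never downloads files itself;
# file URLs go onto a shared Redis queue for separate downloader workers.
import redis
import scrapy

queue = redis.Redis(host="localhost", port=6379)

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/catalog"]

    def parse(self, response):
        for item in response.css("div.item"):
            # Scrape lightweight metadata immediately ...
            yield {"title": item.css("h2::text").get()}
            # ... but defer heavy file downloads to a separate worker pool
            # by pushing the URL onto the shared queue.
            file_url = item.css("a.download::attr(href)").get()
            if file_url:
                queue.rpush("download_queue", response.urljoin(file_url))
```

A separate pool of downloader workers would then pop URLs from download_queue and write the files to shared storage (CephFS in the article's stack), so slow downloads never stall the scraping loop.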


2008 ◽  
pp. 469-484
Author(s):  
David Camacho ◽  
Ricardo Aler ◽  
Juan Cuadrado

How to build intelligent, robust applications that work with the information stored in the Web is a difficult problem for several reasons arising from the essential nature of the Web: the information is highly distributed, it is dynamic (in both content and format), it is usually not correctly structured, and web sources may be unreachable at times. To build robust and adaptable web systems, it is necessary to provide a standard representation for the information (e.g., using languages such as XML and ontologies to represent the semantics of the stored knowledge). However, this is still an open research field, and most web sources do not provide their information in a structured way. This chapter analyzes a new approach that allows us to build robust and adaptable web systems by using a multi-agent approach. Several problems, including how to retrieve, extract, and manage the stored information from web sources, are analyzed from an agent perspective. Two difficult problems will be addressed in this chapter: designing a general architecture to deal with the problem of managing web information sources, and enabling these agents to work semiautomatically, adapting their behavior to the dynamic conditions of the electronic sources. To achieve the first goal, a generic web-based multi-agent system (MAS) will be proposed and applied to a specific problem: retrieving and managing information from electronic newspapers. To partially solve the problem of retrieving and extracting web information, a semiautomatic web parser will be designed and deployed as a reusable software component. This parser uses two sets of rules to adapt the behavior of the web agent to possible changes in the web sources: the first defines the knowledge to be extracted from the HTML pages; the second represents the final structure in which the retrieved knowledge is stored. Using this parser, a specific web-based multi-agent system will be implemented.
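The chapter's rule syntax is not given in the abstract; a minimal sketch of the two-rule-set idea, with hypothetical rules and an assumed XML output layout, could pair an extraction rule set with a storage-structure rule set like this:

```python
# Hypothetical sketch of the two-rule-set parser idea: one rule set says
# what to pull out of the HTML, the other says how to structure the stored
# result. Rule contents and the XML layout are assumptions.
from lxml import etree, html

extraction_rules = {            # rule set 1: what to extract from the page
    "headline": "//h1/text()",
    "body":     "//div[@class='article']/p/text()",
}
output_structure = ["headline", "body"]   # rule set 2: shape of storage

def parse_article(page_source: str) -> bytes:
    tree = html.fromstring(page_source)
    record = etree.Element("article")
    for field in output_structure:
        node = etree.SubElement(record, field)
        node.text = " ".join(tree.xpath(extraction_rules[field]))
    return etree.tostring(record, pretty_print=True)
```

Keeping both rule sets as data means a change in a source's layout is handled by editing the rules, not the agent's code, which is what makes the parser reusable across sources.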


2019 ◽  
Vol 8 (2) ◽  
pp. 5275-5280

The rapid growth of web information and its services in different areas such as e-commerce, healthcare, digital marketing, and online booking makes it challenging to provide accurate information from domain services relevant to a user's query. Current web information services rely on classification to retrieve the relevant service, supported by knowledge about the specific service information. Because of these limitations, the complexity of the automatic update mechanisms needed to track this service information, and the large amount of unrelated service information returned for a requested query, obtaining the required web information of services is a cumbersome problem. This paper proposes a Semantic-based Terms Relation Approach (STRA) for the effective classification of web information services (WIS). The approach uses a Concept Terms Similarity (CTS) method to find the most relevant terms in a service domain and constructs a Related Terms Hierarchical Model (RTHM), which is then used for classification. A modified Naive Bayes classifier performs the classification of web information services using RTHM, so that services are categorized and presented accurately. Experimental evaluation of the proposed approach shows an improvement in the classification of information and highly relevant matching results across different numbers of user queries.
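STRA, CTS, and RTHM are the paper's own constructs and are not reproduced here; purely as a generic baseline for the Naive Bayes step, a toy service-description classifier (with invented categories and training texts) might look like this:

```python
# Generic Naive Bayes baseline, NOT the paper's modified classifier or its
# RTHM model. Categories and training texts are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "book hotel room online reservation",
    "doctor appointment clinic patient care",
    "buy shoes cart checkout discount",
]
train_labels = ["booking", "healthcare", "e-commerce"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["clinic appointment for patient"]))  # ['healthcare']
```

The paper's contribution sits upstream of this step: CTS and RTHM would re-weight and relate the terms before any such classifier sees them.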


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

The volume of information available on the web is growing significantly day by day. Information on the web takes several structural forms: structured, semi-structured, and unstructured. The majority of web information is presented in web pages, and the information in web pages is semi-structured. But the information required for a given context is scattered across different web documents, and it is difficult to analyze these large volumes of semi-structured information and to make decisions based on the analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis in support of effective decision making. It enables people and organizations to extract information from various web sources and to perform an effective analysis of the extracted data. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
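The framework is described only at the architecture level; a hypothetical end-to-end skeleton of the crawl, extract, and analyze-and-report stages (URLs, selectors, and the "analysis" are placeholders) could be:

```python
# Hypothetical skeleton of the crawl -> extract -> analyze -> report flow,
# not the paper's implementation. Every URL and selector is a placeholder.
from collections import Counter

import requests
from lxml import html

def crawl(urls):                      # stage 1: web crawling
    for url in urls:
        yield requests.get(url, timeout=10).text

def extract(pages):                   # stage 2: information extraction
    for page in pages:
        tree = html.fromstring(page)
        yield from tree.xpath("//h2[@class='product-name']/text()")

def report(names):                    # stage 3: simple mining + report
    counts = Counter(names)
    return counts.most_common(5)      # e.g., most frequently listed products

# print(report(extract(crawl(["https://example.com/catalog"]))))
```

Swapping the extraction rules and the analysis function is what would make such a pipeline reusable across the manufacturing, sales, tourism, and e-learning domains the paper mentions.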


2021 ◽  
Vol 7 ◽  
pp. 237802312098820
Author(s):  
Thurston Domina ◽  
Linda Renzulli ◽  
Brittany Murray ◽  
Alma Nidia Garza ◽  
Lysandra Perez

Using data from a spring 2020 survey of nearly 10,000 parents of elementary school students in one large southeastern public school district, the authors investigate predictors of elementary school student engagement during the initial period of pandemic remote learning. The authors hypothesize that household material and technological resources, school programming and instructional strategies, and family social capital contribute to student engagement in remote learning. The analyses indicate that even after controlling for rich measures of family socioeconomic resources, students with access to high-speed Internet and Internet-enabled devices have higher levels of engagement. Exposure to more diverse socioemotional and academic learning opportunities further predicts higher levels of engagement. In addition, students whose families remained socially connected to other students' families were more likely to engage online.

