A Method of Web Information Automatic Extraction Based on XML

2010 ◽  
Vol 20-23 ◽  
pp. 178-183
Author(s):  
Jun Hua Gu ◽  
Jie Song ◽  
Na Zhang ◽  
Yan Liu Liu

As the Internet grows ever faster and the amount of data it contains keeps increasing, users find it more and more difficult to obtain useful information from the web. How to extract accurate information from the Web efficiently has become an urgent problem, and Web information extraction technology has emerged to address it. The proposed method of automatic Web information extraction based on XML works by standardizing the HTML document with a data translation algorithm, forming an extraction rule base by learning the XPath expressions of sample pages, and then applying that rule base to automatically extract pages of the same kind. The results show that this approach leads to higher recall and precision, and that the extracted result is self-describing, which makes it convenient for building domain-specific data extraction systems.
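The paper's rule base is learned from sample pages; as a minimal sketch of the idea, assuming a hand-written rule base and hypothetical field names and markup, XPath-driven extraction over a normalized HTML tree might look like this:

```python
# Illustrative sketch only: the rule base here is hand-written, whereas the
# paper learns its XPath expressions from sample pages. Field names and the
# example markup are hypothetical.
from lxml import html

# A tiny "extraction rule base": one XPath expression per target field.
RULE_BASE = {
    "title": "//div[@class='product']/h1/text()",
    "price": "//span[@class='price']/text()",
}

def extract(page_source: str) -> dict:
    # lxml normalizes malformed HTML into a well-formed tree, which stands
    # in for the paper's HTML-standardization step.
    tree = html.fromstring(page_source)
    return {field: tree.xpath(xp) for field, xp in RULE_BASE.items()}

sample = "<div class='product'><h1>Vase</h1><span class='price'>12.50</span></div>"
print(extract(sample))  # {'title': ['Vase'], 'price': ['12.50']}
```

Because every same-kind page is queried with the same stored expressions, adapting the extraction only requires editing the rule base, not the code.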

2014 ◽  
Vol 989-994 ◽  
pp. 4322-4325
Author(s):  
Mu Qing Zhan ◽  
Rong Hua Lu

As a means of obtaining information from the Internet, Web information extraction technology differs from search engines in that it can retrieve more precise and more granular information. Based on an analysis of the state of Web information extraction technology at home and abroad, this article presents a technical route for extracting information about ceramic products from the Web, formulates the extraction rules, develops an extraction system, and acquires the relevant ceramic product information.
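The article's extraction rules are not published; as one invented example of what a single rule might look like, a regular expression capturing a ceramic product's price from a known markup pattern:

```python
# Hypothetical rule for illustration only: the field name, pattern, and
# markup are invented, not taken from the article's rule set.
import re

PRICE_RULE = re.compile(r'<span class="price">\s*¥?([\d.]+)\s*</span>')

snippet = '<span class="price">¥128.00</span>'
match = PRICE_RULE.search(snippet)
if match:
    print(match.group(1))  # 128.00
```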


2014 ◽  
Vol 614 ◽  
pp. 503-506
Author(s):  
Qi Shen ◽  
Qing Ming Song ◽  
Bo Chen

With the development of web technology, dynamic web pages and personalized page content have become increasingly common. Page content now changes constantly and the structures of different pages vary widely, so traditional approaches to web information extraction have difficulty adapting to the situation. Based on an analysis of the structural features of tourism-themed web pages, this paper proposes a web information extraction method built on an extended XPath policy. The algorithm avoids the defects of traditional web information extraction technology: it is simple and practical, cleans data efficiently and accurately, and reduces system overhead.
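The extended XPath policy itself is not spelled out in the abstract; one common way to tolerate structural variation across pages, sketched below under that assumption, is to attach an ordered list of candidate XPath expressions to each field and keep the first match (field names and expressions here are hypothetical):

```python
# Sketch of fallback-chain extraction, not the paper's actual policy:
# each field carries several candidate XPath rules, tried in order.
from lxml import html

CANDIDATE_RULES = {
    "hotel_name": [
        "//h1[@class='hotel-title']/text()",
        "//div[@id='hotel']//h1/text()",   # fallback for an older layout
    ],
}

def extract_field(tree, candidates):
    for xp in candidates:
        result = tree.xpath(xp)
        if result:                          # first matching rule wins
            return result[0].strip()
    return None

page = html.fromstring("<div id='hotel'><h1>Lakeview Inn</h1></div>")
print(extract_field(page, CANDIDATE_RULES["hotel_name"]))  # Lakeview Inn
```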


2021 ◽  
Vol 33 (3) ◽  
pp. 87-100
Author(s):  
Denis Eyzenakh ◽  
Anton Rameykov ◽  
Igor Nikiforov

Over the past decade, the Internet has become a gigantic and rich source of data. This data is used for knowledge extraction through machine learning analysis. To perform data mining on web information, the data must be extracted from the source and loaded into analytical storage; this is the ETL process. Different web sources offer different ways to access their data: either an API over the HTTP protocol or parsing of the HTML source code. The article is devoted to an approach for high-performance data extraction from sources that do not provide an API to access the data. Distinctive features of the proposed approach are load balancing, two levels of data storage, and separation of the file-downloading process from the scraping process. The approach is implemented in a solution built with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.
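As a minimal sketch of the separation between scraping and downloading, assuming a Redis list as the hand-off queue (the queue name, selectors, and addresses are hypothetical; the article's actual pipeline is more elaborate):

```python
# Sketch: the spider scrapes metadata but never downloads files itself;
# file URLs go onto a shared Redis queue for separate downloader workers.
import redis
import scrapy

queue = redis.Redis(host="localhost", port=6379)

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/catalog"]

    def parse(self, response):
        for item in response.css("div.item"):
            # Scrape lightweight metadata immediately ...
            yield {"title": item.css("h2::text").get()}
            # ... but defer heavy file downloads to a separate worker pool
            # by pushing the URL onto the shared queue.
            file_url = item.css("a.download::attr(href)").get()
            if file_url:
                queue.rpush("download_queue", response.urljoin(file_url))
```

A separate pool of downloader workers would then pop URLs from download_queue and write the files to shared storage (CephFS in the article's stack), so slow downloads never stall the scraping loop.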


2008 ◽  
pp. 469-484
Author(s):  
David Camacho ◽  
Ricardo Aler ◽  
Juan Cuadrado

How to build intelligent, robust applications that work with the information stored in the Web is a difficult problem for several reasons arising from the essential nature of the Web: the information is highly distributed, it is dynamic (in both content and format), it is usually not correctly structured, and web sources may be unreachable at times. To build robust and adaptable web systems, it is necessary to provide a standard representation for the information (e.g., using languages such as XML and ontologies to represent the semantics of the stored knowledge). However, this is still an open research field, and most web sources do not provide their information in a structured way. This chapter analyzes a new approach that allows us to build robust and adaptable web systems by using a multi-agent approach. Several problems, including how to retrieve, extract, and manage the stored information from web sources, are analyzed from an agent perspective. Two difficult problems will be addressed in this chapter: designing a general architecture to deal with the problem of managing web information sources, and enabling these agents to work semiautomatically, adapting their behavior to the dynamic conditions of the electronic sources. To achieve the first goal, a generic web-based multi-agent system (MAS) will be proposed and applied to a specific problem: retrieving and managing information from electronic newspapers. To partially solve the problem of retrieving and extracting web information, a semiautomatic web parser will be designed and deployed as a reusable software component. This parser uses two sets of rules to adapt the behavior of the web agent to possible changes in the web sources: the first defines the knowledge to be extracted from the HTML pages; the second represents the final structure in which the retrieved knowledge is stored. Using this parser, a specific web-based multi-agent system will be implemented.
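The chapter's rule syntax is not given in the abstract; a minimal sketch of the two-rule-set idea, with hypothetical rules and an assumed XML output layout, could pair an extraction rule set with a storage-structure rule set like this:

```python
# Hypothetical sketch of the two-rule-set parser idea: one rule set says
# what to pull out of the HTML, the other says how to structure the stored
# result. Rule contents and the XML layout are assumptions.
from lxml import etree, html

extraction_rules = {            # rule set 1: what to extract from the page
    "headline": "//h1/text()",
    "body":     "//div[@class='article']/p/text()",
}
output_structure = ["headline", "body"]   # rule set 2: shape of storage

def parse_article(page_source: str) -> bytes:
    tree = html.fromstring(page_source)
    record = etree.Element("article")
    for field in output_structure:
        node = etree.SubElement(record, field)
        node.text = " ".join(tree.xpath(extraction_rules[field]))
    return etree.tostring(record, pretty_print=True)
```

Keeping both rule sets as data means a change in a source's layout is handled by editing the rules, not the agent's code, which is what makes the parser reusable across sources.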


2019 ◽  
Vol 8 (2) ◽  
pp. 5275-5280

The rapid growth of web information and its services in different areas such as e-commerce, healthcare, digital marketing, and online booking makes it challenging to provide accurate information from domain services relevant to a user's query. Current web information services rely on classification to retrieve the relevant service, supported by knowledge about the specific service information. Because of these limitations, the complexity of the automatic update mechanisms needed to track this service information, and the large amount of unrelated service information returned for a requested query, obtaining the required web information of services is a cumbersome problem. This paper proposes a Semantic-based Terms Relation Approach (STRA) for the effective classification of web information services (WIS). The approach uses a Concept Terms Similarity (CTS) method to find the most relevant terms in a service domain and constructs a Related Terms Hierarchical Model (RTHM), which is then used for classification. A modified Naive Bayes classifier performs the classification of web information services using RTHM, so that services are categorized and presented accurately. Experimental evaluation of the proposed approach shows an improvement in the classification of information and highly relevant matching results across different numbers of user queries.
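STRA, CTS, and RTHM are the paper's own constructs and are not reproduced here; purely as a generic baseline for the Naive Bayes step, a toy service-description classifier (with invented categories and training texts) might look like this:

```python
# Generic Naive Bayes baseline, NOT the paper's modified classifier or its
# RTHM model. Categories and training texts are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "book hotel room online reservation",
    "doctor appointment clinic patient care",
    "buy shoes cart checkout discount",
]
train_labels = ["booking", "healthcare", "e-commerce"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["clinic appointment for patient"]))  # ['healthcare']
```

The paper's contribution sits upstream of this step: CTS and RTHM would re-weight and relate the terms before any such classifier sees them.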


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

The volume of information available on the web is growing significantly day by day. Information on the web takes several structural forms: structured, semi-structured, and unstructured. The majority of web information is presented in web pages, and the information in web pages is semi-structured. But the information required for a given context is scattered across different web documents, and it is difficult to analyze these large volumes of semi-structured information and to make decisions based on the analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis in support of effective decision making. It enables people and organizations to extract information from various web sources and to perform an effective analysis of the extracted data. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
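The framework is described only at the architecture level; a hypothetical end-to-end skeleton of the crawl, extract, and analyze-and-report stages (URLs, selectors, and the "analysis" are placeholders) could be:

```python
# Hypothetical skeleton of the crawl -> extract -> analyze -> report flow,
# not the paper's implementation. Every URL and selector is a placeholder.
from collections import Counter

import requests
from lxml import html

def crawl(urls):                      # stage 1: web crawling
    for url in urls:
        yield requests.get(url, timeout=10).text

def extract(pages):                   # stage 2: information extraction
    for page in pages:
        tree = html.fromstring(page)
        yield from tree.xpath("//h2[@class='product-name']/text()")

def report(names):                    # stage 3: simple mining + report
    counts = Counter(names)
    return counts.most_common(5)      # e.g., most frequently listed products

# print(report(extract(crawl(["https://example.com/catalog"]))))
```

Swapping the extraction rules and the analysis function is what would make such a pipeline reusable across the manufacturing, sales, tourism, and e-learning domains the paper mentions.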


2021 ◽  
Vol 7 ◽  
pp. 237802312098820
Author(s):  
Thurston Domina ◽  
Linda Renzulli ◽  
Brittany Murray ◽  
Alma Nidia Garza ◽  
Lysandra Perez

Using data from a spring 2020 survey of nearly 10,000 parents of elementary school students in one large southeastern public school district, the authors investigate predictors of elementary school student engagement during the initial period of pandemic remote learning. The authors hypothesize that household material and technological resources, school programming and instructional strategies, and family social capital contribute to student engagement in remote learning. The analyses indicate that even after controlling for rich measures of family socioeconomic resources, students with access to high-speed Internet and Internet-enabled devices have higher levels of engagement. Exposure to more diverse socioemotional and academic learning opportunities further predicts higher levels of engagement. In addition, students whose families remained socially connected to other students' families were more likely to engage online.

