Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model

This research aims to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, deduce the template used to generate those pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, followed by manual labeling of the extracted data records. The technique discussed in this paper departs from the state-of-the-art approaches by identifying informative sections of a web page through the repetition of informative content rather than syntactic structure. In the experiments, the system identified the data-rich region with 100% precision for web sites belonging to different domains. The experiments conducted on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.


A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition

PLoS ONE ◽

10.1371/journal.pone.0162702 ◽

2016 ◽

Vol 11 (9) ◽

pp. e0162702 ◽

Cited By ~ 4

Author(s):

Muhammad Hameed Siddiqi ◽

Md. Golam Rabiul Alam ◽

Choong Seon Hong ◽

Adil Mehmood Khan ◽

Hyunseung Choo

Keyword(s):

Facial Expression ◽

Markov Model ◽

Maximum Entropy ◽

Facial Expression Recognition ◽

Expression Recognition ◽

Human Facial Expression


A Framework to Analyze Business Process Log in XML Format

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i3.1264 ◽

2021 ◽

Vol 12 (3) ◽

pp. 2623-2630

Author(s):

Ang Jin Sheng et al.

Keyword(s):

Data Mining ◽

Business Process ◽

Accurate Result ◽

Structured Data ◽

Web Pages ◽

Web Searching ◽

General Application ◽

Data Mining Techniques ◽

Xml Documents ◽

The Relationship

XML has numerous uses in a wide variety of web pages and applications. Common uses include web publishing, web searching and automation, and general applications such as using, storing, transferring and displaying business process log data. The amount of information expressed in XML has grown rapidly. Much work has been done on practical approaches to handling and reviewing XML documents. Mining XML documents offers a way to understand both their structure and their content. A common approach to analysing XML documents is frequent subtree mining, a data mining technique that finds relationships between transactions in a tree-structured database. Because of the structure and content of the XML format, traditional data mining and statistical analysis can hardly be applied directly to obtain accurate results. This paper proposes a framework that flattens tree-structured data into flat, structured data while preserving its structure and content. Converting these XML documents into relational structured data allows a range of data mining techniques and statistical tests to be applied to extract more information from the business process log.
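As a rough illustration of what flattening tree-structured XML into flat records can look like, the sketch below turns each event of a small log into one row keyed by element path. It is not the framework proposed in the paper: the `flatten` helper, the path-based column naming, and the sample log (its tags and values) are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Map one XML subtree to a flat dict keyed by slash-separated paths.

    Hypothetical helper: attributes become "path@name" keys and element
    text becomes "path" keys. Repeated sibling tags would overwrite each
    other, which a real framework would have to handle.
    """
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    row = {}
    for name, value in element.attrib.items():
        row[f"{path}@{name}"] = value
    text = (element.text or "").strip()
    if text:
        row[path] = text
    for child in element:
        row.update(flatten(child, path))
    return row


# Invented sample log, used only to exercise the helper.
LOG = """<log>
  <event id="1"><task>Review</task><actor><name>Ana</name></actor></event>
  <event id="2"><task>Approve</task><actor><name>Ben</name></actor></event>
</log>"""

root = ET.fromstring(LOG)
rows = [flatten(event) for event in root.findall("event")]
for row in rows:
    print(row)
```

Each row keeps the original nesting in its column names (for example `event/actor/name`), so the flat table can be passed to ordinary tabular tools without losing where a value came from.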


A motifs-based Maximum Entropy Markov Model for realtime reliability prediction in System of Systems

Journal of Systems and Software ◽

10.1016/j.jss.2019.02.023 ◽

2019 ◽

Vol 151 ◽

pp. 180-193 ◽

Cited By ~ 3

Author(s):

Hongbing Wang ◽

Huanhuan Fei ◽

Qi Yu ◽

Wei Zhao ◽

Jia Yan ◽

...

Keyword(s):

Markov Model ◽

Maximum Entropy ◽

System Of Systems ◽

Reliability Prediction


Automatic Recognition of Human Interaction via Hybrid Descriptors and Maximum Entropy Markov Model Using Depth Sensors

Entropy ◽

10.3390/e22080817 ◽

2020 ◽

Vol 22 (8) ◽

pp. 817 ◽

Cited By ~ 4

Author(s):

Ahmad Jalal ◽

Nida Khalid ◽

Kibum Kim

Keyword(s):

Markov Model ◽

Maximum Entropy ◽

Social Activity ◽

Gaussian Mixture ◽

Features Extraction ◽

Human Interaction ◽

Cross Entropy ◽

Automatic Identification ◽

Depth Sensors ◽

Entropy Optimization

Automatic identification of human interaction from video sequences is a challenging task, especially in dynamic environments with cluttered backgrounds. Advancements in computer vision sensor technologies have a powerful effect on human interaction recognition (HIR) during routine daily life. In this paper, we propose a novel feature extraction method which incorporates robust entropy optimization and an efficient Maximum Entropy Markov Model (MEMM) for HIR via multiple vision sensors. The main objectives of the proposed methodology are: (1) to propose a hybrid of four novel features (spatio-temporal features, energy-based features, shape-based angular and geometric features) and a motion-orthogonal histogram of oriented gradients (MO-HOG); (2) to encode hybrid feature descriptors using a codebook, a Gaussian mixture model (GMM) and Fisher encoding; (3) to optimize the encoded features using a cross entropy optimization function; (4) to apply a MEMM classification algorithm to examine empirical expectations and highest entropy, which measure pattern variances, in order to achieve improved HIR accuracy. Our system is tested on three well-known datasets: SBU Kinect interaction, UoL 3D social activity and UT-Interaction. In extensive experiments, the proposed feature extraction algorithm, along with cross entropy optimization, achieved average accuracy rates of 91.25% on SBU, 90.4% on UoL and 87.4% on UT-Interaction. The proposed HIR system will be applicable to a wide variety of man–machine interfaces, such as public-place surveillance, future medical applications, virtual reality, fitness exercises and 3D interactive gaming.


A Maximum Entropy Markov Model for Prediction of Prosodic Phrase Boundaries in Chinese TTS

2007 IEEE International Conference on Granular Computing (GRC 2007) ◽

10.1109/grc.2007.4403149 ◽

2007 ◽

Author(s):

Ziping Zhao ◽

Ti Zhao ◽

Yaoting Zhu

Keyword(s):

Markov Model ◽

Maximum Entropy ◽

Prosodic Phrase


A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics - MDS '12 ◽

10.1145/2350190.2350199 ◽

2012 ◽

Cited By ~ 4

Author(s):

Dandan Song ◽

Yunpeng Wu ◽

Lejian Liao ◽

Long Li ◽

Fei Sun

Keyword(s):

Structured Data ◽

Web Pages ◽

Dynamic Learning ◽

Learning Framework


Improving Performance of DOM in Semi-structured Data Extraction using WEIDJ Model

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v9.i3.pp752-763 ◽

2018 ◽

Vol 9 (3) ◽

pp. 752 ◽

Cited By ~ 2

Author(s):

Ily Amalina Ahmad Sabri ◽

Mustafa Man

Keyword(s):

Data Extraction ◽

Extraction Process ◽

Structured Data ◽

Web Pages ◽

Web Page ◽

Web Data ◽

Web Documents ◽

Web Extraction ◽

Comparison Time ◽

The Web

Web data extraction is the process of extracting user-required information from web pages. The information consists of semi-structured data rather than data in a structured format, and the extraction operates on web documents in HTML format. Nowadays, most people use web data extractors because extraction involves large amounts of information, which makes manual information extraction time-consuming and complicated. In this paper we present the WEIDJ approach to extracting images from the web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process leverages both the input web address and the structure of extraction. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over visual blocks for each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages and the whole web page. Extensive experiments on several biodiversity web pages were conducted to compare the time performance of image extraction using DOM, JSON and WEIDJ for a single web page. The experimental results indicate that, with our model, WEIDJ image extraction can be done quickly and effectively.
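The general idea of walking an HTML document to collect images and emitting them as JSON can be sketched with the Python standard library, as below. This is only a minimal illustration, not the WEIDJ implementation: it does no subtree splitting or visual-block search, and the `ImageCollector` class and the sample page are assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper that records every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        source = attributes.get("src")
        if source:
            # Keep only the fields needed for a small JSON record.
            self.images.append({"src": source, "alt": attributes.get("alt") or ""})


# Invented template-style page, used only to exercise the collector.
PAGE = """<html><body>
<div class="species"><img src="img/frog.jpg" alt="Tree frog"><p>Tree frog</p></div>
<div class="species"><img src="img/moth.jpg" alt="Atlas moth"><p>Atlas moth</p></div>
</body></html>"""

collector = ImageCollector()
collector.feed(PAGE)
images_json = json.dumps(collector.images, indent=2)
print(images_json)
```

A page fetched from a real web address could be passed to `feed` in the same way; the sample string simply keeps the sketch self-contained.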


Enrichment of Remote Homology Detection using Cascading Maximum Entropy Markov Model

International Journal of Current Research and Review ◽

10.31782/ijcrr.2021.131906 ◽

2021 ◽

Vol 13 (19) ◽

pp. 80-84

Author(s):

Manikandan P ◽

Ramyachitra D ◽

Muthu C ◽

Sajithra N

Keyword(s):

Markov Model ◽

Maximum Entropy ◽

Homology Detection ◽

Remote Homology ◽

Remote Homology Detection


Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model

Application Study of Hidden Markov Model and Maximum Entropy in Text Information Extraction

A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling

A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition

A Framework to Analyze Business Process Log in XML Format

A motifs-based Maximum Entropy Markov Model for realtime reliability prediction in System of Systems

Automatic Recognition of Human Interaction via Hybrid Descriptors and Maximum Entropy Markov Model Using Depth Sensors

A Maximum Entropy Markov Model for Prediction of Prosodic Phrase Boundaries in Chinese TTS

A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

Improving Performance of DOM in Semi-structured Data Extraction using WEIDJ Model

Enrichment of Remote Homology Detection using Cascading Maximum Entropy Markov Model
