A Study of XML Models for Data Mining

Data Mining ◽  
2013 ◽  
pp. 1-27
Author(s):  
Sangeetha Kutty ◽  
Richi Nayak ◽  
Tien Tran

With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information in these documents. Data mining techniques can be used to derive this information. However, because XML documents are semi-structured, mining them is affected by the data model used to represent them. In this chapter, we present an overview of the various models for representing XML documents, how these models are used for mining, and some of the issues and challenges inherent in these models. In addition, this chapter provides some insights into future data models of XML documents that effectively capture their two important features, structure and content, for mining.



Data Mining ◽  
2013 ◽  
pp. 669-691 ◽  
Author(s):  
Evgeny Kharlamov ◽  
Pierre Senellart

This chapter deals with data mining in uncertain XML data models, whose uncertainty typically comes from imprecise automatic processes. We first review the literature on modeling uncertain data, starting with well-studied relational models and moving then to their semistructured counterparts. We focus on a specific probabilistic XML model, which allows representing arbitrary finite distributions of XML documents, and has been extended to also allow continuous distributions of data values. We summarize previous work on querying this uncertain data model and show how to apply the corresponding techniques to several data mining tasks, exemplified through use cases on two running examples.
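As a rough illustration of what a probabilistic XML tree can encode, the sketch below assumes a simplified model in which every node carries an independent probability of existing given that its parent exists. This is an assumption made for the example, not necessarily the chapter's model; the tuple layout, the function name `node_probabilities`, and the sample tree are all hypothetical. Under that assumption, the probability that a node appears in a random document is the product of the probabilities along its path from the root.

```python
def node_probabilities(node, inherited=1.0, path=""):
    """Map each node path to its marginal probability of existing.

    A node is a tuple (label, probability given parent, children).
    Assumes independent choices and distinct sibling labels.
    """
    label, prob, children = node
    here = f"{path}/{label}"
    p = inherited * prob  # the node exists only if all its ancestors exist
    result = {here: p}
    for child in children:
        result.update(node_probabilities(child, p, here))
    return result


# Hypothetical uncertain record: the city was extracted with confidence 0.5,
# and the zip code below it with confidence 0.5 given that the city is right.
tree = ("person", 1.0, [
    ("name", 1.0, []),
    ("city", 0.5, [("zip", 0.5, [])]),
])
probs = node_probabilities(tree)
```

A mining task could then, for instance, weight each extracted value by its marginal probability instead of treating all values as certain.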




Cricket is one of the most popular games in many countries. As many as 19 countries play cricket as their main game, and the number is likely to increase in the future. However, there are no suitable tools for analyzing the likely outcome of a match in advance, from beginning to end. Existing tools do not support simulating a match using batting partnerships. The ultimate goal of predicting the outcome of a cricket match in advance is to identify key players and their batting performances. A further goal is to use statistical predictions to avoid selecting the wrong players and making poor toss decisions. This research focuses on One Day International (ODI) cricket matches and predicts the outcome of a particular match. Our solution consists of three major modules, namely, the Web UI Module, the CRIC-Win Analytic Engine, and the Backend Data Module. The CRIC-Win Analytic Engine has two sub-models, one for predicting the overall match outcome based on given pre-match data, and the other for predicting the match outcome based on the batting partnerships of both the home and rival teams. All sub-models in the CRIC-Win Analytic Engine are developed using the Naïve Bayes algorithm for generating the classifier model, which is used to predict the outcome of the cricket match.
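A Naïve Bayes classifier of the kind the abstract mentions can be sketched in a few lines. The version below is a generic categorical Naïve Bayes with add-one (Laplace) smoothing, not the CRIC-Win implementation; the two features (toss result, venue) and the six training rows are hypothetical and chosen only to make the example easy to follow.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            numerator = feature_counts[(label, i)][value] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best, best_score = label, score
    return best


# Hypothetical pre-match data: (toss result, venue) -> match result.
rows = [("won", "home"), ("won", "home"), ("won", "away"),
        ("lost", "away"), ("lost", "away"), ("lost", "home")]
labels = ["win", "win", "win", "loss", "loss", "loss"]
model = train(rows, labels)
prediction = predict(model, ("won", "home"))
```

Working in log space avoids numerical underflow when many features are multiplied together, which matters once the feature set grows beyond a toy example.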


Author(s):  
Ang Jin Sheng et al.

XML has numerous uses in a wide variety of web pages and applications. Common uses of XML include web publishing, web searching and automation, and general applications such as using, storing, transferring, and displaying business process log data. The amount of information expressed in XML has grown rapidly. Much work has been done on sensible approaches to the handling and review of XML documents. Mining XML documents offers a way to understand both the structure and the content of XML documents. A common approach to analysing XML documents is frequent subtree mining. Frequent subtree mining is a data mining technique that finds relationships between transactions in a tree-structured database. Because of the structure and content of the XML format, traditional data mining and statistical analysis can hardly be applied to obtain accurate results. This paper proposes a framework that flattens tree-structured data into flat, structured data while preserving its structure and content. Converting these XML documents into relational structured data allows a range of data mining techniques and statistical tests to be applied to extract more information from the business process log.
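One simple way to flatten a tree into a row while keeping its structure is to use the path from the record root as the column name. The sketch below illustrates that idea with Python's standard `xml.etree.ElementTree`; it is not the framework proposed in the paper, and the function name `flatten`, the path syntax, and the sample log are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def flatten(element, path=None):
    """Return {column: value} for every attribute and leaf element.

    Column names are positional paths such as trace/event[2]/name[1], so the
    tree structure can still be read from the flat row. Text mixed in between
    child elements is ignored in this sketch.
    """
    path = path or element.tag
    row = {f"{path}/@{name}": value for name, value in element.attrib.items()}
    children = list(element)
    if not children:
        row[path] = (element.text or "").strip()
        return row
    seen = Counter()
    for child in children:
        seen[child.tag] += 1  # position among siblings with the same tag
        row.update(flatten(child, f"{path}/{child.tag}[{seen[child.tag]}]"))
    return row


# Hypothetical business process log with one record per trace.
LOG = """<log>
  <trace id="t1">
    <event><name>submit</name><cost>10</cost></event>
    <event><name>approve</name><cost>5</cost></event>
  </trace>
  <trace id="t2">
    <event><name>submit</name><cost>7</cost></event>
  </trace>
</log>"""

rows = [flatten(trace) for trace in ET.fromstring(LOG)]
```

Each dictionary in `rows` can be treated as one row of a relational table, with missing columns read as nulls for traces that have fewer events.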


Author(s):  
Carey Goh ◽  
Henry M.K. Mok ◽  
Rob Law

The tourism industry has become one of the fastest growing industries in the world, with international tourism flows having more than doubled between 1980 and 2006. In terms of direct economic benefits, the United Nations World Tourism Organization (UNWTO, 2007) estimated that the industry generated US $735 billion through tourism in 2006. Through multiplier effects, the World Travel and Tourism Council (WTTC, 2007) estimated that tourism would generate economic activity worth approximately US $5,390 billion in 2007 (10.4% of world GDP). Owing to the important economic contribution of the tourism industry, researchers, policy makers, planners, and industrial practitioners have been trying to analyze and forecast tourism demand. The perishable nature of tourism products and services, the information-intensive nature of the tourism industry, and the long lead-time investment planning of equipment and infrastructure all render accurate forecasting of tourism demand necessary (Law, Mok, & Goh, 2007). Past studies have predominantly applied well-developed econometric techniques to measure and predict future market performance in terms of the number of tourist arrivals in a specific destination. In this chapter, we aim to present an overview of studies that have adopted artificial intelligence (AI) data-mining techniques in studying tourism demand forecasting. Our objective is to review and trace the evolution of such techniques employed in tourism demand studies since 1999; based on our observations from the review, a discussion of the future direction of tourism research techniques and methods is then provided. Although the adoption of data mining techniques in tourism demand forecasting is still in its infancy, from the review we identify certain research gaps, draw certain key observations, and discuss possible future research directions.


Author(s):  
Francisco Javier Villar Martín ◽  
Jose Luis Castillo Sequera ◽  
Miguel Angel Navarro Huerga

The quality of a company's information system, and of its physical data model, is essential. In this article, the authors apply data mining techniques in order to generate knowledge from the information system's data model, and also to discover and understand hidden patterns within the data that regulate the planning of flight hours of pilots and copilots in an airline. With this objective, they use the free software Weka, which offers a set of algorithms and visualization tools geared to data analysis and predictive modeling of information systems. First, they apply clustering to study the information system and analyze the data model; second, they apply association rules to discover connection patterns in the data; and finally, they generate a decision tree to classify and extract more specific patterns. The authors draw conclusions from this information system's data to improve future decision making in an airline's flight-hour assignments.


2021 ◽  
Vol 13 (3) ◽  
pp. 71-85
Author(s):  
Sunčica Rogić ◽  
Ljiljana Kašćelan

This paper seeks to compare certain customer segments from two sport footwear, apparel, and equipment retailers and to examine an objective market segmentation method based on the recency, frequency, monetary (RFM) and decision tree (DT) models. The case study is based on two data sets, both from the sports retail industry, with the aim of comparing the different customer segments, and represents an application of data mining techniques in a business environment. Customer segmentation enables the selection of customers for future direct marketing campaigns based on previous purchasing behavior. Analyzing the customers' purchasing history can help the company determine the value of each customer and therefore decide whether or not to target such customers with promotional materials in the future, based on both the customers' interests and their value. Thus, based on the results, personalized offers can be created for each of the defined customer groups, which may increase the efficiency of the overall campaign, reduce costs, and increase profitability.
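The RFM part of such a segmentation reduces each customer's purchase history to three numbers. The sketch below shows one common way of computing them and is not taken from the paper; the function name `rfm`, the transaction layout, and the sample purchases are hypothetical. A decision tree, as used in the study, would be trained on values like these in a separate step.

```python
from datetime import date


def rfm(transactions, today):
    """Return {customer: (recency in days, frequency, monetary total)}.

    Each transaction is a tuple (customer id, purchase date, amount).
    """
    table = {}
    for customer, day, amount in transactions:
        last, frequency, monetary = table.get(customer, (None, 0, 0.0))
        if last is None or day > last:
            last = day  # keep the most recent purchase date
        table[customer] = (last, frequency + 1, monetary + amount)
    return {
        customer: ((today - last).days, frequency, monetary)
        for customer, (last, frequency, monetary) in table.items()
    }


# Hypothetical purchase history for two customers.
transactions = [
    ("c1", date(2021, 1, 10), 50.0),
    ("c1", date(2021, 3, 1), 30.0),
    ("c2", date(2020, 12, 1), 200.0),
]
scores = rfm(transactions, today=date(2021, 3, 31))
```

Here `c1` is a recent, frequent, low-spend customer and `c2` an older, one-off, high-spend customer, which is the kind of contrast a segmentation model would then separate.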


Author(s):  
Edy Satria ◽  
Heru Satria Tambunan ◽  
Ilham Syahputra Saragih ◽  
Irfan Sudahri Damanik ◽  
Fany Than Ervina Sitanggang

The Indonesian tourism sector currently contributes approximately 4% of the total economy. In 2019, the Indonesian Government wants to double this figure to 8% of GDP (Produk Domestik Bruto, PDB), a target that implies that within the next 4 years the number of visitors needs to double to approximately 20 million tourists. This study discusses the application of clustering to group the number of foreign tourist visits by nationality and month of arrival using the K-Means method. The research data were collected from figures on the number of foreign tourist visits produced by the National Statistics Agency. K-Means clustering is a data mining technique that describes the cluster to which an item belongs. The purpose of this study is to classify the number of foreign tourists in Indonesia. The result of this study is a grouping of foreign tourist visits into two clusters (high and low): a high cluster of 4 countries and a low cluster of 87 countries. The countries in the low cluster can guide the Government of Indonesia in improving existing facilities at tourist attractions so that tourist visits increase in the future.
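Grouping countries into a high and a low cluster by visit counts can be sketched with a one-dimensional K-Means. The code below is a minimal pure-Python version of Lloyd's algorithm, not the study's implementation; the function names, the deterministic initialization, and the visit counts are hypothetical and do not come from the study's data.

```python
def assign(values, centroids):
    """Label each value with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))
        for value in values
    ]


def kmeans_1d(values, k=2, iterations=100):
    """Cluster numbers into k >= 2 groups; return (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        labels = assign(values, centroids)
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return assign(values, centroids), centroids


# Hypothetical yearly visit counts (in thousands) for five countries.
visits = [1200, 1100, 90, 60, 75]
labels, centroids = kmeans_1d(visits, k=2)
```

With this initialization, cluster 0 starts at the smallest value and cluster 1 at the largest, so label 1 can be read as the high cluster and label 0 as the low cluster.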


Author(s):  
Charles M. Eastman

Data modeling is proposed as a means to address the complexities of CAD/CAM databases. The distinctions between product data models and those in general databases are reviewed. A data model, called EDM, is presented that incorporates features designed to respond to these differences. A review is given of data modeling analyses carried out thus far regarding database extensibility, including support for open-ended knowledge domains.

