Advances in Data Mining and Database Management - XML Data Mining
Latest Publications


Total documents: 19 (five years: 0)
H-index: 1 (five years: 0)

Published By IGI Global

9781613503560, 9781613503577

Author(s): Novita Ikasari, Fedja Hadzic, Tharam S. Dillon

Credit risk assessment has long been one of the most prominent topics in banking and finance, attracting the attention of both scholars and practitioners. Following the success of the Grameen Bank, work on credit risk, in particular for Small and Medium Enterprises (SMEs), has become essential. The distinctive character of SMEs requires a method that takes both quantitative and qualitative information into account in loan-granting decisions. In this chapter, we first survey existing credit risk assessment methods; the survey reveals a gap in current research with regard to using qualitative information during the data mining process. To address this shortcoming, we propose a framework that uses an XML-based template to capture both qualitative and quantitative information in this domain. Representing this information in a domain-oriented way maximizes the potential knowledge that can be discovered for evidence-based decision support. An XML document can be effectively represented as a rooted ordered labelled tree, and a number of tree mining methods exist that enable the efficient discovery of associations among tree-structured data objects, taking both content and structure into account. We provide guidelines for the correct and effective application of such methods in order to gain detailed insight into the information governing the decision-making process. We obtained a number of textual reports from banks describing the information collected from SMEs during the credit application and evaluation process. These serve as the basis for generating a synthetic XML database that partially reflects real-world scenarios. A tree mining method is applied to this data to demonstrate the potential of the proposed method for credit risk assessment.
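As an illustration of the tree view of XML mentioned in this abstract, the following minimal sketch converts a small XML fragment into a rooted ordered labelled tree and lists its labels in pre-order. The element names in SAMPLE and the helper names are hypothetical, chosen only for illustration; they are not the chapter's template or method. Text content is stored as a leaf node so that structure and content both survive, and tail text is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical credit-application fragment; tag names are illustrative only.
SAMPLE = (
    "<application>"
    "<business><sector>retail</sector><years>4</years></business>"
    "<owner><reputation>good</reputation></owner>"
    "</application>"
)


def to_labelled_tree(element):
    """Return (label, children) with children kept in document order."""
    children = [to_labelled_tree(child) for child in element]
    text = (element.text or "").strip()
    if text:
        # Represent text content as a leaf node placed before child elements.
        children.insert(0, (text, []))
    return (element.tag, children)


def preorder_labels(tree):
    """List node labels in pre-order (root first, then children left to right)."""
    label, children = tree
    labels = [label]
    for child in children:
        labels.extend(preorder_labels(child))
    return labels


tree = to_labelled_tree(ET.fromstring(SAMPLE))
print(preorder_labels(tree))
```

A pre-order label sequence of this kind is one common way to hand tree-structured records to a tree mining routine; other encodings are possible.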


Author(s): Pasquale De Meo, Antonino Nocera, Domenico Ursino

Handling interoperability issues across multiple, heterogeneous XML sources is central to XML data management and mining. In this chapter, we present a framework for the intensional integration and exploration of XML sources. Specifically, we propose a three-layer framework that extracts interschema knowledge from the available sources, constructs a hierarchy from the extracted knowledge to represent the sources at different abstraction levels, and finally organizes and explores the sources through the constructed hierarchy. We also describe possible implementations of each of the three layers, focusing on the extraction of intensional interschema properties, the intensional integration of XML sources, and the clustering of XML schemas. To better handle the complexity of its activities, the proposed framework has been designed using the layered architecture pattern and the component-based development paradigm.


Author(s): Francesco Gullo, Giovanni Ponti, Sergio Greco

In this chapter we address the problem of clustering XML documents in a collaborative distributed environment. We develop a clustering framework for XML sources distributed over a P2P network. XML documents are modeled using a transactional representation that captures both XML structure and content information. The clustering method employs a centroid-based partitional scheme suitably adapted to work on a P2P network. Each peer computes a clustering solution over its local repository and exchanges the resulting cluster representatives with the other peers. The exchanged cluster representatives are then used to compute the global clustering solution in a collaborative way. The effectiveness and efficiency of the framework were evaluated on real XML document collections while varying the number of peers. Experimental results show that our collaborative distributed algorithm significantly reduces execution time with respect to the centralized clustering setting, while producing clustering solutions that remain accurate with a moderately low number of nodes in the network.
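The local-then-global scheme described in this abstract can be sketched roughly as follows. This is not the authors' algorithm: plain numeric vectors stand in for the chapter's transactional XML representation, weighted k-means stands in for their centroid-based partitional scheme, and all function names and data are hypothetical. Each peer clusters its own points, only the centroids and cluster sizes are exchanged, and the global solution is computed from those weighted representatives.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, weights, centroids):
    """Group (point, weight) pairs by index of the nearest centroid."""
    groups = [[] for _ in centroids]
    for p, w in zip(points, weights):
        nearest = min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
        groups[nearest].append((p, w))
    return groups


def weighted_kmeans(points, weights, k, max_iters=50):
    """Weighted k-means; the first k points seed the centroids (deterministic)."""
    centroids = [tuple(float(x) for x in p) for p in points[:k]]
    for _ in range(max_iters):
        updated = []
        for j, group in enumerate(assign(points, weights, centroids)):
            if not group:
                updated.append(centroids[j])  # keep an empty cluster's centroid
                continue
            total = sum(w for _, w in group)
            dims = len(group[0][0])
            updated.append(
                tuple(sum(p[i] * w for p, w in group) / total for i in range(dims))
            )
        if updated == centroids:
            break
        centroids = updated
    sizes = [sum(w for _, w in g) for g in assign(points, weights, centroids)]
    return centroids, sizes


def collaborative_clustering(peers, k):
    """Cluster locally on each peer, then cluster the exchanged representatives."""
    reps, rep_weights = [], []
    for local_points in peers:
        centroids, sizes = weighted_kmeans(local_points, [1] * len(local_points), k)
        for centroid, size in zip(centroids, sizes):
            if size > 0:  # only non-empty clusters are exchanged
                reps.append(centroid)
                rep_weights.append(size)
    return weighted_kmeans(reps, rep_weights, k)


# Two hypothetical peers, each holding points from the same two groups.
peer_a = [(0, 0), (10, 10), (0, 1), (10, 11)]
peer_b = [(1, 0), (11, 10), (1, 1), (11, 11)]
global_centroids, global_sizes = collaborative_clustering([peer_a, peer_b], k=2)
print(global_centroids, global_sizes)
```

Weighting each representative by its cluster size makes the merged centroid equal to the mean of the underlying points, which is why only summaries, rather than documents, need to travel between peers in this toy setting.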


Author(s): Rafael Berlanga, Victoria Nebot

This chapter describes the convergence of two influential technologies of the last decade, namely data mining (DM) and the Semantic Web (SW). The wide acceptance of new SW formats for describing semantics-aware and semistructured content has spurred the massive generation of semantic annotations and of large-scale domain ontologies for conceptualizing them. As a result, a huge amount of both knowledge and semantically annotated data is available on the Web. DM methods have been very successful in discovering interesting patterns hidden in very large amounts of data. However, DM methods have largely been based on simple, flat data formats, which are far from those available in the SW. This chapter reviews and discusses the main DM approaches proposed so far for mining SW data, as well as those that use SW resources and tools to define semantics-aware methods.


Author(s): Panagiotis Antonellis

The wide use of XML as the de facto standard for storing and exchanging information over the Internet has led a wide spectrum of heterogeneous applications to adopt XML as their information representation model. The heterogeneity of XML data sources has raised the problem of efficiently clustering a set of XML documents. However, traditional clustering algorithms cannot be applied directly because of the semistructured nature of XML, which combines structure and content features. Hence, special techniques that take XML semantics into account are needed to address the problem of XML clustering. The approaches described in this chapter, based on structure, on content, or on both, successfully address the problem and can be applied efficiently in real-world applications.


Author(s): Pasquale De Meo, Giacomo Fiumara, Antonino Nocera, Domenico Ursino

In recent years, there has been an increase in the volume and heterogeneity of XML data sources. Moreover, these information sources often comprise both schemas and instances of XML data. In this context, the need to group similar XML documents together has led to increasing research on clustering algorithms for XML data. In this chapter, we present an overview of the most popular methods for clustering XML data sources, distinguishing between the intensional data level and the extensional data level, depending on whether the sources to cluster are DTDs and XML schemas, or XML documents; in the latter case, we focus on the structural information of the documents. We classify and describe techniques for computing similarities among XML data sources, and discuss methods for clustering DTDs/XML schemas and XML documents.
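One simple way to compare XML documents by structure alone, given here as a hedged sketch rather than as any specific method surveyed in the chapter, is to collect each document's root-to-node tag paths and take the Jaccard similarity of the two path sets. The function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths, e.g. '/book/title'."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc_a = "<book><title/><author/></book>"
doc_b = "<book><title/><year/></book>"
doc_c = "<song><artist/></song>"

sim_ab = jaccard(tag_paths(doc_a), tag_paths(doc_b))
sim_ac = jaccard(tag_paths(doc_a), tag_paths(doc_c))
print(sim_ab, sim_ac)
```

A pairwise similarity of this kind can feed any standard clustering algorithm; it ignores element order, repetition, and text content, which richer measures such as tree edit distance take into account.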


Author(s): Sangeetha Kutty, Richi Nayak, Tien Tran

With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information in these documents. Data mining techniques can be used to derive this information. However, because XML documents are semi-structured, the mining process is affected by the data model used to represent them. In this chapter, we present an overview of the various models for representing XML documents, how these models are used for mining, and some of the issues and challenges inherent in them. In addition, this chapter provides some insights into future data models for XML documents that effectively capture their two important features, structure and content, for mining.


Author(s): Mirjana Mazuran, Elisa Quintarelli, Angelo Rauseo, Letizia Tanca

In this work we describe the TreeRuler tool, which makes it possible for inexperienced users to access huge XML (or relational) datasets. TreeRuler has two main features: (1) it mines all the frequent association rules from input documents without any a priori specification of the desired results, and (2) it uses the previously mined knowledge to provide quick, summarized, and thus often approximate answers to users' queries. TreeRuler has been developed within the Odyssey EU project, which deals with information about crimes, for both the relational and the XML data models. In this chapter we focus mainly on the objectives, strategies, and difficulties encountered in the XML context.


Author(s): Qin Ding, Gnanasekaran Sundarraj

Finding frequent patterns and association rules in large datasets has become a very important task in data mining. Various algorithms have been proposed to solve such problems, but most are applicable only to relational data. With the increasing use and popularity of XML, it is important, yet challenging, to find solutions for frequent pattern discovery and association rule mining on XML data. The challenge comes from the complexity of the structure of XML data. In this chapter, we provide an overview of state-of-the-art research in content-based and structure-based mining of frequent patterns and association rules from XML data. We also discuss the challenges and open issues, and offer our insights into solutions and future research directions.
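To make content-based mining concrete, the sketch below treats each hypothetical order element as a transaction of item values, counts itemsets of size one and two by brute-force enumeration, and computes support and rule confidence. It is a toy illustration under these assumptions: the XML layout and function names are invented, and the enumeration is a stand-in for the pruning-based algorithms the chapter surveys.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical XML data; tag names and values are illustrative only.
ORDERS = """<orders>
  <order><item>bread</item><item>milk</item></order>
  <order><item>bread</item><item>milk</item><item>eggs</item></order>
  <order><item>bread</item></order>
  <order><item>milk</item><item>eggs</item></order>
</orders>"""


def transactions(xml_text):
    """Turn each child of the root into a set of its item values."""
    root = ET.fromstring(xml_text)
    return [frozenset(item.text for item in order.iter("item")) for order in root]


def frequent_itemsets(txns, min_support, max_size=2):
    """Map each itemset (up to max_size) meeting min_support to its support."""
    counts = Counter()
    for txn in txns:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(txn), size):
                counts[frozenset(combo)] += 1
    total = len(txns)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


def confidence(supports, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(A and C) / support(A)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return supports[both] / supports[frozenset(antecedent)]


supports = frequent_itemsets(transactions(ORDERS), min_support=0.5)
eggs_to_milk = confidence(supports, {"eggs"}, {"milk"})
print(supports)
print(eggs_to_milk)
```

Structure-based mining replaces the flat item sets used here with subtrees or paths, which is where the structural complexity mentioned in the abstract comes in.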


Author(s): Albert Bifet, Ricard Gavaldà

Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is, when concepts drift or change completely, is becoming one of the core issues. At the same time, closure-based mining on relational data has recently led to some interesting algorithmic developments as well as practical uses. In this chapter we show how to use closure-based mining to drastically reduce the number of attributes in XML tree classification tasks. Moreover, by using maximal frequent trees, we further reduce the number of attributes needed in tree classification, in many cases without losing accuracy. We present a general framework for classifying XML trees using subtree occurrence, combining a Tree XML Closed Frequent Miner with a classifier algorithm. We also present specific methods that can adaptively mine closed patterns from data streams that change over time.
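The idea of classifying XML trees by subtree occurrence can be illustrated with a minimal sketch. Everything here is an assumption for illustration: parent-child tag pairs stand in for the closed or maximal frequent subtrees a real miner would produce, the pattern list is fixed by hand rather than mined, and a one-nearest-neighbour rule with Hamming distance stands in for the classifier. It is not the chapter's framework.

```python
import xml.etree.ElementTree as ET


def edges(xml_text):
    """Return the set of (parent tag, child tag) pairs in the document."""
    root = ET.fromstring(xml_text)
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


def to_vector(xml_text, patterns):
    """Binary attribute vector: 1 if the pattern occurs in the tree, else 0."""
    present = edges(xml_text)
    return [1 if pattern in present else 0 for pattern in patterns]


def predict_1nn(vector, training):
    """Label of the training vector closest to `vector` in Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    return min(training, key=lambda pair: hamming(vector, pair[0]))[1]


# Hand-picked patterns standing in for mined frequent subtrees.
PATTERNS = [("book", "author"), ("book", "isbn"), ("article", "journal")]

training = [
    (to_vector("<book><author/><isbn/></book>", PATTERNS), "book"),
    (to_vector("<article><author/><journal/></article>", PATTERNS), "article"),
]

query = to_vector("<book><author/><title/></book>", PATTERNS)
label = predict_1nn(query, training)
print(query, label)
```

Because each pattern becomes one attribute, shrinking the pattern set (for example by keeping only closed or maximal patterns) directly shrinks the vectors the classifier has to handle.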

