Modeling, Querying, and Mining Uncertain XML Data

Author(s):  
Evgeny Kharlamov ◽  
Pierre Senellart

This chapter deals with data mining in uncertain XML data models, whose uncertainty typically comes from imprecise automatic processes. We first review the literature on modeling uncertain data, starting with well-studied relational models and moving then to their semistructured counterparts. We focus on a specific probabilistic XML model, which allows representing arbitrary finite distributions of XML documents, and has been extended to also allow continuous distributions of data values. We summarize previous work on querying this uncertain data model and show how to apply the corresponding techniques to several data mining tasks, exemplified through use cases on two running examples.

Data Mining ◽  
2013 ◽  
pp. 669-691


Data Mining ◽  
2013 ◽  
pp. 1-27
Author(s):  
Sangeetha Kutty ◽  
Richi Nayak ◽  
Tien Tran

With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information in these documents. Data mining techniques can be used to derive this information. However, because these documents are semi-structured, the mining of XML documents is affected by the data model used to represent them. In this chapter, we present an overview of the various models for representing XML documents, how these models are used for mining, and some of the issues and challenges inherent in these models. In addition, this chapter provides some insights into future data models of XML documents that effectively capture their two important features, structure and content, for mining.


2017 ◽  
Author(s):  
Antoine Amarilli ◽  
Pierre Senellart

A number of uncertain data models have been proposed, based on the notion of compact representations of probability distributions over possible worlds. In probabilistic relational models, tuples are annotated with probabilities or formulae over Boolean random variables. In probabilistic XML models, XML trees are augmented with nodes that specify probability distributions over their children. Both kinds of models have been extensively studied, with respect to their expressive power, compactness, and query efficiency, among other things. Probabilistic database systems have also been implemented, in both relational and XML settings. However, these studies have mostly been carried out independently and the translations between relational and XML models, as well as the impact for probabilistic relational databases of results about query complexity in probabilistic XML and vice versa, have not been made explicit: we detail such translations in this article, in both directions, study their impact in terms of complexity results, and present interesting open issues about the connections between relational and XML probabilistic data models.


Author(s):  
L. Liu ◽  
S. Zlatanova ◽  
Q. Zhu ◽  
K. Li

This paper introduces and compares two GML-based data standards for indoor location-based services, IndoorGML and IndoorLocationGML. By elaborating the advantages of both standards and their data models, we conclude that the two data standards are complementary. A joint data model is presented to show the integration of the two standards. IndoorGML can supply the subdivision of a building for IndoorLocationGML data, and the semantics of locations defined in IndoorLocationGML can be added to IndoorGML. We propose two use cases as a first attempt at combining the two standards. The first collects details for an indoor path from files of the two standards; the second generates verbal directions for indoor guidance from files of the two standards. Directions for future work are given, such as the automatic integration of separate data from both standards.


Author(s):  
Qin Ding ◽  
Gnanasekaran Sundarraj

With the growing usage of XML on the World Wide Web and elsewhere as a standard for exchanging data and representing semi-structured data, there is a pressing need for tools and techniques to perform data mining on XML documents and XML repositories. In this chapter, we propose a framework for association rule mining on XML data. We present a Java-based implementation of the Apriori and the FP-Growth algorithms for this task and compare their performance. We also compare the performance of our implementation with that of an XQuery-based implementation.


Author(s):  
Orsolya Takács ◽  
Annamária R. Várkonyi-Kóczy

The model used to represent information during information processing can affect the achievable accuracy and can determine the usability of different calculation methods. The data model must also be able to represent the uncertainty and inaccuracy of both input data and results. The two most popular data models for representing uncertain data are the "classical" probability-based model and the recently introduced fuzzy data model. Both data models have their own calculation and data processing methods, but with the increasing complexity of calculation problems, a method for the mixed use of these data models is needed. This paper deals with possible solutions for information processing based on mixed data models and examines the different conversion methods between fuzzy and probability-theory-based data models.


2021 ◽  
pp. 1-25
Author(s):  
Yu-Chin Hsu ◽  
Ji-Liang Shiu

Under a Mundlak-type correlated random effect (CRE) specification, we first show that the average likelihood of a parametric nonlinear panel data model is the convolution of the conditional distribution of the model and the distribution of the unobserved heterogeneity. Hence, the distribution of the unobserved heterogeneity can be recovered by means of a Fourier transformation without imposing a distributional assumption on the CRE specification. We subsequently construct a semiparametric family of average likelihood functions of observables by combining the conditional distribution of the model and the recovered distribution of the unobserved heterogeneity, and show that the parameters in the nonlinear panel data model and in the CRE specification are identifiable. Based on the identification result, we propose a sieve maximum likelihood estimator. Compared with the conventional parametric CRE approaches, the advantage of our method is that it is not subject to misspecification on the distribution of the CRE. Furthermore, we show that the average partial effects are identifiable and extend our results to dynamic nonlinear panel data models.


2021 ◽  
Author(s):  
Matthias Held ◽  
Grit Laudel ◽  
Jochen Gläser

In this paper we utilize an opportunity to construct ground truths for topics in the field of atomic, molecular and optical physics. Our research questions in this paper focus on (i) how to construct a ground truth for topics and (ii) the suitability of common algorithms applied to bibliometric networks to reconstruct these topics. We use the ground truths to test two data models (direct citation and bibliographic coupling) with two algorithms (the Leiden algorithm and the Infomap algorithm). Our results are discomforting: none of the four combinations leads to a consistent reconstruction of the ground truths. No combination of data model and algorithm simultaneously reconstructs all micro-level topics at any resolution level. Meso-level topics are not reconstructed at all. This suggests (a) that we are currently unable to predict which combination of data model, algorithm and parameter setting will adequately reconstruct which (types of) topics, and (b) that a combination of several data models, algorithms and parameter settings appears to be necessary to reconstruct all or most topics in a set of papers.

