Learning Manifolds from Non-stationary Streams

Author(s):  
Suchismit Mahapatra ◽  
Varun Chandola

Abstract Streaming adaptations of manifold-learning-based dimensionality reduction methods, such as Isomap, assume that a small initial batch of observations is enough to learn the manifold exactly, while the remaining streaming instances can be cheaply mapped onto this manifold. However, there are no theoretical results showing that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur while the data is streaming. We present theoretical results showing that the quality of the learned manifold converges asymptotically as the size of the data increases. We then show that a Gaussian Process Regression (GPR) model that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size can closely approximate state-of-the-art streaming Isomap algorithms. The predictive variance obtained from the GPR prediction is then shown to be an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn lower-dimensional representations of high-dimensional data in a streaming setting, while identifying shifts in the generative distribution.
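
A minimal sketch of the change-detection idea described above, assuming a standard RBF kernel in place of the paper's manifold-specific kernel and scikit-learn's GaussianProcessRegressor; the noisy helix data and the variance threshold are illustrative choices, not the authors' setup.

```python
# Hedged sketch: detecting distribution shift with GPR predictive variance.
# The paper uses a manifold-specific kernel; a standard RBF kernel stands in here.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Initial batch drawn from the "stationary" region of a 1-D manifold embedded
# in 3-D (a noisy helix segment), together with its low-dimensional coordinate t.
t_batch = rng.uniform(0.0, 4.0, size=200)
X_batch = np.column_stack([np.cos(t_batch), np.sin(t_batch), 0.3 * t_batch])
X_batch += 0.01 * rng.normal(size=X_batch.shape)

# Train a GPR that maps high-dimensional points to the manifold coordinate.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-3),
                               normalize_y=True)
gpr.fit(X_batch, t_batch)

def map_stream_point(x, var_threshold=0.05):
    """Map a streaming point; flag it if the predictive variance is too high."""
    mean, std = gpr.predict(x.reshape(1, -1), return_std=True)
    drifted = std[0] ** 2 > var_threshold
    return mean[0], std[0] ** 2, drifted

# A point near the training region vs. one far outside it (simulated drift).
in_dist = np.array([np.cos(2.0), np.sin(2.0), 0.6])
out_dist = np.array([np.cos(9.0), np.sin(9.0), 2.7])
print(map_stream_point(in_dist))   # low predictive variance, not flagged
print(map_stream_point(out_dist))  # high predictive variance, flagged as possible drift
```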

2012 ◽  
pp. 163-186
Author(s):  
Jirí Krupka ◽  
Miloslava Kašparová ◽  
Pavel Jirava ◽  
Jan Mandys

The chapter presents the problem of modeling quality of life in the Czech Republic using classification methods. It compares two methodological approaches: the first follows the Institute of Sociology of the Academy of Sciences of the Czech Republic, while the second concerns a project of the civic association Team Initiative for Local Sustainable Development. On the basis of real data sets from the institute and the team initiative, the authors synthesized and analyzed quality-of-life classification models. They used decision tree algorithms to generate transparent decision rules and compared the classification results of the resulting trees. Classifier models based on the C5.0, CHAID, C&RT and boosted C5.0 algorithms were proposed and analyzed. The designed classification model was created in Clementine.
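
As an illustration of the kind of transparent decision-tree models the chapter compares, the hedged sketch below uses scikit-learn's CART implementation and synthetic placeholder data; the chapter itself builds C5.0, CHAID and C&RT models in SPSS Clementine on real survey data.

```python
# Hedged sketch: a CART-style decision tree with transparent rules, standing in
# for the C5.0/CHAID/C&RT models the chapter builds in Clementine.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic placeholder for quality-of-life indicators and a class label.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))

# Transparent decision rules, analogous to the rule sets the chapter compares.
print(export_text(tree, feature_names=[f"indicator_{i}" for i in range(6)]))
```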


2005 ◽  
Vol 15 (04) ◽  
pp. 379-401 ◽  
Author(s):  
STEFAN FUNKE ◽  
THEOCHARIS MALAMATOS ◽  
RAHUL RAY

We consider the problem of computing large connected regions in a triangulated terrain of size n for which the normals of the triangles deviate by at most some small fixed angle. In previous work an exact near-quadratic algorithm was presented, but only a heuristic implementation with no guarantee was practicable. We present a new approximation algorithm for the problem which runs in O(n/ε^2) time and (apart from giving a guarantee on the quality of the produced solution) has been implemented and shows good performance on real data sets representing fracture surfaces consisting of around half a million triangles. We further present a simple approximation algorithm for a related problem: given a set of n points in the plane, determine the placement of the unit disk that contains the most points. This algorithm also runs in linear time.
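
The sketch below is not the authors' algorithm, but a simple grid-bucketing illustration of how a linear-time, constant-factor approximation for the unit-disk placement problem can work, assuming "unit disk" means radius one: a unit grid cell fits inside a radius-1 disk centered at the cell's center, so the densest cell yields a lower bound on the optimum.

```python
# Hedged sketch (not the authors' algorithm): a linear-time, constant-factor
# approximation for placing a radius-1 disk covering many points. Points are
# bucketed into unit grid cells; the densest cell fits inside a radius-1 disk
# centered at that cell's center, so its count lower-bounds the optimum.
from collections import Counter
import random

def approx_best_unit_disk(points):
    """Return a disk center and a lower bound on the best achievable count."""
    cells = Counter((int(x // 1), int(y // 1)) for x, y in points)
    (cx, cy), count = cells.most_common(1)[0]
    center = (cx + 0.5, cy + 0.5)   # center of the densest unit cell
    return center, count

random.seed(0)
pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(10000)]
print(approx_best_unit_disk(pts))
```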


2018 ◽  
Vol 20 (6) ◽  
pp. 2055-2065 ◽  
Author(s):  
Johannes Brägelmann ◽  
Justo Lorenzo Bermejo

Abstract Technological advances and the reduced cost of high-density methylation arrays have led to an increasing number of association studies on the possible relationship between human disease and epigenetic variability. DNA samples from peripheral blood or other tissue types are analyzed in epigenome-wide association studies (EWAS) to detect methylation differences related to a particular phenotype. Since information on the cell-type composition of the sample is generally not available and methylation profiles are cell-type specific, statistical methods have been developed to adjust for cell-type heterogeneity in EWAS. In this study we systematically compared five popular adjustment methods: the factored spectrally transformed linear mixed model (FaST-LMM-EWASher), the sparse principal component analysis algorithm ReFACTor, surrogate variable analysis (SVA), independent SVA (ISVA) and an optimized version of SVA (SmartSVA). We used real data and applied a multilayered simulation framework to assess the type I error rate, the statistical power and the quality of estimated methylation differences according to major study characteristics. While all five adjustment methods improved false-positive rates compared with unadjusted analyses, FaST-LMM-EWASher yielded the lowest type I error rate at the expense of low statistical power. SVA efficiently corrected for cell-type heterogeneity in EWAS with up to 200 cases and 200 controls, but did not control type I error rates in larger studies. Results based on real data sets confirmed the simulation findings, with the strongest control of type I error rates by FaST-LMM-EWASher and SmartSVA. Overall, ReFACTor, ISVA and SmartSVA showed comparably good statistical power, quality of estimated methylation differences and runtime.
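
To make the confounding mechanism concrete, here is a hedged simulation sketch in the spirit of the study's framework: methylation is driven by an unobserved cell-type fraction that also correlates with case/control status, and a single principal component stands in for the surrogate-variable methods compared above. The data-generating choices are mine, not the paper's.

```python
# Hedged sketch: why cell-type heterogeneity inflates type I error in EWAS and
# how adjusting for an estimated component helps. Plain PCA on the methylation
# matrix stands in for the surrogate-variable methods compared in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 200, 1000                       # samples, CpG sites

# Confounder: a latent cell-type fraction that also shifts case/control status.
cell_frac = rng.beta(2, 2, size=n)
phenotype = (cell_frac + 0.3 * rng.normal(size=n) > 0.5).astype(float)

# Methylation depends on the cell fraction only (no true phenotype effect -> all nulls).
loadings = rng.normal(size=p)
meth = np.outer(cell_frac, loadings) + rng.normal(scale=1.0, size=(n, p))

def type1_rate(adjust):
    pvals = []
    X = np.column_stack([np.ones(n), phenotype])
    if adjust:
        # First principal component as a crude surrogate for cell composition.
        centered = meth - meth.mean(axis=0)
        pc1 = np.linalg.svd(centered, full_matrices=False)[0][:, 0]
        X = np.column_stack([X, pc1])
    for j in range(p):
        beta, res, *_ = np.linalg.lstsq(X, meth[:, j], rcond=None)
        dof = n - X.shape[1]
        sigma2 = res[0] / dof
        cov = sigma2 * np.linalg.inv(X.T @ X)
        t = beta[1] / np.sqrt(cov[1, 1])
        pvals.append(2 * stats.t.sf(abs(t), dof))
    return np.mean(np.array(pvals) < 0.05)

print("unadjusted type I error:", type1_rate(False))   # inflated above 0.05
print("PC-adjusted type I error:", type1_rate(True))   # close to the nominal 0.05
```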


Author(s):  
J. DIEBOLT ◽  
M.-A. EL-AROUI ◽  
V. DURBEC ◽  
B. VILLAIN

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimates. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQs) in a nonparametric spirit, e.g., Pickands' excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique is available for assessing and comparing these MEEQs when they are to be used on a given data set. This paper is a first attempt to provide such techniques. We first compare the estimates given by the main MEEQs on several simulated data sets. We then suggest goodness-of-fit (GoF) tests to assess the MEEQs by measuring the quality of their underlying approximations. It is shown that GoF techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQs are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are applied to real data sets from an industrial context where extreme quantiles are needed to define maintenance policies.
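
As one concrete example of the MEEQ family based on Hill's estimate, the sketch below combines the Hill tail-index estimator with Weissman-type extrapolation; the choice of k and the Pareto test sample are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch: the Hill estimate of the Pareto tail index and the resulting
# extreme-quantile estimate, one of the MEEQ families discussed in the paper.
import numpy as np

def hill_quantile(sample, k, prob):
    """Estimate the quantile at level `prob` using the top-k order statistics."""
    x = np.sort(np.asarray(sample))[::-1]          # descending order
    x_k = x[k]                                     # (k+1)-th largest value
    gamma = np.mean(np.log(x[:k]) - np.log(x_k))   # Hill tail-index estimate
    n = len(x)
    # Weissman-type extrapolation beyond the sample range.
    return x_k * (k / (n * (1.0 - prob))) ** gamma

rng = np.random.default_rng(3)
data = rng.pareto(a=2.5, size=5000) + 1.0          # heavy-tailed sample
print(hill_quantile(data, k=200, prob=0.999))
print(np.quantile(data, 0.999))                    # empirical comparison
```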


Biostatistics ◽  
2018 ◽  
Vol 20 (4) ◽  
pp. 615-631 ◽  
Author(s):  
Ekaterina Smirnova ◽  
Snehalata Huzurbazar ◽  
Farhad Jafari

Summary The human microbiota composition is associated with a number of diseases, including obesity, inflammatory bowel disease and bacterial vaginosis. Microbiome research therefore has the potential to reshape clinical and therapeutic approaches. However, raw microbiome count data require careful pre-processing steps that take into account both the sparsity of counts and the large number of taxa being measured. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. Despite progress in the number and quality of filtering approaches, there is no consensus on filtering standards and quality assessment. This can adversely affect downstream analyses and the reproducibility of results across platforms and software. We introduce PERFect, a novel permutation filtering approach designed to address two unsolved problems in microbiome data processing: (i) defining and quantifying the loss due to filtering by implementing thresholds, and (ii) introducing and evaluating a permutation test for filtering loss to provide a measure of excessive filtering. The methods are assessed on three “mock experiment” data sets, where the true taxa compositions are known, and are applied to two publicly available real microbiome data sets. The method correctly removes contaminant taxa in the “mock” data sets and quantifies and visualizes the corresponding filtering loss, providing a uniform, data-driven filtering criterion for real microbiome data sets. In real data analyses PERFect tends to remove more taxa than existing approaches; this likely happens because the method is based on an explicit loss function, uses statistically principled testing, and takes into account correlation between taxa. The PERFect software is freely available at https://github.com/katiasmirn/PERFect.
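
For context, the sketch below shows a basic prevalence/abundance filter of the kind PERFect is designed to improve upon; it is not the PERFect permutation test, and the simple count-based "filtering loss" reported here is only a stand-in for the covariance-based loss the paper defines.

```python
# Hedged sketch: a naive prevalence/abundance filter, NOT the PERFect method.
import numpy as np
import pandas as pd

def simple_filter(otu_table, min_prevalence=0.1, min_total_count=10):
    """Drop taxa present in too few samples or with too small total counts.

    otu_table: DataFrame with samples as rows and taxa as columns.
    """
    prevalence = (otu_table > 0).mean(axis=0)
    totals = otu_table.sum(axis=0)
    keep = (prevalence >= min_prevalence) & (totals >= min_total_count)
    # Report loss simply as the share of total counts removed; PERFect instead
    # defines filtering loss through taxa covariance and a permutation test.
    loss = 1.0 - otu_table.loc[:, keep].values.sum() / otu_table.values.sum()
    return otu_table.loc[:, keep], loss

rng = np.random.default_rng(4)
abundance = rng.gamma(shape=0.3, scale=2.0, size=200)     # many rare taxa
counts = pd.DataFrame(rng.poisson(abundance, size=(50, 200)),
                      columns=[f"taxon_{j}" for j in range(200)])
filtered, loss = simple_filter(counts)
print(filtered.shape, "filtering loss:", round(loss, 3))
```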


2018 ◽  
Vol 29 (1) ◽  
pp. 529-539
Author(s):  
Khalid Jebari ◽  
Abdelaziz Elmoujahid ◽  
Aziz Ettouhami

Abstract Fuzzy c-means is an efficient algorithm that is widely used for data clustering. Nonetheless, when using this algorithm, the designer faces two crucial choices: choosing the optimal number of clusters and initializing the cluster centers. Both choices have a direct impact on the clustering outcome. This paper presents an improved algorithm, called automatic genetic fuzzy c-means, that evolves the number of clusters and provides the initial centroids. The proposed algorithm uses a genetic algorithm with a new crossover operator, a new mutation operator and a modified tournament selection; further, it defines a new fitness function based on three cluster validity indices. Real data sets are used to demonstrate the effectiveness, in terms of quality, of the proposed algorithm.
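
A hedged sketch of the plain fuzzy c-means updates that such a genetic wrapper would drive; the fixed number of clusters and the random initial centroids below are exactly the quantities the proposed algorithm evolves, and the GA components themselves are omitted here.

```python
# Hedged sketch: core fuzzy c-means updates. A GA-based wrapper (as in the paper)
# would supply the number of clusters `k` and the initial `centers`.
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(n_iter):
        # Membership update: u_ij proportional to inverse distance ratios.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Centroid update: weighted mean with memberships raised to power m.
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ((0, 0), (3, 3), (0, 3))])
centers, memberships = fuzzy_c_means(X, k=3)
print(centers)
```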


Author(s):  
Anh Duy Tran ◽  
Somjit Arch-int ◽  
Ngamnij Arch-int

Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures are of significant importance for data dependencies in data mining. To accommodate exceptions in real data, such measures are used to relax the strictness of CFDs into more general dependencies, called approximate conditional functional dependencies (ACFDs). This paper analyzes the weaknesses of the dependency degree, confidence and conviction measures for general CFDs (constant and variable CFDs). A new measure for general CFDs, based on incomplete knowledge granularity, is proposed to measure the approximation of these dependencies as well as the distribution of data tuples among the conditional equivalence classes. Finally, the effectiveness of stripped conditional partitions and the new measure is evaluated on synthetic and real data sets. These results are important for the theory of approximate dependencies and for improving discovery algorithms for CFDs and ACFDs.
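
For reference, the sketch below computes the classic confidence measure for a constant CFD, one of the baselines whose weaknesses the paper analyzes; the toy relation and pattern are illustrative, and the paper's granularity-based measure is not reproduced here.

```python
# Hedged sketch: confidence of a constant CFD, i.e. the largest fraction of
# pattern-matching tuples that can be kept so the dependency holds exactly.
import pandas as pd

def cfd_confidence(df, lhs_pattern, lhs_attrs, rhs_attr):
    """CFD: tuples matching `lhs_pattern` and agreeing on `lhs_attrs`
    must agree on `rhs_attr`. Returns a confidence value in [0, 1]."""
    matching = df.loc[(df[list(lhs_pattern)] == pd.Series(lhs_pattern)).all(axis=1)]
    if matching.empty:
        return 1.0
    # For each LHS equivalence class keep the tuples with the majority RHS value.
    kept = matching.groupby(lhs_attrs)[rhs_attr].apply(
        lambda s: s.value_counts().iloc[0]).sum()
    return kept / len(matching)

data = pd.DataFrame({
    "country": ["UK", "UK", "UK", "UK", "NL"],
    "zip":     ["EH4", "EH4", "EH4", "W1",  "1011"],
    "city":    ["Edinburgh", "Edinburgh", "London", "London", "Amsterdam"],
})
# Constant CFD: within country = 'UK', zip determines city.
print(cfd_confidence(data, {"country": "UK"}, ["country", "zip"], "city"))  # 0.75
```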


2018 ◽  
Vol 10 (10) ◽  
pp. 1559
Author(s):  
Xin Tian ◽  
Mi Jiang ◽  
Ruya Xiao ◽  
Rakesh Malhotra

The adaptive Goldstein filter driven by InSAR coherence is one of the best-known frequency-domain filters and has been widely used to improve the quality of InSAR measurements with different noise characteristics. However, the filtering power is biased to varying degrees due to the biased coherence estimator and the empirical modelling of the filtering power for a given coherence level, leading to under- or over-estimation of phase noise over the entire dataset. Here, the authors present a method to correct the filtering power on the basis of the second-kind statistical coherence estimator. In contrast with the conventional estimator, the new estimator has smaller bias and variance, and therefore provides more accurate coherence observations. In addition, a piece-wise function model determined from Monte Carlo simulation is used to compensate for the nonlinear relationship between the filtering parameter and coherence. The method was tested on both synthetic and real data sets and the results were compared against those derived from other state-of-the-art filters. The better performance of the new filter in terms of edge preservation and residue reduction demonstrates the value of the method.
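
A hedged sketch of the classic Goldstein patch filter that this work builds on: the patch spectrum is weighted by its smoothed magnitude raised to a power alpha. The coherence-to-alpha mapping below is a simple placeholder, not the paper's debiased second-kind estimator or its Monte Carlo-derived piece-wise model.

```python
# Hedged sketch of the classic Goldstein patch filter; the paper's contribution
# (a debiased coherence estimate feeding a piece-wise alpha model) is only
# represented by the placeholder function alpha_from_coherence below.
import numpy as np
from scipy.ndimage import uniform_filter

def goldstein_patch(phase_patch, alpha):
    """Filter one complex interferogram patch with filtering power alpha."""
    spec = np.fft.fft2(phase_patch)
    smoothed_mag = uniform_filter(np.abs(spec), size=3)   # smoothed spectrum magnitude
    weighted = spec * smoothed_mag ** alpha               # boost dominant fringes
    return np.fft.ifft2(weighted)

def alpha_from_coherence(coherence):
    """Placeholder for a coherence-to-alpha model: strong filtering at low
    coherence, little filtering at high coherence."""
    return np.clip(1.0 - coherence, 0.0, 1.0)

rng = np.random.default_rng(6)
rows, cols = np.mgrid[0:32, 0:32]
clean = np.exp(1j * 0.4 * cols)                           # simple fringe pattern
noisy = clean * np.exp(1j * 0.8 * rng.normal(size=clean.shape))
filtered = goldstein_patch(noisy, alpha_from_coherence(0.4))
print(np.angle(filtered[16, 16]), np.angle(clean[16, 16]))
```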


2020 ◽  
Vol 21 (2) ◽  
Author(s):  
Bogumiła Hnatkowska ◽  
Zbigniew Huzar ◽  
Lech Tuzinkiewicz

A conceptual model is a high-level, graphical representation of a specific domain, presenting its key concepts and the relationships between them. In particular, these dependencies can be inferred from concept instances that are part of big raw data files. The paper proposes a method for constructing a conceptual model from data frames encompassed in data files. The result is presented in the form of a class diagram. The method is explained with several examples and verified by a case study in which real data sets are processed. It can also be applied to check the quality of a data set.
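
As a rough illustration (not the authors' method), the sketch below derives a textual class description from a data frame by reading column names, inferred types and simple key candidates; the paper instead produces a full UML class diagram with relationships between classes.

```python
# Hedged illustration: a minimal class description inferred from a data frame.
import pandas as pd

def sketch_class(df, class_name):
    lines = [f"class {class_name} {{"]
    for col in df.columns:
        dtype = pd.api.types.infer_dtype(df[col], skipna=True)
        tag = "  <<id candidate>>" if df[col].is_unique else ""
        lines.append(f"    {col}: {dtype}{tag}")
    lines.append("}")
    return "\n".join(lines)

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Ada", "Ben", "Ada"],
    "total":    [10.5, 20.0, 7.25],
})
print(sketch_class(orders, "Order"))
```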


Author(s):  
CHUN-GUANG LI ◽  
JUN GUO ◽  
BO XIAO

In this paper, a novel method to estimate the intrinsic dimensionality of a high-dimensional data set is proposed. Based on neighborhood information, the method calculates non-negative locally linear reconstruction coefficients for each data point from its neighbors, and the number of dominant positive reconstruction coefficients is regarded as a faithful guide to the intrinsic dimensionality of the data set. The proposed method requires no parametric assumption on the data distribution and is easy to implement within the general framework of manifold learning. Experimental results on several synthetic and real data sets show the benefits of the proposed method.
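
A loose sketch of the idea, under my own assumptions about thresholding and aggregation (not the authors' exact procedure): each point is reconstructed from its nearest neighbors with non-negative least squares, the weights are normalized, and the dominant ones are counted.

```python
# Hedged, loose sketch of the approach: count dominant non-negative
# reconstruction weights per point; threshold and aggregation are assumptions.
import numpy as np
from scipy.optimize import nnls
from sklearn.neighbors import NearestNeighbors

def dominant_weight_counts(X, k=12, threshold=0.05):
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    counts = []
    for i, neigh in enumerate(idx):
        A = X[neigh[1:]].T                      # neighbors as columns (skip the point itself)
        w, _ = nnls(A, X[i])                    # non-negative reconstruction weights
        w = w / (w.sum() + 1e-12)
        counts.append(int(np.sum(w > threshold)))
    return np.array(counts)

# A 2-D manifold embedded nonlinearly in 6-D.
rng = np.random.default_rng(7)
t, s = rng.uniform(0, 3, size=(2, 1500))
X = np.column_stack([t, s, np.sin(t), np.cos(s), t * s, np.sin(t + s)])
counts = dominant_weight_counts(X)
print("median dominant-weight count:", np.median(counts))  # data has intrinsic dimension 2
```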

