Bayesian inversion and visualization of hierarchical geostatistical models

Author(s):  
Sebastian Reuschen ◽  
Teng Xu ◽  
Wolfgang Nowak

<p>Geostatistical inversion methods estimate the spatial distribution of heterogeneous soil properties (here: hydraulic conductivity) from indirect information (here: piezometric heads). Bayesian inversion is a specific approach in which prior assumptions (or prior models) are combined with indirect measurements to predict soil parameters and their uncertainty in the form of a posterior parameter distribution. Posterior distributions depend heavily on prior models, as prior models describe the spatial structure of heterogeneity. The most common prior is the stationary multi-Gaussian model, which expresses that close-by points are more strongly correlated than distant points. This is a good assumption for single-facies systems. For multi-facies systems, multiple-point geostatistical (MPS) methods are widely used. However, these typically only distinguish between several facies and do not represent the internal heterogeneity inside each facies.</p><p>We combine these two approaches into a joint hierarchical model, which results in a multi-facies system with internal heterogeneity in each facies. Using this model, we propose a tailored Gibbs sampler, a kind of Markov chain Monte Carlo (MCMC) method, to perform Bayesian inversion and sample from the resulting posterior parameter distribution. We test our method on a synthetic channelized flow scenario for different levels of available data: a highly informative setting (with many measurements), where we recover the synthetic truth with relatively small uncertainty intervals, and a weakly informative setting (with only a few measurements), where the synthetic truth cannot be recovered as clearly. Instead, we obtain a multi-modal posterior. We investigate the multi-modal posterior using a clustering algorithm. Clustering algorithms are a common machine learning approach to finding structure in large data sets. Using this approach, we can split the multi-modal posterior into its modes and assign a probability to each mode.
A visualization of this clustering and the corresponding probabilities enables researchers and engineers to intuitively understand complex parameter distributions and their uncertainties.</p>
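The step of assigning a probability to each posterior mode can be illustrated with a minimal sketch. It assumes that a clustering algorithm has already labeled every posterior sample with a mode index, and it estimates each mode's probability as its share of the samples. The function name `mode_probabilities` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def mode_probabilities(labels):
    """Estimate each mode's probability as its share of the posterior samples."""
    counts = Counter(labels)
    total = len(labels)
    return {mode: n / total for mode, n in counts.items()}


# Hypothetical cluster labels for ten posterior samples (illustration only).
labels = [0, 0, 1, 0, 1, 2, 0, 1, 0, 0]
probs = mode_probabilities(labels)
```

This estimate is only meaningful if the samples are representative draws from the posterior, for example a converged MCMC chain.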

2021 ◽  
Author(s):  
Sebastian Reuschen ◽  
Teng Xu ◽  
Fabian Jobst ◽  
Wolfgang Nowak

<p>Geostatistical inference (or inversion) methods are commonly used to estimate the spatial distribution of heterogeneous soil properties (e.g., hydraulic conductivity) from indirect measurements (e.g., piezometric heads). One approach is to use Bayesian inversion to combine prior assumptions (prior models) with indirect measurements to predict soil parameters and their uncertainty, which can be expressed in the form of a posterior parameter distribution. This approach is mathematically rigorous and elegant, but it has a disadvantage: in realistic settings, analytical solutions do not exist, and numerical evaluation via Markov chain Monte Carlo (MCMC) methods can become computationally prohibitive. Especially when treating spatially distributed parameters for heterogeneous materials, constructing efficient MCMC methods is a major challenge.</p><p>Here, we present two novel MCMC methods that extend and combine existing MCMC algorithms to speed up convergence for spatial parameter fields. First, we present the <em>sequential pCN-MCMC</em>, which is a combination of the <em>sequential Gibbs sampler</em> and the <em>pCN-MCMC</em>. This <em>sequential pCN-MCMC</em> is more efficient (faster convergence) than existing methods. It can be used for Bayesian inversion of multi-Gaussian prior models, which are often used in single-facies systems. Second, we present the <em>parallel-tempering sequential Gibbs MCMC</em>. This MCMC variant enables realistic inversion of multi-facies systems. By this, we mean systems with several facies in which we model the spatial position of facies (via training images and multiple-point geostatistics) and the internal heterogeneity per facies (via multi-Gaussian fields).
The proposed MCMC version is the first efficient method to find the posterior parameter distribution for such multi-facies systems with internal heterogeneities.</p><p>We demonstrate the applicability and efficiency of the two proposed methods on hydro-geological synthetic test problems and show that they outperform existing state-of-the-art MCMC methods. With the two proposed MCMCs, we enable modellers to perform (1) faster Bayesian inversion of multi-Gaussian random fields for single-facies systems and (2) Bayesian inversion of more realistic fields for multi-facies systems with internal heterogeneity at affordable computational effort.</p>
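For orientation, the standard pCN-MCMC building block named in the abstract can be sketched as follows. This is the plain preconditioned Crank-Nicolson step for a standard-normal prior, not the authors' sequential pCN-MCMC or their parallel-tempering variant. The toy likelihood, the step size `beta`, and all function names are assumptions made for illustration.

```python
import math
import random


def pcn_step(x, log_likelihood, beta, rng):
    """One plain pCN step, assuming a standard-normal prior N(0, I) on x."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    scale = math.sqrt(1.0 - beta ** 2)
    # The proposal preserves the prior, so only the likelihood ratio is needed.
    proposal = [scale * xi + beta * zi for xi, zi in zip(x, noise)]
    log_alpha = log_likelihood(proposal) - log_likelihood(x)
    # 1 - random() lies in (0, 1], which avoids log(0).
    if math.log(1.0 - rng.random()) < log_alpha:
        return proposal, True
    return x, False


def log_likelihood(x):
    """Toy likelihood (assumption): one observation y = 1.0 with noise std 0.5."""
    return -0.5 * ((x[0] - 1.0) / 0.5) ** 2


rng = random.Random(0)
x = [0.0]
samples = []
for _ in range(20000):
    x, _accepted = pcn_step(x, log_likelihood, beta=0.5, rng=rng)
    samples.append(x[0])

# For this conjugate toy problem the analytical posterior mean is 0.8.
posterior_mean = sum(samples) / len(samples)
```

Because the pCN proposal leaves the Gaussian prior invariant, the acceptance probability depends only on the likelihood ratio.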


2017 ◽  
Vol 7 (1.3) ◽  
pp. 37
Author(s):  
Joy Christy A.

Data mining refers to the extraction of meaningful knowledge from large data sources, which may contain hidden potential facts. In general, data mining analysis can be either predictive or descriptive. Predictive analysis draws inferences from existing results in order to identify future outputs, whereas descriptive analysis interprets the intrinsic characteristics or nature of the data. Clustering is a descriptive analysis technique that groups objects of similar types in such a way that objects in a cluster are closer to each other than to the objects of other clusters. K-means is the most popular and widely used clustering algorithm. It starts by selecting k random initial centroids, where k is the number of clusters given by the user. It then computes the distance between the initial centroids and the remaining data objects and assigns each data object to the cluster centroid at minimum distance. This process is repeated until there is no change in the cluster centroids or cluster members. However, k-means still suffers from several issues, such as choosing the optimal value of k, random initial centroids, an unknown number of iterations, finding globally optimal clusters and, more importantly, creating meaningful clusters when analysing datasets from various domains. The accuracy of clustering should never be compromised. Thus, in this paper, a novel classification-via-clustering algorithm called Iterative Linear Regression Clustering with Percentage Split Distribution (ILRCPSD) is introduced as an alternative solution to the problems encountered in traditional clustering algorithms. The proposed algorithm is examined on an educational dataset to identify hidden groups of students with similar cognitive and competency skills.
The performance of the proposed algorithm is compared with the accuracy of traditional k-means clustering in terms of building meaningful clusters, in order to demonstrate its real-time usefulness.
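The k-means procedure described in this abstract (random initial centroids, assignment to the nearest centroid, repetition until the centroids stop changing) can be sketched in plain Python. This is the traditional baseline only, not the ILRCPSD algorithm. The toy points and the `max_iter` cap are assumptions added for illustration.

```python
import random


def kmeans(points, k, rng, max_iter=100):
    """Plain k-means on tuples of numbers, using squared Euclidean distance."""
    centroids = rng.sample(points, k)  # k random initial centroids
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change, so the process stops
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated toy groups (illustration only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, rng=random.Random(0))
```

On data that is less cleanly separated, the result depends on the random initial centroids, which is one of the issues the abstract lists.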


2020 ◽  
pp. 1-12
Author(s):  
Xiaoguang Gao

An unbalanced development strategy leads to unbalanced regional development. Therefore, in the development process, resources must be utilized effectively according to the level and characteristics of each region. Taking resource and environmental constraints into account, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by describing the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for large high-dimensional data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines the clusters. In addition, an experimental comparison on the data set shows that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.


2019 ◽  
Vol 22 (1) ◽  
pp. 55-58
Author(s):  
Nahla Ibraheem Jabbar

Our proposed method is used to overcome the drawbacks of computing parameter values in the mountain algorithm for image clustering. All existing clustering algorithms require parameter values to start the clustering process, and computing these parameters is a major problem for such algorithms. One well-known clustering method is the mountain algorithm, which gives the expected number of clusters. In this paper we present a new modification of mountain clustering called Spatial Modification in the Parameters of Mountain Image Clustering Algorithm. The modification uses the spatial information of the image: a window mask is taken around each center pixel to compute the distance between the pixel and its neighborhood, in order to estimate the values of the parameters σ and β, which give a potentially optimal number of clusters required in the image segmentation process. Our experiments show the ability of the proposed algorithm in brain image segmentation, with good quality on large data sets.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Dan Zhang ◽  
Yingcang Ma ◽  
Hu Zhao ◽  
Xiaofei Yang

Clustering is one of the important research topics in the field of machine learning. Neutrosophic clustering is a generalization of fuzzy clustering and has been applied in many fields. This paper presents a new neutrosophic clustering algorithm based on regularization. Firstly, a regularization term is introduced into the FC-PFS algorithm to generate sparsity, which can reduce the complexity of the algorithm on large data sets. Secondly, we propose a method to simplify the process of determining the regularization parameters. Finally, experiments show that the clustering results of this algorithm on artificial and real data sets are mostly better than those of other clustering algorithms. Our clustering algorithm is effective in most cases.


2018 ◽  
Author(s):  
Xiaoxin Ye ◽  
Joshua W. K. Ho

Abstract: Flow cytometry is a popular technology for quantitative single-cell profiling of cell surface markers. It enables expression measurement of tens of cell surface protein markers in millions of single cells. It is a powerful tool for discovering cell sub-populations and quantifying cell population heterogeneity. Traditionally, scientists use manual gating to identify cell types, but the process is subjective and is not effective for large multidimensional data. Many clustering algorithms have been developed to analyse these data, but most of them are not scalable to very large data sets with more than ten million cells. Here, we present a new clustering algorithm that combines the advantages of the density-based clustering algorithm DBSCAN with the scalability of grid-based clustering. This new clustering algorithm is implemented in Python as an open source package, FlowGrid. FlowGrid is memory efficient and scales linearly with respect to the number of cells. We have evaluated the performance of FlowGrid against other state-of-the-art clustering programs and found that FlowGrid produces similar clustering results but in substantially less time. For example, FlowGrid is able to complete a clustering task on a data set of 23.6 million cells in less than 12 seconds, while other algorithms take more than 500 seconds or fail with an error. FlowGrid is an ultrafast clustering algorithm for large single-cell flow cytometry data. The source code is available at https://github.com/VCCRI/FlowGrid.
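The general idea of combining density-based and grid-based clustering can be sketched as follows. This is not FlowGrid's actual algorithm or API. It is a simplified illustration that bins points into grid cells, keeps cells holding at least `min_count` points, and merges adjacent dense cells into clusters. The function name and the parameters `cell_size` and `min_count` are hypothetical.

```python
from collections import Counter, deque
from itertools import product


def grid_cluster(points, cell_size, min_count):
    """Label points by connected components of dense grid cells (-1 = noise)."""
    cells = [tuple(int(c // cell_size) for c in p) for p in points]
    counts = Counter(cells)
    dense = {cell for cell, n in counts.items() if n >= min_count}
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        # Breadth-first search over neighboring dense cells.
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in cell_label:
                    cell_label[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return [cell_label.get(cell, -1) for cell in cells]


# Toy 2-D data (illustration only): two dense regions and one isolated point.
points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
          (5.1, 5.2), (5.3, 5.4), (9.0, 0.0)]
labels = grid_cluster(points, cell_size=1.0, min_count=2)
```

Binning touches each point once, so the cost of that step grows linearly with the number of points. The neighbor search in this sketch visits 3^d cells per dense cell, so it is only practical in low dimensions.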


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. A POS tagger and the Porter stemmer are used for tagging. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity measure value, using a mean threshold. This paper incorporates many parameters for finding the similarity between words. In order to identify the disambiguated words, sense identification is performed for the adjectives and a comparison is carried out. SemCor and machine learning datasets are employed. Compared with previous results for word sense disambiguation (WSD), our work improves considerably, reaching 91.2%.
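Of the two similarity measures named in this abstract, cosine similarity is simple enough to show as a minimal sketch. The term-count vectors below are made up for illustration, and the function name is hypothetical. The WordNet-based Jiang-Conrath measure is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical term-count vectors for two sentences over a 3-word vocabulary.
sentence_a = [1, 1, 0]
sentence_b = [1, 0, 1]
similarity = cosine_similarity(sentence_a, sentence_b)
```

Two sentences would then be grouped together when their similarity exceeds a chosen threshold, such as the mean threshold mentioned in the abstract.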


2015 ◽  
pp. 125-138 ◽  
Author(s):  
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis using a k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later, the term “k-NN graph” and a few algorithms for k-NN clustering appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an “excessive” graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, “upward” clustering, in which clusters are formed (assembled sequentially) one after the other. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
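The k-NN graph underlying this kind of clustering can be sketched as follows. The sketch only builds the graph by brute force. It does not implement the author's “upward” clustering strategy, and the function name and toy points are assumptions for illustration.

```python
def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        # Squared Euclidean distance to every other point, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Toy one-dimensional data (illustration only).
points = [(0.0,), (1.0,), (3.0,), (10.0,)]
graph = knn_graph(points, k=2)
```

The resulting graph is directed: point 3 lists point 2 as a neighbor, but point 2 does not list point 3.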


Author(s):  
Yuancheng Li ◽  
Yaqi Cui ◽  
Xiaolong Zhang

Background: Advanced Metering Infrastructure (AMI) for the smart grid is growing rapidly, which results in exponential growth of the data collected and transmitted in the device. Clustering this data can give the electricity company a better understanding of the personalized and differentiated needs of the user. Objective: The existing clustering algorithms for processing data generally have some problems, such as insufficient data utilization, high computational complexity and low accuracy of behavior recognition. Methods: In order to improve the clustering accuracy, this paper proposes a new clustering method based on the electrical behavior of the user. Starting with an analysis of user load characteristics, the user electricity data samples were constructed. The daily load characteristic curve was extracted through an improved extreme learning machine clustering algorithm and effective index criteria. Moreover, clustering analysis was carried out for different users from industrial, commercial and residential areas. The improved extreme learning machine algorithm, also called Unsupervised Extreme Learning Machine (US-ELM), is an extension and improvement of the original Extreme Learning Machine (ELM), which realizes the unsupervised clustering task on the basis of the original ELM. Results: Experiments were run on four different data sets in MATLAB and compared with other commonly used clustering algorithms. The experimental results show that the US-ELM algorithm has higher accuracy in processing power data. Conclusion: The unsupervised ELM algorithm can greatly reduce the time consumption and improve the effectiveness of clustering.


Author(s):  
M. Tanveer ◽  
Tarun Gupta ◽  
Miten Shah ◽  

Twin Support Vector Clustering (TWSVC) is a clustering algorithm inspired by the principles of the Twin Support Vector Machine (TWSVM). TWSVC has already outperformed other traditional plane-based clustering algorithms. However, TWSVC uses the hinge loss, which maximizes the shortest distance between clusters and hence suffers from noise sensitivity and low re-sampling stability. In this article, we propose Pinball loss Twin Support Vector Clustering (pinTSVC) as a clustering algorithm. The proposed pinTSVC model incorporates the pinball loss function in the plane clustering formulation. The pinball loss function introduces favorable properties such as noise insensitivity and re-sampling stability. The time complexity of the proposed pinTSVC remains equivalent to that of TWSVC. Extensive numerical experiments on noise-corrupted benchmark UCI and artificial datasets are provided. Results of the proposed pinTSVC model are compared with TWSVC, Twin Bounded Support Vector Clustering (TBSVC) and Fuzzy c-means clustering (FCM). Detailed and exhaustive comparisons demonstrate the better performance and generalization of the proposed pinTSVC for noise-corrupted datasets. Further experiments and analysis of the performance of the above-mentioned clustering algorithms on structural MRI (sMRI) images taken from the ADNI database, face clustering, and facial expression clustering demonstrate the effectiveness and feasibility of the proposed pinTSVC model.
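The difference between the two loss functions can be shown with a minimal sketch. It uses the common textbook forms of the hinge loss and the pinball loss as functions of a margin-violation term `u`. How `u` is defined inside the pinTSVC plane clustering formulation is not given in the abstract, so the argument `u` and the parameter `tau` are assumptions here.

```python
def hinge_loss(u):
    """Hinge loss: penalizes only positive values of u."""
    return max(0.0, u)


def pinball_loss(u, tau):
    """Pinball loss: also penalizes negative u, scaled by tau in [0, 1]."""
    return u if u >= 0 else -tau * u


# Compare both losses on a few illustrative values of u.
comparison = [(u, hinge_loss(u), pinball_loss(u, tau=0.5))
              for u in (-2.0, 0.0, 2.0)]
```

With `tau = 0` the pinball loss reduces to the hinge loss. A positive `tau` makes the loss depend on all points rather than only on those with positive `u`, which is the usual explanation for its lower sensitivity to noise.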

