Fast Fuzzy Search for Mixed Data Using Locality Sensitive Hashing

2013 ◽  
Vol 462-463 ◽  
pp. 321-325 ◽  
Author(s):  
Kyung Mi Lee ◽  
Keon Myung Lee

The drastic increase in data volume strongly demands efficient techniques for finding data similar to a query. It is sometimes useful to specify the data of interest with fuzzy constraints. When data objects contain both numerical and categorical attributes, it is usually not easy to define a commonly accepted distance measure between them. Without an efficient indexing structure, searching for specific data objects is costly because a linear scan must be conducted over the whole data set. This paper proposes a method that combines the locality-sensitive hashing technique with fuzzy constrained queries to retrieve data of interest from big data. The method builds a locality-sensitive hashing-based indexing structure over the continuous attributes only, collects a small number of candidate data objects against which the query is examined, and then evaluates each candidate's degree of satisfaction of the fuzzy constrained query, so that the data objects satisfying the query are determined.
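As a rough illustration of the idea (not the authors' implementation), the sketch below builds a random-hyperplane LSH index over the continuous attributes and scores candidates with a triangular fuzzy membership function; the function names and the triangular membership form are assumptions.

```python
import random

def make_lsh(dim, num_bits, seed=0):
    # random-hyperplane LSH: each bit is the sign of a random projection
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
    def h(vec):
        return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0.0)
                     for plane in planes)
    return h

def build_index(points, h):
    # bucket object ids by hash key; similar points tend to share buckets,
    # so only one bucket needs to be scanned per query instead of the whole set
    table = {}
    for i, p in enumerate(points):
        table.setdefault(h(p), []).append(i)
    return table

def triangular_degree(value, low, peak, high):
    # satisfaction degree of a fuzzy constraint "value is about `peak`"
    if value <= low or value >= high:
        return 0.0
    if value <= peak:
        return (value - low) / (peak - low)
    return (high - value) / (high - peak)
```

A query would hash its continuous part, fetch the matching bucket, and keep only candidates whose membership degree exceeds a threshold.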


Nowadays, clustering plays a vital role in big data, where it is very difficult to analyze and cluster large volumes of data. Clustering is a procedure for grouping similar data objects of a data set, ensuring high intra-cluster similarity within each cluster and low inter-cluster similarity between clusters. Clustering is used in statistical analysis, geographical maps, biological cell analysis, and Google Maps. The main approaches to clustering include grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. In this survey paper we focus on these algorithms for large data sets such as big data and report a comparison among them, using time complexity as the main metric to differentiate the algorithms.


2020 ◽  
Vol 10 (7) ◽  
pp. 2539 ◽  
Author(s):  
Toan Nguyen Mau ◽  
Yasushi Inoguchi

It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms that map similar data items to the same bucket to advance the search have been proposed. Locality-Sensitive Hashing (LSH) is a common approach for reducing the number of dimensions of a data set, by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data/items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm with a dynamically structured hash table, optimized for storage in the main memory and General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as songs, images, or text databases. The DLSH algorithm works effectively with data sets that are updated with high frequency and is compatible with parallel processing. However, the use of a single GPGPU device for processing big data is inadequate, due to the small memory capacity of GPGPU devices. When using multiple GPGPU devices for searching, we need an effective search algorithm to balance the jobs. In this paper, we propose an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to adapt our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
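A host-side sketch of one plausible balancing strategy for searching across multiple devices (an assumption for illustration, not necessarily the paper's exact scheme): hash buckets are distributed round-robin across device shards, and a query key is probed on every shard with the candidate lists merged on the host.

```python
def shard_buckets(table, num_devices):
    # distribute hash buckets round-robin across devices so that no single
    # device has to hold the whole LSH table in its limited memory
    shards = [{} for _ in range(num_devices)]
    for i, (key, ids) in enumerate(sorted(table.items())):
        shards[i % num_devices][key] = ids
    return shards

def broadcast_lookup(shards, key):
    # each device probes its own shard in parallel; the host merges results
    candidates = []
    for shard in shards:
        candidates.extend(shard.get(key, []))
    return candidates
```

In a real multi-GPGPU system the per-shard probe would run on the device and only candidate ids would cross the bus; this sequential loop just shows the data flow.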


2008 ◽  
pp. 942-962
Author(s):  
T. Warren Liao

In this chapter, we present genetic algorithm (GA) based methods developed for clustering univariate time series with equal or unequal length as an exploratory step of data mining. These methods basically implement the k-medoids algorithm. Each chromosome encodes in binary the data objects serving as the k-medoids. To compare their performance, both fixed-parameter and adaptive GAs were used. We first employed the synthetic control chart data set to investigate the performance of three fitness functions, two distance measures, and other GA parameters such as population size, crossover rate, and mutation rate. Two more sets of time series with or without known number of clusters were also experimented: one is the cylinder-bell-funnel data and the other is the novel battle simulation data. The clustering results are presented and discussed.
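The chromosome decoding and fitness evaluation described above can be sketched as follows (a minimal illustration: the repair of invalid chromosomes, the GA loop itself, and the particular distance function are omitted or assumed):

```python
def decode(chromosome):
    # binary encoding: set bits mark the data objects chosen as medoids
    return [i for i, bit in enumerate(chromosome) if bit]

def fitness(data, chromosome, dist):
    # total within-cluster distance: each object joins its nearest medoid;
    # lower is better, so the GA would minimize this value
    medoids = decode(chromosome)
    if not medoids:
        return float("inf")
    return sum(min(dist(x, data[m]) for m in medoids) for x in data)
```

For example, with one-dimensional series summaries and absolute difference as the distance, a chromosome placing one medoid in each natural group scores better than one placing both medoids in the same group.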


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables only. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analyzing, and visualizing mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized data clustering: the higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
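One way to picture the weighted mixed-data distance underlying such a map (an illustrative sketch under our own assumptions, not the paper's exact formulation): squared-difference terms for continuous variables and 0/1 mismatch terms for binary ones, each scaled by a learned weight.

```python
def weighted_mixed_distance(x, y, weights, is_binary):
    # per-variable weights scale each term; binary variables contribute a
    # 0/1 mismatch, continuous variables a squared difference
    total = 0.0
    for xv, yv, w, binary in zip(x, y, weights, is_binary):
        if binary:
            total += w * (0.0 if xv == yv else 1.0)
        else:
            total += w * (xv - yv) ** 2
    return total
```

During learning, raising a variable's weight makes disagreements on that variable dominate the distance, which is how weighting steers the clustering.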


2021 ◽  
Vol 19 (1) ◽  
pp. 2-20
Author(s):  
Piyush Kant Rai ◽  
Alka Singh ◽  
Muhammad Qasim

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented, and the calibrated weights based on different distance functions are derived. A simulation study has been carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to confirm the superiority of the proposed method.
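For the common chi-square distance, the calibrated weights have a closed form; the sketch below handles a single auxiliary variable within one stratum (the article's two-variable stratified setting generalizes this, and the function name and q-weights here are assumptions):

```python
def calibrate_chi_square(d, x, x_total, q=None):
    # minimize sum((w_i - d_i)**2 / (d_i * q_i))
    # subject to sum(w_i * x_i) = x_total (the known auxiliary total);
    # the solution is w_i = d_i * (1 + q_i * x_i * lam)
    if q is None:
        q = [1.0] * len(d)
    resid = x_total - sum(di * xi for di, xi in zip(d, x))
    lam = resid / sum(di * qi * xi * xi for di, xi, qi in zip(d, x, q))
    return [di * (1.0 + qi * xi * lam) for di, xi, qi in zip(d, x, q)]
```

The calibrated weights stay close to the design weights `d` while exactly reproducing the auxiliary total, which is the defining property of a calibration estimator.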


2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Clustering analysis opens a new direction in data mining that has major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not adequately address some of the requirements and have failed to standardize clustering algorithms that support all real applications. Many clustering methods depend on user-specified parameters, and the initial cluster seeds are randomly selected by the user. In this paper, we propose a new clustering method based on a linear approximation of the clustering function: using overall knowledge of the function's behavior, we pick the initial cluster seeds as points on the linear approximation line and then perform the clustering operations, unlike traditional clustering methods that group data objects into clusters using distance measures, similarity measures, and statistical distributions. Experimental results on an example of business data show that clusters based on linear approximation yield good results in practice. The method also supports privacy-preserving clustering of sensitive data objects.
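A minimal sketch of the seeding idea as we read it (a least-squares line fitted through 2-D data, with k equally spaced points on that line taken as the initial cluster seeds; all details here are our assumptions):

```python
def fit_line(points):
    # ordinary least squares fit: y = a + b * x
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    b = sxy / sxx
    return my - b * mx, b

def line_seeds(points, k):
    # k equally spaced points on the fitted line, used as initial seeds
    # instead of the random seeds of traditional methods
    a, b = fit_line(points)
    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (k - 1)
    return [(lo + i * step, a + b * (lo + i * step)) for i in range(k)]
```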


SPE Journal ◽  
2021 ◽  
pp. 1-25
Author(s):  
Chang Gao ◽  
Juliana Y. Leung

Summary The steam-assisted gravity drainage (SAGD) recovery process is strongly impacted by the spatial distributions of heterogeneous shale barriers. Though detailed compositional flow simulators are available for SAGD recovery performance evaluation, the simulation process is usually quite computationally demanding, rendering their use over a large number of reservoir models for assessing the impacts of heterogeneity (uncertainties) impractical. In recent years, data-driven proxies have been widely proposed to reduce the computational effort; nevertheless, the proxy must be trained using a large data set consisting of many flow simulation cases that ideally span the model parameter spaces. The question remains: is there a more efficient way to screen a large number of heterogeneous SAGD models? Such techniques could help to construct a training data set with less redundancy; they can also be used to quickly identify a subset of heterogeneous models for detailed flow simulation. In this work, we formulated two particular distance measures, flow-based and static-based, to quantify the similarity among a set of 3D heterogeneous SAGD models. First, to formulate the flow-based distance measure, a physics-based particle-tracking model is used: Darcy's law and energy balance are integrated to mimic the steam chamber expansion process; steam particles that are located at the edge of the chamber release their energy to the surrounding cold bitumen, while detailed fluid displacements are not explicitly simulated. The steam chamber evolution is modeled, and a flow-based distance between two given reservoir models is defined as the difference in their chamber sizes over time. Second, to formulate the static-based distance, the Hausdorff distance (Hausdorff 1914) is used: it is often used in image processing to compare two images according to the spatial arrangement and shapes of their various objects.
A suite of 3D models is constructed using representative petrophysical properties and operating constraints extracted from several pads in Suncor Energy's Firebag project. The computed distance measures are used to partition the models into different groups. To establish a baseline for comparison, flow simulations are performed on these models to predict the actual chamber evolution and production profiles. The grouping results according to the proposed flow- and static-based distance measures match reasonably well with those obtained from detailed flow simulations, and significant improvement in computational efficiency is achieved with the proposed techniques. They can be used to efficiently screen a large number of reservoir models and facilitate the clustering of these models into groups with distinct shale heterogeneity characteristics. The approach presents significant potential to be integrated with other data-driven approaches for reducing the computational load typically associated with detailed flow simulations involving multiple heterogeneous reservoir realizations.
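The Hausdorff distance used for the static-based measure is straightforward to state in code (a generic point-set version for illustration; the authors apply it to spatial arrangements of shale objects):

```python
import math

def hausdorff(a, b):
    # symmetric Hausdorff distance between two finite point sets:
    # the worst-case distance from a point in one set to the nearest
    # point of the other set
    def directed(p_set, q_set):
        return max(min(math.dist(p, q) for q in q_set) for p in p_set)
    return max(directed(a, b), directed(b, a))
```

Two models whose shale objects occupy similar positions and shapes yield a small Hausdorff distance, which is what makes it usable for grouping realizations.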


Author(s):  
Avinash Navlani ◽  
V. B. Gupta

In the last couple of decades, clustering has become a crucial research problem in the data mining research community. Clustering refers to the partitioning of data objects, such as records and documents, into groups or clusters of similar characteristics. Clustering is unsupervised learning, and because of this unsupervised nature there is no unique solution for all problems. Complex data sets often require explanation through multiple clusterings, yet traditional clustering approaches generate a single clustering. A data set may contain more than one pattern, and each pattern can be interesting from a different perspective. Alternative clustering intends to find all the unlike groupings of the data set such that each grouping has high quality and is distinct from the others. This chapter gives an overall view of alternative clustering: its various approaches, related work, a comparison with easily confused related terms such as subspace, multi-view, and ensemble clustering, as well as applications, issues, and challenges.


Author(s):  
Abha Sharma ◽  
R. S. Thakur

Analyzing and clustering mixed data sets is a complex problem. Very useful clustering algorithms such as k-means, fuzzy c-means, and hierarchical methods were developed to extract hidden groups from numeric data. In this paper, the mixed data is converted into purely numeric data with a conversion method, and the various algorithms for numeric data are applied to several well-known mixed data sets to exploit the inherent structure of the mixed data. Experimental results show how smoothly the mixed data yields better results with universally applicable clustering algorithms for numeric data.
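One common conversion of this kind (a sketch; the paper's specific method may differ) is one-hot encoding of the categorical columns so that any numeric clustering algorithm can then be applied:

```python
def one_hot_mixed(rows, categorical_cols):
    # replace each categorical column by one 0/1 indicator per observed
    # level, leaving numeric columns unchanged
    levels = {c: sorted({row[c] for row in rows}) for c in categorical_cols}
    out = []
    for row in rows:
        rec = []
        for c, value in enumerate(row):
            if c in categorical_cols:
                rec.extend(1.0 if value == lv else 0.0 for lv in levels[c])
            else:
                rec.append(float(value))
        out.append(rec)
    return rec and out
```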

