Proximity Graphs for Similarity Searches: Experimental Survey and the New Connected-Partition Approach HGraph

Author(s):  
Larissa C. Shimomura ◽  
Daniel S. Kaster

Similarity searching is a widely used approach to retrieving complex data (images, videos, time series, etc.): it aims to retrieve data that are similar to a query according to the data's intrinsic characteristics. Recently, graph-based methods have emerged as a very efficient alternative for similarity retrieval, with reports indicating that they have outperformed methods of other categories in several situations. This work presents two main contributions to graph-based methods for similarity searches. The first is a survey of the main graph types currently employed for similarity searches and an experimental evaluation of the most representative graphs on a common platform, covering both exact and approximate search algorithms. The second is a new graph-based method called HGraph, a connected-partition approach to building a proximity graph and answering similarity searches. Both contributions and their results were published and received awards at international conferences.
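To make the graph-based search idea concrete, below is a minimal Python sketch of greedy best-first traversal over a generic proximity graph. It is not the published HGraph algorithm; the adjacency structure, entry vertex, and toy vectors are assumptions for illustration only.

```python
import heapq

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_knn(graph, vectors, query, entry, k=5, dist=euclidean):
    """Best-first traversal of a proximity graph (a generic sketch of graph-based
    similarity search, not the published HGraph method): starting from an entry
    vertex, expand the closest unvisited frontier vertex and collect the k nearest."""
    visited = {entry}
    frontier = [(dist(vectors[entry], query), entry)]   # min-heap of candidate vertices
    best = []                                           # max-heap (negated) of current k best
    while frontier:
        d, v = heapq.heappop(frontier)
        if len(best) == k and d > -best[0][0]:
            break                                       # every remaining candidate is worse than the k-th best
        heapq.heappush(best, (-d, v))
        if len(best) > k:
            heapq.heappop(best)
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                heapq.heappush(frontier, (dist(vectors[u], query), u))
    return sorted((-nd, v) for nd, v in best)

# Hypothetical toy index: a small adjacency dict as produced by some graph-building step.
vectors = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (2, 1), 4: (5, 5)}
graph = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
print(greedy_knn(graph, vectors, query=(2.2, 0.4), entry=0, k=3))
```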

2018 ◽  
Vol 11 (6) ◽  
pp. 2033-2048 ◽  
Author(s):  
Richard Hyde ◽  
Ryan Hossaini ◽  
Amber A. Leeson

Abstract. Clustering – the automated grouping of similar data – can provide powerful and unique insight into large and complex data sets, in a fast and computationally efficient manner. While clustering has been used in a variety of fields (from medical image processing to economics), its application within atmospheric science has been fairly limited to date, and the potential benefits of the application of advanced clustering techniques to climate data (both model output and observations) have yet to be fully realised. In this paper, we explore the specific application of clustering to a multi-model climate ensemble. We hypothesise that clustering techniques can provide (a) a flexible, data-driven method of testing model–observation agreement and (b) a mechanism with which to identify model development priorities. We focus our analysis on chemistry–climate model (CCM) output of tropospheric ozone – an important greenhouse gas – from the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). Tropospheric column ozone from the ACCMIP ensemble was clustered using the Data Density based Clustering (DDC) algorithm. We find that a multi-model mean (MMM) calculated using members of the most-populous cluster identified at each location offers a reduction of up to ∼ 20 % in the global absolute mean bias between the MMM and an observed satellite-based tropospheric ozone climatology, with respect to a simple, all-model MMM. On a spatial basis, the bias is reduced at ∼ 62 % of all locations, with the largest bias reductions occurring in the Northern Hemisphere – where ozone concentrations are relatively large. However, the bias is unchanged at 9 % of all locations and increases at 29 %, particularly in the Southern Hemisphere. The latter demonstrates that although cluster-based subsampling acts to remove outlier model data, such data may in fact be closer to observed values in some locations. We further demonstrate that clustering can provide a viable and useful framework in which to assess and visualise model spread, offering insight into geographical areas of agreement among models and a measure of diversity across an ensemble. Finally, we discuss caveats of the clustering techniques and note that while we have focused on tropospheric ozone, the principles underlying the cluster-based MMMs are applicable to other prognostic variables from climate models.
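As an illustration of the cluster-based multi-model mean (MMM) idea, the sketch below averages only the most-populous group of ensemble values at a single grid cell. The simple gap-based grouping is a stand-in for the DDC algorithm used in the paper, and the ozone values are invented for the example.

```python
import numpy as np

def cluster_based_mmm(values, gap=2.0):
    """At one grid cell, group the ensemble values by splitting at gaps larger than
    `gap` (a crude stand-in for DDC clustering) and return the mean of the
    most-populous group instead of the all-model mean."""
    order = np.sort(np.asarray(values, dtype=float))
    splits = np.where(np.diff(order) > gap)[0] + 1   # indices where a new group starts
    groups = np.split(order, splits)
    largest = max(groups, key=len)                   # most-populous cluster
    return largest.mean()

# hypothetical tropospheric-column-ozone values (DU) from a 7-member ensemble at one cell
ensemble = np.array([28.1, 28.4, 28.9, 29.0, 29.2, 34.5, 35.1])
print("all-model MMM:     ", ensemble.mean())
print("cluster-based MMM: ", cluster_based_mmm(ensemble, gap=2.0))
```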


2018 ◽  
Vol 28 (03) ◽  
pp. 227-253
Author(s):  
Fabrizio d’Amore ◽  
Paolo G. Franciosa

In this paper, we study the problem of designing robust algorithms for computing the minimum spanning tree, the nearest neighbor graph, and the relative neighborhood graph of a set of points in the plane, under the Euclidean metric. We use the term “robust” to denote an algorithm that can properly handle degenerate configurations of the input (such as co-circularities and collinearities) and that is not affected by errors in the flow of control due to round-off approximations. Existing asymptotically optimal algorithms that compute such graphs are either suboptimal in terms of the arithmetic precision required for the implementation, or cannot handle degeneracies, or are based on complex data structures. We present a unified approach to the robust computation of the above graphs. The approach is a variant of the general region approach for the computation of proximity graphs based on Yao graphs, first introduced in Ref. 43 (A. C.-C. Yao, On constructing minimum spanning trees in k-dimensional spaces and related problems, SIAM J. Comput. 11(4) (1982) 721–736). We show that a sparse supergraph of these geometric graphs can be computed in asymptotically optimal time and space, requiring only double precision arithmetic, which is proved to be optimal. The arithmetic complexity of the approach is measured by using the notion of degree, introduced in Ref. 31 (G. Liotta, F. P. Preparata and R. Tamassia, Robust proximity queries: An illustration of degree-driven algorithm design, SIAM J. Comput. 28(3) (1998) 864–889) and Ref. 3 (J. D. Boissonnat and F. P. Preparata, Robust plane sweep for intersecting segments, SIAM J. Comput. 29(5) (2000) 1401–1421). As a side effect of our results, we solve a question left open by Katajainen in Ref. 27 (J. Katajainen, The region approach for computing relative neighborhood graphs in the L_p metric, Computing 40 (1987) 147–161) about the existence of a subquadratic algorithm, based on the region approach, that computes the relative neighborhood graph of a set of points [Formula: see text] in the plane under the [Formula: see text] metric.
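The region (Yao-graph) idea underlying the approach can be sketched as follows: around each point the plane is split into equal angular cones, and an edge is added to the nearest point in each non-empty cone, which for sufficiently many cones (and points in general position) yields a sparse supergraph containing the Euclidean MST. The Python sketch below is a brute-force O(n²) illustration, not the degree-optimal robust algorithm of the paper.

```python
import math

def yao_graph(points, cones=6):
    """Region approach sketch: for each point, keep the nearest other point in each
    of `cones` equal angular sectors and add an edge to it."""
    edges = set()
    for i, (px, py) in enumerate(points):
        best = {}                                   # sector index -> (squared distance, point index)
        for j, (qx, qy) in enumerate(points):
            if i == j:
                continue
            angle = math.atan2(qy - py, qx - px) % (2 * math.pi)
            sector = int(angle / (2 * math.pi) * cones)
            d2 = (qx - px) ** 2 + (qy - py) ** 2
            if sector not in best or d2 < best[sector][0]:
                best[sector] = (d2, j)
        for _, j in best.values():
            edges.add((min(i, j), max(i, j)))       # undirected edge
    return edges

print(yao_graph([(0, 0), (1, 0), (2, 1), (0, 2)]))
```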


2013 ◽  
Vol 3 (4) ◽  
pp. 84-104 ◽  
Author(s):  
Michail Kazimianec ◽  
Nikolaus Augsten

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster a so-called proximity graph is computed, and the cluster border is detected based on the proximity graph. However, the computation of the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate the proximity graph using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques like k-means, density-based, or hierarchical clustering. In this article the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors experimentally show that their solutions are faster even if the sampling-based algorithms use very small sample sizes. The authors provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.
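As a rough illustration of what a proximity graph of a candidate cluster looks like, the sketch below counts how many strings fall at each edit-distance radius from a chosen center. It is a simplified reading of GPC for illustration only, not the PG-DS or PG-SM algorithms, and the sample strings are invented.

```python
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def proximity_graph(center, strings, max_radius=4):
    """Number of strings found at each edit-distance radius from the center string."""
    hist = Counter(levenshtein(center, s) for s in strings)
    return [hist.get(r, 0) for r in range(max_radius + 1)]

data = ["jones", "jonse", "joness", "smith", "smyth", "smiht"]
print(proximity_graph("jones", data))   # -> [1, 1, 1, 0, 0]: counts at radii 0..4 from "jones"
```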


Temporal data clustering examines time series data to determine its basic structure and other characteristics. Many methodologies simply process the temporal dimension of the data but still face challenges in extracting useful patterns from complex data types. In order to analyze complex temporal data, the Hybridized Gradient Descent Spectral Graph and Local-Global Louvain Clustering (HGDSG-LGLC) technique is designed. Temporal data are gathered from the input dataset. The HGDSG-LGLC technique then performs graph-based clustering, partitioning the vertices (i.e., the data) into different clusters depending on the spectrum of the similarity matrix. The distance similarity is measured between each data point and the cluster mean, and the gradient descent function finds the minimum distance between them. Next, the Local-Global Louvain method merges and filters the temporal data to connect the local and global edges of the graph with similar data. For each data point, the change in modularity is calculated to filter unwanted data out of its own cluster and merge it into the neighboring cluster. As a result, an optimal number 'k' of clusters is obtained with higher accuracy and a minimum error rate. Experimental analysis is performed with parameters such as clustering accuracy, error rate, computation time, and space complexity with respect to the number of temporal data. The proposed HGDSG-LGLC technique achieves higher clustering accuracy and a lower error rate, as well as less computation time and space complexity, than conventional methods.
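The modularity-change test that drives such Louvain-style merging can be illustrated directly: compute Newman modularity before and after moving a node between communities, and accept the move if the change is positive. The sketch below uses a hypothetical 4-node similarity graph, not the HGDSG-LGLC pipeline.

```python
import numpy as np

def modularity(adj, labels):
    """Newman modularity Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * [c_i == c_j],
    where `adj` is a symmetric weight matrix and `labels` gives a community per node."""
    adj = np.asarray(adj, dtype=float)
    k = adj.sum(axis=1)                      # (weighted) degree of each node
    two_m = adj.sum()                        # 2m for an undirected graph
    same = np.equal.outer(labels, labels)    # True where two nodes share a community
    return ((adj - np.outer(k, k) / two_m) * same).sum() / two_m

# hypothetical 4-node temporal-similarity graph: two obvious communities {0,1} and {2,3}
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
before = modularity(A, [0, 0, 0, 1])        # node 2 misplaced with {0, 1}
after  = modularity(A, [0, 0, 1, 1])        # node 2 merged into its neighbouring cluster
print(f"delta Q = {after - before:+.3f}")   # positive gain -> accept the merge
```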


Author(s):  
Igor Aleksandrov ◽ 
Vladimir Fomin

Introduction: The similarity search paradigm is used in various computational tasks, such as classification, data mining, pattern recognition, etc. Currently, tree-like metric access methods occupy a significant place among search algorithms. The classical problem of reducing the time of similarity search in metric spaces remains relevant for modern systems processing big, complex data. Because the problem of search algorithm effectiveness is multidimensional, local research in this direction is in demand and constantly brings useful results. Purpose: To reduce the computational complexity of tree search algorithms in problems involving metric proximity. Results: We developed a search algorithm for a multi-vantage-point tree based on a priority node-processing queue. We mathematically formalized the problems of additional calculations and the ways to solve them. To improve the performance of similarity search, we proposed procedures for forming a priority queue of nodes to process and for reducing the number of intersections of same-level nodes. Structural changes in the multi-vantage-point tree and the use of minimum distances between vantage points and node subtrees provide better search efficiency. More accurate determination of the distance from the query object to the nodes, and of whether the search region intersects a tree node, reduces the amount of calculation. Practical relevance: The resulting search algorithms need less time to process information at the cost of an insignificant increase in memory requirements. Reducing the information processing time expands the application boundaries of tree-based metric indexing methods in search problems involving large data sets.
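A best-first metric-tree search of this kind can be sketched with a priority queue ordered by triangle-inequality lower bounds. The code below uses a plain single-vantage-point VP-tree rather than the multi-vantage-point structure of the paper, so it only illustrates the priority-queue and pruning ideas; the class and function names and the toy data are assumptions.

```python
import heapq, random

class VPNode:
    def __init__(self, items, dist):
        self.vp = items[0]                                   # vantage point stored at this node
        rest = items[1:]
        self.inside = self.outside = None
        self.mu = 0.0
        if rest:
            dists = [dist(self.vp, x) for x in rest]
            self.mu = sorted(dists)[len(dists) // 2]         # median split radius
            inner = [x for x, d in zip(rest, dists) if d <= self.mu]
            outer = [x for x, d in zip(rest, dists) if d > self.mu]
            self.inside = VPNode(inner, dist) if inner else None
            self.outside = VPNode(outer, dist) if outer else None

def knn(root, q, k, dist):
    """Best-first k-NN search: nodes are popped from a priority queue ordered by a
    triangle-inequality lower bound on the distance from q to anything in the node."""
    heap = [(0.0, 0, root)]                      # (lower bound, tiebreak, node)
    best = []                                    # max-heap (negated) of current k best
    counter = 1
    while heap:
        bound, _, node = heapq.heappop(heap)
        if len(best) == k and bound >= -best[0][0]:
            break                                # no remaining node can improve the k-th best
        d = dist(q, node.vp)
        heapq.heappush(best, (-d, node.vp))
        if len(best) > k:
            heapq.heappop(best)
        for child, lb in ((node.inside, max(0.0, d - node.mu)),
                          (node.outside, max(0.0, node.mu - d))):
            if child is not None:
                heapq.heappush(heap, (lb, counter, child))
                counter += 1
    return sorted((-nd, x) for nd, x in best)

pts = [(random.random(), random.random()) for _ in range(200)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
tree = VPNode(pts, euclid)
print(knn(tree, (0.5, 0.5), 3, euclid))
```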


Author(s):  
Imad Khoury ◽  
Godfried Toussaint ◽  
Antonio Ciampi ◽  
Isadora Antoniano

Clustering is considered the most important aspect of unsupervised learning in data mining. It deals with finding structure in a collection of unlabeled data. One simple way of defining clustering is as follows: the process of organizing data elements into groups, called clusters, whose members are similar to each other in some way. Several algorithms for clustering exist (Gan, Ma, & Wu, 2007); proximity-graph-based ones, which are untraditional from the point of view of statisticians, emanate from the field of computational geometry and are powerful and often elegant (Bhattacharya, Mukherjee, & Toussaint, 2005). A proximity graph is a graph formed from a collection of elements, or points, by connecting with an edge those pairs of points that satisfy a particular neighbor relationship with each other. One key aspect of proximity-graph-based clustering techniques is that they may allow for an easy and clear visualization of data clusters, given their geometric nature. Proximity graphs have been shown to improve typical instance-based learning algorithms such as the k-nearest neighbor classifiers in the typical nonparametric approach to classification (Bhattacharya, Mukherjee, & Toussaint, 2005). Furthermore, the most powerful and robust methods for clustering turn out to be those based on proximity graphs (Koren, North, & Volinsky, 2006). Many examples have been shown where proximity-graph-based methods perform very well when traditional methods fail miserably (Zahn, 1971; Choo, Jiamthapthaksin, Chen, Celepcikay, Giusti, & Eick, 2007).
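A classic example of such proximity-graph clustering, in the spirit of Zahn (1971) cited above, is to build the Euclidean minimum spanning tree and delete its longest edges so that the remaining connected components form the clusters. The sketch below is a minimal brute-force Python version suitable only for small point sets.

```python
import math
from itertools import combinations

def mst_clusters(points, n_clusters=2):
    """Zahn-style clustering sketch: build the Euclidean MST with Kruskal's algorithm,
    drop its n_clusters-1 longest edges, and return the connected components."""
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    mst = []
    for w, i, j in edges:                        # Kruskal: keep an edge if it joins two trees
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    keep = sorted(mst)[:-(n_clusters - 1)] if n_clusters > 1 else mst
    parent = list(range(len(points)))            # rebuild components from the kept edges
    for _, i, j in keep:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(mst_clusters(pts, n_clusters=2))           # -> two well-separated groups
```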


2021 ◽  
Author(s):  
Shahriar Shirvani Moghaddam ◽  
Kiaksar Shirvani Moghaddam

Abstract Designing an efficient data sorting algorithm that requires little time and space is essential for large data sets in wireless networks, the Internet of Things, data mining systems, computer science, and communications engineering. This paper proposes a low-complexity data sorting algorithm that distinguishes already sorted/similar data, makes independent subarrays, and sorts the subarrays' data using one of the popular sorting algorithms. It is proved that a mean-based pivot is as efficient as a median-based pivot for making equal-length subarrays. Numerical analyses indicate slight improvements in the elapsed time and the number of swaps of the proposed serial Merge-based and Quick-based algorithms compared to the conventional ones, for low/high-variance integer/non-integer uniform/Gaussian data of different lengths. Moreover, owing to the gradual data extraction feature, the sorted parts can be extracted sequentially before the sorting process ends. Making independent subarrays also provides a general framework for the parallel realization of sorting algorithms with separate parts. Simulation results indicate the effectiveness of the proposed parallel Merge-based and Quick-based algorithms compared to conventional serial and multi-core parallel algorithms. Finally, the complexity of the proposed algorithm in both serial and parallel realizations is analyzed, showing an impressive improvement.
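A minimal sketch of the mean-pivot partitioning idea follows: split the array around its arithmetic mean into independent subarrays, recurse or sort each piece, and concatenate. This illustrates the general idea under assumed simplifications, not the authors' exact Merge-based or Quick-based algorithms; the parallel variant simply hands the first-level subarrays to separate workers.

```python
from concurrent.futures import ProcessPoolExecutor

def mean_split_sort(data, min_size=4):
    """Split around the arithmetic mean into independent subarrays, recurse until the
    pieces are small, then sort each piece and concatenate.  Each finished left piece
    could be emitted immediately (the 'gradual data extraction' the abstract mentions)."""
    if len(data) <= min_size:
        return sorted(data)
    pivot = sum(data) / len(data)
    left = [x for x in data if x <= pivot]
    right = [x for x in data if x > pivot]
    if not left or not right:                    # all elements equal: nothing left to split
        return sorted(data)
    return mean_split_sort(left, min_size) + mean_split_sort(right, min_size)

def parallel_mean_split_sort(data, workers=2):
    """The first-level subarrays are independent, so they can be sorted in parallel."""
    pivot = sum(data) / len(data)
    parts = [[x for x in data if x <= pivot], [x for x in data if x > pivot]]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        left, right = pool.map(sorted, parts)
    return left + right

print(mean_split_sort([7, 3, 9, 1, 14, 2, 8, 5, 11]))
```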


Author(s):  
Etienne de Harven ◽  
Nina Lampen

Samples of heparinized blood, or bone marrow aspirates, or cell suspensions prepared from biopsied tissues (nodes, spleen, etc.) are routinely prepared, after Ficoll-Hypaque concentration of the mononuclear leucocytes, for scanning electron microscopy. One drop of the cell suspension is placed in a moist chamber on a poly-l-lysine pretreated plastic coverslip (Mazia et al., J. Cell Biol. 66:198-199, 1975) and fifteen minutes allowed for cell attachment. Fixation, started in 2.5% glutaraldehyde in culture medium at room temperature for 30 minutes, is continued in the same fixative at 4°C overnight or longer. Ethanol dehydration is immediately followed by drying at the critical point of CO2 or of Freon 13. An efficient alternative method for ethanol-dehydrated cells is to dry the cells at low temperature (-75°C) under vacuum (10⁻² Torr) for 30 minutes in an Edwards-Pearse freeze-dryer (de Harven et al., SEM/IITRI/1977, 519-524). This is preceded by fast quenching in supercooled ethanol (between -90 and -100°C).


Author(s):  
R.L. Pinto ◽  
R.M. Woollacott

The basal body and its associated rootlet are the organelles responsible for anchoring the flagellum or cilium in the cytoplasm. Structurally, the common denominators of the basal apparatus are the basal body, a basal foot from which microtubules or microfilaments emanate, and a striated rootlet. A study of the basal apparatus from cells of the epidermis of a sponge larva was initiated to provide a comparison with similar data on adult sponges. Sexually mature colonies of Aplysilla sp. were collected from Keehi Lagoon Marina, Honolulu, Hawaii. Larvae were fixed in 2.5% glutaraldehyde and 0.14 M NaCl in 0.2 M Millonig’s phosphate buffer (pH 7.4). Specimens were postfixed in 1% OsO4 in 1.25% sodium bicarbonate (pH 7.2) and embedded in epoxy resin. The larva of Aplysilla sp. was previously described (as Dendrilla cactus) based on live observations and SEM by Woollacott and Hadfield.

