A technique for parallel query optimization using MapReduce framework and a semantic-based clustering method

Query optimization is the process of identifying the best Query Execution Plan (QEP). The query optimizer produces a close-to-optimal QEP for the given queries based on minimum resource usage. The problem is that a given query has many equivalent execution plans, each with a corresponding execution cost, so producing an effective query plan requires examining a large number of alternatives. Access plan recommendation is an alternative technique to database query optimization that reuses previously generated QEPs to execute new queries. In this technique, the query optimizer uses clustering methods to identify groups of similar queries. However, clustering such large datasets is challenging for traditional clustering algorithms because of the long processing time. Numerous cloud-based platforms, such as Hadoop, Hive, and Pig, offer low-cost solutions for processing distributed queries. This paper applies and tests a model for clustering large query datasets of various sizes in parallel using MapReduce. The results demonstrate that the parallel implementation of query workload clustering achieves good scalability.
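
The abstract does not say which clustering algorithm or similarity measure the paper uses, so the following is only a minimal sketch of how one clustering iteration can be split into map and reduce phases. It assumes k-means with Euclidean distance over numeric query feature vectors; the function names, the two-way data split, and the sample vectors are all hypothetical.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for each point in one split."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans_mapreduce(splits, centroids, iterations=10):
    """Run k-means; each split stands in for the input of one mapper."""
    for _ in range(iterations):
        pairs = [kv for split in splits for kv in map_phase(split, centroids)]
        centroids = reduce_phase(pairs, centroids)
    return centroids


# Hypothetical query feature vectors, divided between two "workers".
splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
result = kmeans_mapreduce(splits, [(0.0, 0.0), (10.0, 10.0)])
```

In a real MapReduce job the map calls would run on separate nodes and the framework would group the emitted pairs by key before the reduce step; here both phases run in one process purely for illustration.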

Learning with Adaptive Neighbors for Image Clustering

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/344 ◽

2018 ◽

Cited By ~ 2

Author(s):

Yang Liu ◽

Quanxue Gao ◽

Zhaohua Yang ◽

Shujian Wang

Keyword(s):

State Of The Art ◽

Clustering Algorithms ◽

Original Data ◽

Image Clustering ◽

Complex Structures ◽

Clustering Methods ◽

Proposed Model ◽

Data Graph ◽

The Given ◽

Optimal Graph

Due to the importance and efficiency of learning complex structures hidden in data, graph-based methods have been widely studied and have proved successful in unsupervised learning. Most existing graph-based clustering methods require post-processing on the original data graph to extract the clustering indicators. However, these methods have two drawbacks: (1) the cluster structures are not explicit in the clustering results; (2) the final clustering performance is sensitive to the construction of the original data graph. To solve these problems, this paper proposes a novel model that learns a graph from the given data graph such that the newly obtained optimal graph is more suitable for the clustering task. We also propose an efficient algorithm to solve the model. Extensive experimental results illustrate that the proposed model outperforms other state-of-the-art clustering algorithms.

FUZZY CLUSTERING METHODS IN MULTISPECTRAL SATELLITE IMAGE SEGMENTATION

International Journal of Computing ◽

10.47839/ijc.8.1.660 ◽

2014 ◽

pp. 87-94

Author(s):

Rauf Kh. Sadykhov ◽

Valentin V. Ganchenko ◽

Leonid P. Podenok

Keyword(s):

Fuzzy Clustering ◽

Nonlinear Filtering ◽

Parallel Implementation ◽

Clustering Algorithms ◽

Satellite Image ◽

Iteration Step ◽

Clustering Methods ◽

Image Size ◽

Landsat Images ◽

Multispectral Satellite Image

A segmentation method for subject processing of multispectral satellite images, based on fuzzy clustering and preliminary nonlinear filtering, is presented. Three fuzzy clustering algorithms, namely Fuzzy C-means, Gustafson-Kessel, and Gath-Geva, have been used. The experimental results obtained using these algorithms, with and without preliminary nonlinear filtering, to segment multispectral Landsat images confirm that segmentation based on fuzzy clustering provides good discrimination of different land cover types. The implementations of the Fuzzy C-means, Gustafson-Kessel, and Gath-Geva algorithms have linear computational complexity in the initial number of clusters and the image size for a single iteration step. They allow internal parallel implementation. Preliminary processing of the source channels with a nonlinear filter provides clearer cluster discrimination and, as a consequence, clearer segment outlines…
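
To illustrate why a single iteration step can be linear in the number of clusters and the image size, here is a minimal sketch of one standard Fuzzy C-means update. It is not the authors' implementation: it assumes single-channel pixel intensities, absolute difference as the distance, and a fuzzifier m = 2, and the function name `fcm_step` is hypothetical.

```python
def fcm_step(pixels, centers, m=2.0):
    """One Fuzzy C-means iteration: update memberships, then cluster centers."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in pixels:  # one pass over the image
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The pixel coincides with a center: share full membership there.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            # u_i is proportional to d_i^(-2/(m-1)); O(clusters) per pixel.
            row = [d ** -exponent for d in dists]
        total = sum(row)
        memberships.append([r / total for r in row])

    new_centers = []
    for i, old in enumerate(centers):
        weights = [row[i] ** m for row in memberships]
        weight_sum = sum(weights)
        if weight_sum == 0.0:
            new_centers.append(old)  # no pixel supports this cluster
        else:
            weighted = sum(w * x for w, x in zip(weights, pixels))
            new_centers.append(weighted / weight_sum)
    return memberships, new_centers


# Hypothetical intensities forming a dark group and a bright group.
pixels = [0.0, 1.0, 9.0, 10.0]
memberships, centers = fcm_step(pixels, [1.0, 9.0])
```

Each pixel's membership row depends only on that pixel and the current centers, so the first loop can be distributed over image tiles, which is one plausible reading of the internal parallelism the abstract mentions.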

SIMULATION SCHEME MODELING OF THE SUPER-SPEED TIME BUFFER

Электросвязь ◽

10.34832/elsv.2020.9.8.011 ◽

2020 ◽

Author(s):

А.М. САЖНЕВ ◽

Л.Г. РОГУЛИНА

Keyword(s):

High Speed ◽

High Efficiency ◽

Low Cost ◽

Circuit Modeling ◽

Clock Signal ◽

Behavioral Models ◽

Software Environment ◽

Time Buffer ◽

The Given ◽

Simulation Scheme

Simulation results are presented for an ultra-high-speed clock signal buffer based on gallium arsenide n-channel transistors, modeled in OrCAD, which fully meets the following requirements: high technical characteristics, application flexibility, low cost, small size, high frequency, and high efficiency. The given behavioral models allow the use of any software environment for circuit modeling.

Parallel Query Optimization

Encyclopedia of Database Systems ◽

10.1007/978-0-387-39940-9_1079 ◽

2009 ◽

pp. 2035-2038 ◽

Cited By ~ 1

Author(s):

Hans Zeller ◽

Goetz Graefe

Keyword(s):

Query Optimization ◽

Parallel Query

A Partial Optimization Approach for Privacy Preserving Frequent Itemset Mining

International Journal of Computational Models and Algorithms in Medicine ◽

10.4018/jcmam.2010072002 ◽

2010 ◽

Vol 1 (1) ◽

pp. 19-33

Author(s):

Shibnath Mukherjee ◽

Aryya Gangopadhyay ◽

Zhiyuan Chen

Keyword(s):

Low Cost ◽

Synthetic Data ◽

Frequent Itemset ◽

Optimization Approach ◽

Data Generator ◽

Hidden Cost ◽

Potential Benefits ◽

The Difference ◽

The Given ◽

Optimal Set

While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, would hide sensitive patterns while reducing, as much as possible, the accidental hiding of legitimate patterns and the damage done to the database. Their methodology allows users to adjust the weights assigned to benefit, in terms of the number of restrictive patterns hidden; cost, in terms of the number of legitimate patterns hidden; and damage to the database, in terms of the difference between the marginal frequencies of items in the original and sanitized databases. Most approaches to this problem in the literature are purely heuristic, with no formal treatment of optimality. While ILP has been used previously in a few works as a formal optimization approach, the novelty of this method is its extremely low cost-complexity model in contrast to the others. The authors implemented their methodology in C and C++ and ran several experiments with synthetic data generated with the IBM synthetic data generator. The experiments show excellent results when compared to those in the literature.

RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest

Frontiers in Genetics ◽

10.3389/fgene.2021.665843 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yuan Zhao ◽

Zhao-Yu Fang ◽

Cui-Xiang Lin ◽

Chao Deng ◽

Yun-Pei Xu ◽

...

Keyword(s):

Random Forest ◽

Single Cell ◽

Gene Selection ◽

Clustering Algorithms ◽

Selection Methods ◽

Clustering Methods ◽

Cell Type ◽

Cell Type Specificity ◽

Random Forest Classification ◽

Forest Classification

In recent years, single-cell RNA-seq (scRNA-seq) has become increasingly popular in fields such as biology and medical research. Analyzing scRNA-seq data can reveal complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods for analyzing scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy, so it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data but can also improve cell type identification in combination with clustering methods. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells obtained by these gene selection methods. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms are weak at grouping data with very high dimensionality, such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and the hyperparameters of each method. Based on the experimental results, PCA combined with the DBSCAN algorithm achieves the highest silhouette score, 0.8770, with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In testing with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
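
For readers unfamiliar with the silhouette score used as the comparison metric, the following is a minimal sketch of the standard definition. It assumes Euclidean distance over already-reduced feature vectors; the sample points and labels are hypothetical and are not the paper's data.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical two-feature points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Scores close to 1 indicate compact, well-separated clusters, while a labeling that mixes the two groups, as in `bad`, drives the score below zero.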

PRIVACY PRESERVING CLUSTERING BASED ON LINEAR APPROXIMATION OF FUNCTION

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v12i5.2914 ◽

2013 ◽

Vol 12 (5) ◽

pp. 3443-3451

Author(s):

Rajesh Pasupuleti ◽

Narsimha Gugulothu

Keyword(s):

Linear Approximation ◽

Clustering Algorithms ◽

Similarity Measures ◽

Privacy Preserving ◽

Distance Measures ◽

Clustering Methods ◽

Sensitive Data ◽

Processing Information ◽

Data Objects ◽

Approximation Of Function

Clustering analysis initiates a new direction in data mining that has a major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not adequately address some of the requirements and have failed to standardize clustering algorithms to support all real applications. Many clustering methods depend on user-specified parameters, and the initial seeds of clusters are randomly selected by the user. In this paper, we propose a new clustering method based on linear approximation of a function: we first obtain an overall idea of the behavior of the clustering function, then pick the initial seeds of clusters as points on the linear approximation line and perform clustering operations, unlike traditional clustering methods, which group data objects into clusters using distance measures, similarity measures, and statistical distributions. Experimental results, illustrated with an example of business data, show that clustering based on linear approximation yields good results in practice. The paper also explains privacy-preserving clustering of sensitive data objects.

Short Text Clustering Algorithms for Weibo Topic Detection

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.971-973.1747 ◽

2014 ◽

Vol 971-973 ◽

pp. 1747-1751 ◽

Cited By ~ 1

Author(s):

Lei Zhang ◽

Hai Qiang Chen ◽

Wei Jie Li ◽

Yan Zhao Liu ◽

Run Pu Wu

Keyword(s):

Text Analysis ◽

Semantic Information ◽

Clustering Algorithms ◽

Text Clustering ◽

Massive Data ◽

Topic Detection ◽

Clustering Methods ◽

Short Text ◽

Short Text Clustering ◽

Application Requirements

Text clustering is a popular research topic in the field of text mining, and there are now many text clustering methods catering to different application requirements. Currently, Weibo data is acquired through the APIs provided by major microblogging platforms. In this paper, we discuss algorithms for extracting popular topics posted by Weibo users through text clustering after massive data collection. Because traditional text analysis may not be applicable to the short texts used on Weibo, text clustering is carried out by combining multiple posts into long texts based on their features (forwards, comments, followers, etc.). Either frequency-based or density-based short text clustering works in most cases. The former is suited to finding hot topics in large collections of Weibo short texts, and the latter to finding abnormal content. Both methods use semantic information to improve clustering accuracy, and both improve clustering performance through parallelism.

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
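
The general idea of greedy, seed-based heuristic sequence clustering can be sketched as follows. This is not edClust itself: it substitutes a plain dynamic-programming global edit distance for Edlib's semi-global alignment, and the longest-first ordering, the `max_dist` threshold, and the sample sequences are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    """Assign each sequence to the first seed within max_dist, else make it a seed."""
    seeds, clusters = [], []
    for s in sorted(seqs, key=len, reverse=True):  # longest sequences first
        for i, seed in enumerate(seeds):
            if edit_distance(s, seed) <= max_dist:
                clusters[i].append(s)
                break
        else:
            seeds.append(s)
            clusters.append([s])
    return clusters


# Hypothetical reads: two groups that differ internally by one substitution.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(reads, max_dist=1)
```

Because each sequence is compared only against the current seeds rather than against every other sequence, the cost grows with the number of clusters instead of quadratically with the number of sequences, which is the low computational complexity the abstract attributes to heuristic methods.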
