Introduction to Clustering

Dynamic and Advanced Data Mining for Progressing Technological Development ◽

10.4018/978-1-60566-908-3.ch010 ◽

2010 ◽

pp. 224-254

Author(s):

Raymond Greenlaw ◽

Sanpawat Kantabutra

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Research Directions ◽

History Of ◽

Representative Points ◽

Parallel Clustering ◽

Extensive List

This chapter provides the reader with an introduction to clustering algorithms and applications. A number of important well-known clustering methods are surveyed. The authors present a brief history of the development of the field of clustering, discuss various types of clustering, and mention some of the current research directions in the field of clustering. Algorithms are described for top-down and bottom-up hierarchical clustering, as are algorithms for K-Means clustering and for K-Medians clustering. The technique of representative points is also presented. Given the large data sets involved with clustering, the need to apply parallel computing to clustering arises, so they discuss issues related to parallel clustering as well. Throughout the chapter references are provided to works that contain a large number of experimental results. A comparison of the various clustering methods is given in tabular format. They conclude the chapter with a summary and an extensive list of references.
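
As a rough illustration of the K-Means method named in this abstract, the sketch below alternates an assignment step and a center-update step until the centers stop moving. It is a minimal, generic version and not the chapter's own pseudocode; the function name `kmeans`, the seeded random initialization, and the toy points are assumptions made for this example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center (squared Euclidean).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
```

A K-Medians variant would replace the mean in the update step with a per-coordinate median and typically use the Manhattan distance in the assignment step.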


Survey of Clustering

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2013040101 ◽

2013 ◽

Vol 3 (2) ◽

pp. 1-29 ◽

Cited By ~ 3

Author(s):

Raymond Greenlaw ◽

Sanpawat Kantabutra

Keyword(s):

Parallel Computation ◽

Clustering Algorithms ◽

Data Sets ◽

Clustering Methods ◽

Top Down ◽

Research Directions ◽

History Of ◽

Representative Points ◽

Parallel Clustering ◽

Extensive List

This article is a survey of clustering applications and algorithms. A number of important well-known clustering methods are discussed. The authors present a brief history of the development of the field of clustering, discuss various types of clustering, and mention some of the current research directions in the field. More specifically, top-down and bottom-up hierarchical clustering are described. K-Means and K-Medians clustering algorithms are also shown. The concept of representative points is introduced and the technique of discovering them is presented. Immense data sets in clustering often necessitate parallel computation, so the authors discuss issues involving parallel clustering as well. Throughout the article, the authors provide references to works that contain a large number of experimental results. A table comparing various clustering methods is given at the end. The authors conclude the article with a summary and an extensive list of references, including some of the latest works in the field.
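
Bottom-up (agglomerative) hierarchical clustering, as mentioned in this abstract, can be sketched as follows: every point starts in its own cluster, and the two closest clusters are merged repeatedly. The single-linkage distance, the stopping rule based on a target cluster count, and the name `single_linkage` are choices made for this illustration, not details taken from the article.

```python
def single_linkage(points, num_clusters):
    """Bottom-up clustering: merge the closest pair of clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with one singleton cluster per point

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        # Find the indices of the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Toy data (hypothetical): two tight pairs and one outlier.
result = single_linkage([(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)], 3)
```

This naive version recomputes all pairwise linkages on every merge, so it is only suitable for small inputs; a top-down method would instead start from one cluster and split it recursively.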


clusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences

10.1101/2021.02.22.432291 ◽

2021 ◽

Author(s):

Sebastiaan Valkiers ◽

Max Van Houcke ◽

Kris Laukens ◽

Pieter Meysman

Keyword(s):

T Cell ◽

Large Data ◽

Cell Receptor ◽

Amino Acid Sequences ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Link Type ◽

Large Sets ◽

Similar Accuracy

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To address this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing. clusTCR was written in Python 3. It is available as an Anaconda package (https://anaconda.org/svalkiers/clustcr) and on GitHub (https://github.com/svalkiers/clusTCR).
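
To give a feel for how hashing can avoid comparing every pair of sequences, the sketch below groups equal-length sequences that differ at exactly one position by hashing each sequence with one position masked out, then joins such neighbours into clusters. This is a generic technique shown under stated assumptions: it is not the clusTCR API and is not claimed to be the algorithm clusTCR uses. The function names and the example sequences are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def hamming_one_pairs(sequences):
    """Return pairs of distinct sequences that differ at exactly one position."""
    buckets = defaultdict(set)
    for seq in set(sequences):
        for i in range(len(seq)):
            # Mask position i; sequences sharing this key can differ only at i.
            buckets[(i, seq[:i], seq[i + 1:])].add(seq)
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

def cluster_sequences(sequences):
    """Join Hamming-distance-1 neighbours into connected components (union-find)."""
    parent = {s: s for s in set(sequences)}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in hamming_one_pairs(sequences):
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for s in parent:
        groups[find(s)].append(s)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical CDR3-like strings, invented for this example.
toy = ["CASSLGTDTQYF", "CASSLGADTQYF", "CASSLGAETQYF", "CASSPGQGYEQYF"]
clusters = cluster_sequences(toy)
```

Each sequence of length n produces n hash keys, so the work grows roughly linearly with the number of sequences rather than quadratically, provided the buckets stay small.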


Uncertainty-Based Clustering Algorithms for Large Data Sets

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-2805-0.ch001 ◽

2018 ◽

pp. 1-33 ◽

Cited By ~ 1

Author(s):

B. K. Tripathy ◽

Hari Seetha ◽

M. N. Murty

Keyword(s):

Big Data ◽

Data Clustering ◽

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Mining Machine ◽

Data Sets ◽

Fuzzy C Means ◽

Intuitionistic Fuzzy ◽

New Algorithms

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed. These include fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models, such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data. Such algorithms are relatively few in comparison to those for data sets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
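
For readers unfamiliar with fuzzy c-means, the first algorithm listed in this abstract, the following is a minimal sketch of its standard alternating updates: each point receives a membership in every cluster based on relative distances, and each center becomes a membership-weighted mean. The function name, the caller-supplied initial centers, the fuzzifier value `m=2.0`, and the toy data are assumptions for this example; the rough, intuitionistic, and kernelised variants discussed in the chapter are not covered.

```python
def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Standard fuzzy c-means on equal-length numeric tuples; returns (centers, memberships)."""
    centers = list(initial_centers)
    c = len(centers)
    u = []
    for _ in range(iterations):
        # Membership update: u[j][i] = 1 / sum_k (d_ji / d_jk) ** (2 / (m - 1)).
        u = []
        for p in points:
            # Clamp distances away from zero so a point sitting on a center is handled.
            d = [max(sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5, 1e-12)
                 for ctr in centers]
            u.append([
                1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for i in range(c)
            ])
        # Center update: mean of the points weighted by membership ** m.
        centers = [
            tuple(
                sum(u[j][i] ** m * points[j][dim] for j in range(len(points)))
                / sum(u[j][i] ** m for j in range(len(points)))
                for dim in range(len(points[0]))
            )
            for i in range(c)
        ]
    return centers, u

# Toy data (hypothetical): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, memberships = fuzzy_c_means(points, [(0.0, 0.0), (11.0, 10.0)])
```

Unlike K-Means, every point keeps a graded membership in each cluster, and each row of `memberships` sums to 1.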


Low-Rank Matrix Factorization and Co-clustering Algorithms for Analyzing Large Data Sets

Lecture Notes in Computer Science - Data Engineering and Management ◽

10.1007/978-3-642-27872-3_41 ◽

2012 ◽

pp. 272-279 ◽

Cited By ~ 2

Author(s):

Archana Donavalli ◽

Manjeet Rege ◽

Xumin Liu ◽

Kourosh Jafari-Khouzani

Keyword(s):

Matrix Factorization ◽

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Low Rank ◽

Data Sets ◽

Rank Matrix ◽

Low Rank Matrix


Introduction

10.1093/acprof:oso/9780190602321.003.0001 ◽

2018 ◽

Author(s):

Helen K. Black ◽

John T. Groce ◽

Charles E. Harmon

Keyword(s):

African American ◽

Physical Health ◽

Family Caregivers ◽

Large Data ◽

Large Data Sets ◽

Care Work ◽

Advanced Technology ◽

Data Sets ◽

History Of ◽

Male Caregivers

Chapter One offers a brief history of the rise in awareness of the vast numbers of informal, family caregivers caring for aged, demented, and impaired loved ones in the home. The importance of informal caregivers to the healthcare system, both financially and emotionally, emerged in studies exploring the numbers of home caregivers and the nature of their care work. Early studies also focused on the sense of burden caregivers experienced due to caregiving. Since the 1980s, caregiving studies have been a constant in research, and have become increasingly complex in their use of large data sets and advanced technology to study the number of caregivers, their characteristics and labors, and the effects of caregiving on their emotional and physical health. Few studies have focused solely on the experience of caregiving among elderly African-American male caregivers, or have done so in the way we do here.


Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2007.70272 ◽

2009 ◽

Vol 6 (2) ◽

pp. 344-352 ◽

Cited By ~ 35

Author(s):

V. Olman ◽

Fenglou Mao ◽

Hongwei Wu ◽

Ying Xu

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Parallel Clustering


Empirical comparison of fast partitioning-based clustering algorithms for large data sets

Expert Systems with Applications ◽

10.1016/s0957-4174(02)00185-9 ◽

2003 ◽

Vol 24 (4) ◽

pp. 351-363 ◽

Cited By ~ 27

Author(s):

Chih-Ping Wei ◽

Yen-Hsien Lee ◽

Che-Ming Hsu

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Empirical Comparison


P-autoclass: scalable parallel clustering for mining large data sets

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2003.1198395 ◽

2003 ◽

Vol 15 (3) ◽

pp. 629-641 ◽

Cited By ~ 29

Author(s):

C. Pizzuti ◽

D. Talia

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Parallel Clustering


ANALYSIS OF COMMUNICATION OVERHEAD IN PARALLEL CLUSTERING OF LARGE DATA SETS WITH P-AUTOCLASS

Parallel Computing ◽

10.1142/9781860949630_0033 ◽

2002 ◽

Author(s):

STEFANO BASTA ◽

DOMENICO TALIA

Keyword(s):

Large Data ◽

Large Data Sets ◽

Communication Overhead ◽

Data Sets ◽

Parallel Clustering


A FAST IMPLEMENTATION OF THE ISODATA CLUSTERING ALGORITHM

International Journal of Computational Geometry & Applications ◽

10.1142/s0218195907002252 ◽

2007 ◽

Vol 17 (01) ◽

pp. 71-103 ◽

Cited By ~ 93

Author(s):

NARGESS MEMARSADEGHI ◽

DAVID M. MOUNT ◽

NATHAN S. NETANYAHU ◽

JACQUELINE LE MOIGNE

Keyword(s):

Clustering Algorithm ◽

Empirical Studies ◽

Synthetic Data ◽

Large Data ◽

Large Data Sets ◽

Cluster Center ◽

Data Sets ◽

Clustering Methods ◽

Sensing Applications ◽

Remote Sensing Applications

Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.
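
The abstract relies on a kd-tree to speed up finding the nearest cluster center. The sketch below shows only the generic building block: a kd-tree with a pruned nearest-neighbour query. For simplicity it builds the tree over the cluster centers and queries it with a point, whereas the paper stores the data points in the tree; that approach, and the paper's modified dispersion estimate, are not reproduced here. All names and the sample coordinates are hypothetical.

```python
def build_kdtree(items, depth=0):
    """Build a kd-tree as nested tuples (point, axis, left, right)."""
    if not items:
        return None
    axis = depth % len(items[0])  # cycle through the coordinate axes
    items = sorted(items, key=lambda p: p[axis])
    mid = len(items) // 2  # the median point splits the remaining items
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the tree point closest to query (squared Euclidean distance)."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or sq_dist(query, point) < sq_dist(query, best):
        best = point
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best so far.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

# Hypothetical cluster centers and one query point.
centers = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(centers)
closest = nearest(tree, (9.0, 2.0))
```

The pruning test is what saves work: whole subtrees are skipped whenever they cannot contain anything closer than the current best candidate.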
