Semi-Supervised Point Prototype Clustering

This paper describes a class of models we call semi-supervised clustering. Algorithms in this category are clustering methods that use information possessed by labeled training data Xd ⊂ ℜ^p as well as structural information that resides in the unlabeled data Xu ⊂ ℜ^p. The labels are used in conjunction with the unlabeled data to help clustering algorithms partition Xu ⊂ ℜ^p; the algorithms then terminate without the capability to label other points in ℜ^p. This is very different from supervised learning, wherein the training data subsequently endow a classifier with the ability to label every point in ℜ^p. The methodology is applicable in domains such as image segmentation, where users may have a small set of labeled data and can use it to semi-supervise classification of the remaining pixels in a single image. The model can be used with many different point prototype clustering algorithms. We illustrate how to attach it to a particular algorithm (fuzzy c-means). Then we give two numerical examples to show that it overcomes the failure of many point prototype clustering schemes when confronted with data that possess overlapping and/or nonuniformly distributed clusters. Finally, the new method compares favorably to the fully supervised k nearest neighbor rule when applied to the Iris data.


Editing training data for multi-label classification with the k-nearest neighbor rule

Pattern Analysis and Applications ◽

10.1007/s10044-015-0452-8 ◽

2015 ◽

Vol 19 (1) ◽

pp. 145-161 ◽

Cited By ~ 29

Author(s):

Sawsan Kanj ◽

Fahed Abdallah ◽

Thierry Denœux ◽

Kifah Tout

Keyword(s):

Nearest Neighbor ◽

Training Data ◽

K Nearest Neighbor ◽

Nearest Neighbor Rule


Evaluating single-cell cluster stability using the Jaccard similarity index

Bioinformatics ◽

10.1093/bioinformatics/btaa956 ◽

2020 ◽

Author(s):

Ming Tang ◽

Yasin Kaymaz ◽

Brandon L Logeman ◽

Stephen Eichhorn ◽

Zhengzheng S Liang ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Similarity Index ◽

Cell Types ◽

R Package ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Jaccard Similarity ◽

Cluster Stability

Motivation: One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices, and can produce different clustering solutions with even small changes in the number of principal components used, the k nearest neighbor and the resolution parameters, among others. Results: Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat, estimating cluster stability using the Jaccard similarity index, and providing rich visualizations. Availability and implementation: R package scclusteval: https://github.com/crazyhottommy/scclusteval Snakemake workflow: https://github.com/crazyhottommy/pyflow_seuratv3_parameter Tutorial: https://crazyhottommy.github.io/EvaluateSingleCellClustering/.
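As a rough illustration of the stability measure named in this abstract, the following Python sketch scores each original cluster by its best Jaccard match among the clusters found in one subsample. It is not the scclusteval implementation (which is an R package); the function names, the toy cell IDs, and the choice to restrict each original cluster to the subsampled cells before comparing are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of cell IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_jaccard(original_cluster, subsample_clusters, subsample_cells):
    """Highest Jaccard index between one original cluster and any cluster
    obtained by re-clustering a subsample.

    Assumption for this sketch: the original cluster is first restricted to
    the cells actually drawn in the subsample, so that cells left out of the
    subsample do not count against it.
    """
    kept = set(original_cluster) & set(subsample_cells)
    return max((jaccard(kept, c) for c in subsample_clusters), default=0.0)


# Toy data (hypothetical): six cells, two original clusters.
original = {"A": ["c1", "c2", "c3"], "B": ["c4", "c5", "c6"]}
# One hypothetical subsample that dropped c3 and was re-clustered.
subsample_cells = ["c1", "c2", "c4", "c5", "c6"]
reclustered = [["c1", "c2", "c4"], ["c5", "c6"]]

scores = {name: best_match_jaccard(cells, reclustered, subsample_cells)
          for name, cells in original.items()}
print(scores)
```

In a full stability analysis this score would be computed over many subsamples and summarized per cluster; a cluster whose best-match Jaccard index stays high across subsamples would be considered stable.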


Combined Clustering Methods for Microarray Data Analysis

Advanced Engineering Forum ◽

10.4028/www.scientific.net/aef.8-9.508 ◽

2013 ◽

Vol 8-9 ◽

pp. 508-515

Author(s):

Raul Malutan ◽

Pedro Gómez Vilda ◽

Monica Borda

Keyword(s):

Supervised Classification ◽

Nearest Neighbor ◽

Training Data ◽

Microarray Data Analysis ◽

Support Vector ◽

Data Sets ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Gene Shaving

Data classification has an important role in analyzing high-dimensional data. In this paper, the Gene Shaving algorithm was used for a preliminary supervised classification; once the cluster information was obtained, the data were classified again with supervised algorithms such as Support Vector Machine (SVM) and k-Nearest Neighbor (k-NN) to obtain an optimal clustering. These algorithms have proven to be useful when the classes of the training data and the attributes of each class are well established. The algorithms were run on several data sets, and the quality of the obtained clusters was observed to depend on the number of clusters specified.


A completely parameter-free method for graph-based single cell RNA-seq clustering

10.1101/2021.07.15.452521 ◽

2021 ◽

Author(s):

Maryam Zand ◽

Jianhua Ruan

Keyword(s):

Single Cell ◽

Cell Population ◽

Nearest Neighbor ◽

Expression Profiles ◽

Clustering Algorithms ◽

Cell Types ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Synthetic Datasets ◽

Almost All

Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single-cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data, as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut to a large number of real and synthetic datasets demonstrated that scQcut, which does not require any user-tuned parameters, outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method to recover the final community structure of the cell population.
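To make the role of k concrete, here is a minimal Python sketch of brute-force KNN graph construction on toy 2-D points. It does not reproduce scQcut or its topology-based criterion; the function names and the six sample points are hypothetical, and real scRNAseq pipelines would work on reduced expression profiles with an approximate neighbor search.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbors
    (Euclidean distance, brute force)."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


def undirected_edges(graph):
    """Collapse the directed neighbor lists into a set of undirected edges."""
    return {frozenset((i, j)) for i, nbrs in graph.items() for j in nbrs}


# Two well-separated groups of three points each (indices 0-2 and 3-5).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

for k in (1, 2, 3):
    edges = undirected_edges(knn_graph(points, k))
    # An edge "bridges" the groups if it joins an index below 3 to one at or above 3.
    bridging = [e for e in edges if min(e) < 3 <= max(e)]
    print(f"k={k}: {len(edges)} edges, {len(bridging)} bridging the two groups")
```

With only two other points in each group, any k above 2 forces every point to link to the other group, which is the kind of change in graph topology that makes the choice of k matter for the communities a clustering step can later recover.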


DRSA: a non-hierarchical clustering algorithm using k-NN graph and its application in vegetation classification

Vegetation of Russia ◽

10.31111/vegrus/2015.27.125 ◽

2015 ◽

pp. 125-138 ◽

Cited By ~ 2

Author(s):

I. V. Goncharenko

Keyword(s):

Cluster Analysis ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Protein Structures ◽

Hierarchical Cluster ◽

Vegetation Classification ◽

K Nearest Neighbor ◽

Neighbor Graph ◽

Nearest Neighbor Graph

In this article we propose a new method of non-hierarchical cluster analysis using a k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later the term “k-NN graph” and a few algorithms of k-NN clustering appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an “excessive” graph, a so-called hypergraph, and then truncate it to subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, “upward” clustering, which forms (assembles sequentially) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.


k-Nearest Neighbor Learning with Graph Neural Networks

Mathematics ◽

10.3390/math9080830 ◽

2021 ◽

Vol 9 (8) ◽

pp. 830

Author(s):

Seokho Kang

Keyword(s):

Neural Network ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Weighting Function ◽

High Sensitivity ◽

Training Data ◽

K Nearest Neighbor ◽

Main Challenge ◽

Benchmark Datasets ◽

Graph Neural Networks

k-nearest neighbor (kNN) is a widely used learning algorithm for supervised learning tasks. In practice, the main challenge when using kNN is its high sensitivity to its hyperparameter setting, including the number of nearest neighbors k, the distance function, and the weighting function. To improve the robustness to hyperparameters, this study presents a novel kNN learning method based on a graph neural network, named kNNGNN. Given training data, the method learns a task-specific kNN rule in an end-to-end fashion by means of a graph neural network that takes the kNN graph of an instance to predict the label of the instance. The distance and weighting functions are implicitly embedded within the graph neural network. For a query instance, the prediction is obtained by performing a kNN search from the training data to create a kNN graph and passing it through the graph neural network. The effectiveness of the proposed method is demonstrated using various benchmark datasets for classification and regression tasks.
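The three hyperparameters listed in this abstract (k, the distance function, and the weighting function) can be seen in a plain kNN classifier. The Python sketch below is a conventional weighted-vote kNN, not the kNNGNN method; the function name `knn_predict`, its parameters, and the toy training set are assumptions made for illustration.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k=3, distance=math.dist, weight=lambda d: 1.0):
    """Classify `query` by a weighted vote among its k nearest training points.

    train    -- list of (feature_tuple, label) pairs
    distance -- callable returning the distance between two feature tuples
    weight   -- callable mapping a neighbor's distance to its vote weight
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        votes[label] += weight(distance(features, query))
    return max(votes, key=votes.get)


# Toy training set (hypothetical): one "a" point close to the query, two "b" points farther away.
train = [((0.0, 0.0), "a"), ((4.0, 0.0), "b"), ((5.0, 0.0), "b")]
query = (1.0, 0.0)

uniform = knn_predict(train, query, k=3)                                 # every neighbor votes equally
inverse = knn_predict(train, query, k=3, weight=lambda d: 1.0 / (d + 1e-9))  # closer neighbors count more
nearest = knn_predict(train, query, k=1)                                 # only the single nearest neighbor
print(uniform, inverse, nearest)
```

On this toy set, uniform voting with k=3 and inverse-distance voting with the same k disagree on the label, which illustrates the hyperparameter sensitivity the abstract describes.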


An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering for datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of DP as the neighborhood radius eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance of the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity, and outperforms other algorithms in parameter setting and noise recognition.


A novel content based image retrieval system using K-means/KNN with feature extraction

Computer Science and Information Systems ◽

10.2298/csis120122047c ◽

2012 ◽

Vol 9 (4) ◽

pp. 1645-1661 ◽

Cited By ~ 10

Author(s):

Ray-I Chang ◽

Shu-Yu Lin ◽

Jan-Ming Ho ◽

Chi-Wen Fann ◽

Yu-Chun Wang

Keyword(s):

Feature Extraction ◽

Image Retrieval ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Content Based Image Retrieval ◽

K Nearest Neighbor ◽

Color Analysis ◽

Image Retrieval System ◽

First Time ◽

System Designs

Image retrieval has been popular for several years, and there are different system designs for content-based image retrieval (CBIR). This paper proposes a novel system architecture for CBIR that combines techniques including content-based image and color analysis as well as data mining. To the best of our knowledge, this is the first time that a segmentation and grid module, a feature extraction module, and K-means and k-nearest neighbor clustering algorithms have been proposed together with a neighborhood module to build a CBIR system. The concept of a neighborhood color analysis module, which also recognizes the sides of every grid of the image, is first contributed in this paper. The results show that the CBIR system performs well in training, and they also indicate that many issues of interest remain to be optimized in the query stage of image retrieval.


Android Malware Detection using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1011.0982s1219 ◽

2020 ◽

Vol 8 (2S12) ◽

pp. 65-70

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Machine Learning Algorithms ◽

Training Data ◽

Machine Learning Techniques ◽

Support Vector ◽

K Nearest Neighbor ◽

User Interest ◽

Android Malware ◽

Android Malware Detection

Machine learning is empowering many aspects of day-to-day life, from filtering content on social networks to suggesting products that we may be looking for. This technology focuses on taking objects as image input to find new observations or show items based on user interest. The main discussion here concerns supervised machine learning techniques, in which the computer learns from the input (training) data and predicts results based on experience. We also discuss the machine learning algorithms Naïve Bayes Classifier, K-Nearest Neighbor, Random Forest, Decision Trees, Boosted Trees, and Support Vector Machine, and apply these classifiers to the Malgenome and Drebin Android malware datasets. Android is an operating system that is gaining popularity, and with the rising demand for these devices comes a rise in Android malware. The traditional techniques used to detect malware were unable to detect unknown applications. We ran these datasets on different machine learning classifiers and recorded the results. The experimental results provide a comparative analysis based on performance, accuracy, and cost.
