A new affinity matrix weighted k-nearest neighbors graph to improve spectral clustering accuracy

Researchers have explored clustering approaches that combine traditional clustering methods with deep learning techniques, and these approaches usually improve clustering performance. Extracting knowledge from large data sets is an interesting task, for which we use dimensionality reduction and clustering techniques. Spectral clustering has recently gained popularity because of its performance, and numerous techniques have been introduced to improve it further. One of the most significant parts of these techniques is the construction of a similarity graph. We introduce a weighted k-nearest neighbors technique for constructing the similarity graph. Using this new metric to construct the affinity matrix, we achieved good results on both real and artificial data sets.
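
The abstract does not state how the neighbor weights are computed, so the following is only a minimal sketch of one common way to build a weighted k-nearest neighbors affinity matrix. The Gaussian weighting, the `sigma` parameter, the symmetrization rule, and the toy points are all assumptions made for illustration, not the authors' metric.

```python
import math


def weighted_knn_affinity(points, k=2, sigma=1.0):
    """Build a symmetric k-nearest-neighbor affinity matrix.

    Assumption: edges carry Gaussian weights exp(-d^2 / (2 * sigma^2)).
    The paper's actual weighting scheme is not given in the abstract.
    """
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = math.exp(-(d * d) / (2.0 * sigma * sigma))
            # Symmetrize: keep an edge if either endpoint selects the other.
            affinity[i][j] = max(affinity[i][j], w)
            affinity[j][i] = max(affinity[j][i], w)
    return affinity


# Hypothetical toy data: two well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
A = weighted_knn_affinity(points, k=2, sigma=1.0)
```

With these toy points, every nonzero entry of `A` connects two points in the same group, which is the block structure spectral clustering relies on when it takes the eigenvectors of the graph Laplacian built from the affinity matrix.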

Machine learning in diachronic corpus phonology: mining verse data to infer trajectories in English phonotactics

Papers in Historical Phonology ◽

10.2218/pihph.3.2018.2878 ◽

2018 ◽

Vol 3 ◽

Author(s):

Andreas Baumann

Keyword(s):

Machine Learning ◽

Middle English ◽

Large Data ◽

Large Data Sets ◽

Machine Learning Techniques ◽

Data Sets ◽

Powerful Method ◽

K Nearest Neighbors ◽

Learning Techniques ◽

Standard Techniques

Machine learning is a powerful method when working with large data sets such as diachronic corpora. However, as opposed to standard techniques from inferential statistics like regression modeling, machine learning is less commonly used among phonological corpus linguists. This paper discusses three different machine learning techniques (K nearest neighbors classifiers; Naïve Bayes classifiers; artificial neural networks) and how they can be applied to diachronic corpus data to address specific phonological questions. To illustrate the methodology, I investigate Middle English schwa deletion and when and how it potentially triggered reduction of final /mb/ clusters in English.
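
As a rough illustration of the first of the three techniques named above, a k-nearest neighbors classifier can be sketched in a few lines. The feature encoding (a normalized date and a word length), the labels, and the data points below are hypothetical and are not taken from the paper's corpus.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy features (not from the paper): (normalized date, word length).
train = [
    ((0.1, 2.0), "schwa"),
    ((0.2, 3.0), "schwa"),
    ((0.3, 2.0), "schwa"),
    ((0.8, 2.0), "no_schwa"),
    ((0.9, 3.0), "no_schwa"),
    ((1.0, 2.0), "no_schwa"),
]
early = knn_predict(train, (0.15, 2.0), k=3)
late = knn_predict(train, (0.95, 2.0), k=3)
```

In this toy setup an early-dated query is assigned to the "schwa" class and a late-dated one to "no_schwa", because two of the three nearest neighbors carry that label in each case.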

A Reformed K-Nearest Neighbors Algorithm for Big Data Sets

Journal of Computer Science ◽

10.3844/jcssp.2018.1213.1225 ◽

2018 ◽

Vol 14 (9) ◽

pp. 1213-1225 ◽

Cited By ~ 2

Author(s):

Vo Ngoc Phu ◽

Vo Thi Ngoc Tran

Keyword(s):

Big Data ◽

Nearest Neighbors ◽

Data Sets ◽

K Nearest Neighbors

clusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences

10.1101/2021.02.22.432291 ◽

2021 ◽

Author(s):

Sebastiaan Valkiers ◽

Max Van Houcke ◽

Kris Laukens ◽

Pieter Meysman

Keyword(s):

T Cell ◽

Large Data ◽

Cell Receptor ◽

Amino Acid Sequences ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Link Type ◽

Large Sets ◽

Similar Accuracy

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As of yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To address this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing. clusTCR was written in Python 3. It is available as an Anaconda package (https://anaconda.org/svalkiers/clustcr) and on GitHub (https://github.com/svalkiers/clusTCR).
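
The abstract does not describe how the sequence hashing works, so the sketch below shows only a generic hashing technique for similarity search and should not be read as clusTCR's implementation or API. It finds all pairs of sequences that differ at exactly one position by hashing each sequence once per masked position, which avoids comparing every pair directly. The function name and the example CDR3-like strings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming_one_pairs(sequences):
    """Return pairs of sequences that differ at exactly one position."""
    buckets = defaultdict(set)
    for seq in set(sequences):
        for i in range(len(seq)):
            # Mask position i; distinct sequences sharing this key
            # have the same length and differ only at position i.
            buckets[(i, seq[:i], seq[i + 1:])].add(seq)
    pairs = set()
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical example sequences, chosen only to illustrate the lookup.
cdr3 = ["CASSLGF", "CASSLGY", "CASSPGF", "CAWSVQF"]
pairs = hamming_one_pairs(cdr3)
```

Each sequence of length L produces L hash keys, so the work grows roughly linearly with the number of sequences rather than quadratically, apart from the pairs inside each bucket. The resulting pairs can then serve as edges of a graph whose connected components form clusters.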

Deep Learning Approaches for Sentiment Analysis Challenges and Future Issues

10.4018/978-1-7998-8161-2.ch003 ◽

2022 ◽

pp. 27-50

Author(s):

Rajalaxmi Prabhu B. ◽

Seema S.

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Model Building ◽

Large Data ◽

Machine Learning Algorithms ◽

Large Data Sets ◽

Data Sets ◽

Learning Approaches ◽

Learning Techniques ◽

Important Challenge

A large amount of user-generated data is now available from major platforms, blogs, websites, and review sites. These data are usually unstructured, and analyzing sentiment from them automatically is considered an important challenge. Several machine learning algorithms have been implemented to extract opinions from large data sets, and much research has gone into understanding machine learning approaches to sentiment analysis. Machine learning depends mainly on the data used for model building, so suitable feature extraction techniques also need to be applied. This chapter addresses several deep learning approaches, their challenges, and future issues. Deep learning techniques are considered important in predicting the sentiments of users. The chapter aims to analyze deep learning techniques for predicting sentiment and to explain the importance of several approaches for mining opinions and determining sentiment polarity.

SPECTRAL CLUSTERING ON GENE EXPRESSION PROFILE TO IDENTIFY CANCER TYPES OR SUBTYPES

Jurnal Teknologi ◽

10.11113/jt.v76.4036 ◽

2015 ◽

Vol 76 (1) ◽

Author(s):

Ang Jun Chin ◽

Andri Mirzal ◽

Habibollah Haron

Keyword(s):

Gene Expression ◽

Gene Expression Profile ◽

Expression Profile ◽

Microarray Data ◽

Spectral Clustering ◽

Data Sets ◽

Clustering Methods ◽

Microarray Gene Expression ◽

Cancer Types ◽

Microarray Gene

Gene expression profiling is well known for its broad applications and achievements in disease discovery and analysis, especially in cancer research. Spectral clustering is robust to irrelevant features, which makes it appropriate for gene expression analysis. However, previous works offer only limited performance comparisons with other clustering methods, and only a few microarray data sets were analyzed in each study. In this study, we demonstrate the use of spectral clustering to identify cancer types or subtypes from microarray gene expression profiles. Spectral clustering was applied to eleven microarray data sets, and its clustering performance was compared with results in the literature. Overall, spectral clustering slightly outperformed the corresponding results in the literature. It also offers more stable clustering performance, as shown by its smaller standard deviation. Moreover, spectral clustering outperformed the corresponding methods in the literature on six of the eleven data sets. Spectral clustering is therefore a promising method for identifying cancer types or subtypes in microarray gene expression data sets.

AN ASSESSMENT OF A METRIC SPACE DATABASE INDEX TO SUPPORT SEQUENCE HOMOLOGY

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213005002430 ◽

2005 ◽

Vol 14 (05) ◽

pp. 867-885 ◽

Cited By ~ 9

Author(s):

RUI MAO ◽

WEIJIA XU ◽

NEHA SINGH ◽

DANIEL P. MIRANKER

Keyword(s):

Metric Space ◽

Sequence Data ◽

Large Data ◽

Peptide Sequence ◽

Data Sets ◽

Clustering Methods ◽

Storage And Retrieval ◽

Database Index ◽

Bulk Load ◽

Scalable Database

Hierarchical metric-space clustering methods have been commonly used to organize proteomes into taxonomies. Consequently, it is often anticipated that hierarchical clustering can be leveraged as a basis for scalable database index structures capable of managing the hyper-exponential growth of sequence data. M-tree is one such data structure specialized for the management of large data sets on disk. We explore the application of M-trees to the storage and retrieval of peptide sequence data. Exploiting a technique first suggested by Myers, we organize the database as records of fixed-length substrings. Empirical results are promising. However, metric-space indexes are subject to "the curse of dimensionality", and the ultimate performance of an index is sensitive to the quality of its initial construction. We introduce a new hierarchical bulk-load algorithm that alternates between top-down and bottom-up clustering to initialize the index. Using the Yeast Proteomes, the bi-directional bulk load produces a more effective index than the existing M-tree initialization algorithms.

Data Mining Techniques for Identification and Classification of Various Diseases in Plants

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1110.1292s19 ◽

2019 ◽

Vol 9 (2S) ◽

pp. 676-680

Keyword(s):

Neural Network ◽

Data Mining ◽

Nearest Neighbors ◽

Crop Productivity ◽

Vital Role ◽

Support Vector ◽

Data Sets ◽

K Nearest Neighbors ◽

Data Mining Techniques

Data mining is currently used in various applications and plays a vital role in the research community. This paper describes data mining techniques for the preprocessing and classification of various diseases in plants. Because different plants have different diseases, each has different data sets and different objectives for knowledge discovery. Applying data mining techniques to plants helps in the segmentation and classification of diseased plants, avoids oral inspection, and helps to increase crop productivity. This paper reviews various classification techniques such as K-Nearest Neighbors, Support Vector Machine, Principal Component Analysis, and Neural Network. Among these techniques, the neural network is effective for disease detection in plants.

The Effect of Weather in Soccer Results: An Approach Using Machine Learning Techniques

Applied Sciences ◽

10.3390/app10196750 ◽

2020 ◽

Vol 10 (19) ◽

pp. 6750

Author(s):

Ditsuhi Iskandaryan ◽

Francisco Ramos ◽

Denny Asarias Palinggi ◽

Sergio Trilles

Keyword(s):

Support Vector Machine ◽

Nearest Neighbors ◽

Research Community ◽

Machine Learning Techniques ◽

Weather Data ◽

Support Vector ◽

K Nearest Neighbors ◽

Task Support ◽

Extremely Randomized Trees ◽

Learning Techniques

The growing popularity of soccer has led to the prediction of match results becoming of interest to the research community. The aim of this research is to detect the effects of weather on the result of matches by implementing Random Forest, Support Vector Machine, K-Nearest Neighbors Algorithm, and Extremely Randomized Trees Classifier. The analysis was executed using the Spanish La Liga and Segunda division from the seasons 2013–2014 to 2017–2018 in combination with weather data. Two tasks were proposed as part of this study: the first was to find out whether the game will end in a draw, a win by the hosts or a victory by the guests, and the second was to determine whether the match will end in a draw or if one of the teams will win. The results show that, for the first task, Extremely Randomized Trees Classifier is a better method, with an accuracy of 65.9%, and, for the second task, Support Vector Machine yielded better results with an accuracy of 79.3%. Moreover, it is possible to predict whether the game will end in a draw or not with 0.85 AUC-ROC. Additionally, for comparative purposes, the analysis was also performed without weather data.

A FAST IMPLEMENTATION OF THE ISODATA CLUSTERING ALGORITHM

International Journal of Computational Geometry & Applications ◽

10.1142/s0218195907002252 ◽

2007 ◽

Vol 17 (01) ◽

pp. 71-103 ◽

Cited By ~ 93

Author(s):

NARGESS MEMARSADEGHI ◽

DAVID M. MOUNT ◽

NATHAN S. NETANYAHU ◽

JACQUELINE LE MOIGNE

Keyword(s):

Clustering Algorithm ◽

Empirical Studies ◽

Synthetic Data ◽

Large Data ◽

Large Data Sets ◽

Cluster Center ◽

Data Sets ◽

Clustering Methods ◽

Sensing Applications ◽

Remote Sensing Applications

Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.

Unsupervised Feature Selection Based on Spectral Clustering with Maximum Relevancy and Minimum Redundancy Approach

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001421500312 ◽

2021 ◽

Vol 35 (11) ◽

pp. 2150031

Author(s):

Bahareh Khozaei ◽

Mahdi Eftekhari

Keyword(s):

Feature Selection ◽

Spectral Clustering ◽

Information Gain ◽

State Of The Art ◽

Nearest Neighbors ◽

Data Sets ◽

Unsupervised Feature Selection ◽

Significant Difference ◽

Cluster A ◽

Novel Approaches

In this paper, two novel approaches for unsupervised feature selection are proposed based on spectral clustering. In the first proposed method, spectral clustering is applied to the features, and the cluster centers are selected together with their nearest neighbors. These features have minimum similarity (redundancy) among themselves since they belong to different clusters. Next, the samples of the data set are clustered with spectral clustering so that a specific pseudo-label is assigned to the samples of each cluster. Then, based on the obtained pseudo-labels, the information gain of the features is computed, which secures maximum relevancy. Finally, the intersection of the features selected in the two previous steps is determined, which simultaneously guarantees both maximum relevancy and minimum redundancy. Our second proposed approach is very similar to the first; its only, but significant, difference is that it selects one feature from each cluster and sorts all the features by relevancy. Then, by appending the selected features to a sorted list and ignoring them in the next step, the algorithm continues with the remaining features until all the features have been appended to the sorted list. Both of our proposed methods are compared with state-of-the-art methods, and the obtained results confirm the performance of our proposed approaches, especially the second one.
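
The relevancy step described above scores each feature by its information gain with respect to the pseudo-labels. A minimal sketch of that computation for discrete feature values follows. The function names and the toy data are illustrative assumptions, and the abstract does not say how continuous features are handled (for example, whether they are discretized first).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, pseudo_labels):
    """Entropy of the pseudo-labels minus their entropy given the feature."""
    total = len(pseudo_labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [
            label
            for f, label in zip(feature_values, pseudo_labels)
            if f == value
        ]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(pseudo_labels) - conditional


# Hypothetical pseudo-labels from a clustering step, and two toy features.
pseudo_labels = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # matches the clusters exactly
uninformative = ["a", "b", "a", "b"]  # independent of the clusters
gain_informative = information_gain(informative, pseudo_labels)
gain_uninformative = information_gain(uninformative, pseudo_labels)
```

A feature that separates the pseudo-labeled clusters receives a high gain and a feature unrelated to them receives a gain near zero, so ranking by this score favors features that are relevant to the cluster structure.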
