Review and compare clustering algorithms for navigation data analysis tasks

Much of the data collected in sport science comes from time-dependent phenomena. This article focuses on Functional Data Analysis (FDA), which studies longitudinal data by modelling them as continuous functions. After a brief review of several FDA methods, some useful practical tools such as Functional Principal Component Analysis (FPCA) and functional clustering algorithms are presented and compared on simulated data. Finally, the problem of detecting promising young swimmers is addressed through a curve clustering procedure on a real data set of performance progression curves. This study reveals that the fastest improvement of young swimmers generally occurs before the age of 16. Moreover, several patterns of improvement are identified, and the functional clustering procedure provides a useful detection tool.


Clustering Algorithms in Gene Expression: Data Analysis

10.1109/icrito51393.2021.9596549 ◽

2021 ◽

Author(s):

Karuna Ghai ◽

Jaspreet Singh

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Gene Expression Data ◽

Clustering Algorithms ◽

Expression Data ◽

Gene Expression Data Analysis


Application and visualization of typical clustering algorithms in seismic data analysis

Procedia Computer Science ◽

10.1016/j.procs.2019.04.026 ◽

2019 ◽

Vol 151 ◽

pp. 171-178

Author(s):

Z. Fan ◽

X. Xu

Keyword(s):

Data Analysis ◽

Seismic Data ◽

Clustering Algorithms


Software implementation of the main cluster analysis tools

Revista Amazonia Investiga ◽

10.34069/ai/2021.47.11.9 ◽

2021 ◽

Vol 10 (47) ◽

pp. 81-92

Author(s):

Andrey V. Silin ◽

Olga N. Grinyuk ◽

Tatyana A. Lartseva ◽

Olga V. Aleksashina ◽

Tatiana S. Sukhova

Keyword(s):

Cluster Analysis ◽

Data Analysis ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Point Of View ◽

Software Implementation ◽

Practical Significance ◽

Data Set ◽

Main Cluster ◽

Analysis Tools

This article discusses an approach to building a suite of programs that implement cluster analysis methods. A number of cluster analysis tools for processing the initial data set are analyzed, together with their software implementation and the complexity of applying cluster analysis to data. Data are treated as the factual material that supplies information for the problem under study and forms the basis for discussion, analysis and decision-making. Cluster analysis is a procedure that combines objects or variables into groups based on a given rule. The work groups multivariate data using proximity measures such as the sample correlation coefficient and its absolute value, the cosine of the angle between vectors, and the Euclidean distance. The authors propose methods for grouping by centers, by the nearest neighbor and by selected standards. The results can be used by analysts when designing a data analysis structure and will improve the efficiency of clustering algorithms. The practical significance of the developed algorithms is expressed in the software package written in C++ in the VS environment.
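
The proximity measures named in this abstract can be illustrated with a minimal Python sketch. This is not the authors' C++ package; the function names `euclidean`, `cosine`, `correlation` and `nearest_center` are hypothetical and chosen only for this example, and the inputs are assumed to be equal-length, non-constant numeric sequences.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    """Sample correlation coefficient: the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

def nearest_center(point, centers):
    """Index of the center closest to `point` in Euclidean distance (grouping by centers)."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(abs(correlation([1, 2, 3], [3, 2, 1])))
    print(nearest_center([1, 1], [[0, 0], [10, 10]]))
```

Taking the absolute value of the correlation, as in the second call, treats strongly anti-correlated variables as close, which is one plausible reason to use it as a separate proximity measure.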


A Comparison of K-Means and Mean Shift Algorithms

10.20944/preprints202108.0140.v1 ◽

2021 ◽

Author(s):

Mehak Nigar Shumaila

Keyword(s):

Cluster Analysis ◽

Data Analysis ◽

Time Complexity ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Mean Shift ◽

Prediction Performance ◽

Learning Problem ◽

Cluster A ◽

Formation Of Groups

Clustering, also known as cluster analysis, is a learning problem that takes place without human supervision. The technique is widely and effectively used in data analysis to observe and identify interesting, useful, or desired patterns in the data. It works by dividing the data into groups of similar objects based on the characteristics it identifies. Each group formed in this way is called a cluster. A cluster consists of objects that are similar to the other objects in the same cluster and dissimilar to objects in other clusters. Clustering is significant in many aspects of data analysis because it reveals the intrinsic grouping of objects, based on their attributes, in a batch of unlabeled raw data. No textbook criterion for a good clustering exists, because the process differs and is customized for every user and every need. There is no outright best clustering algorithm; the choice depends heavily on the user's scenario and needs. This paper compares two clustering algorithms, k-means and mean shift, according to the following factors: time complexity, training, prediction performance and accuracy.
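
The structural difference between the two algorithms can be sketched in a few lines of Python. This is a simplified one-dimensional illustration, not the paper's experimental setup: `kmeans_1d` and `mean_shift_1d` are hypothetical helper names, the mean shift uses a flat kernel, and the data, starting centers and bandwidth are assumptions made for the example.

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm in 1-D; the number of clusters is fixed by len(centers)."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

def mean_shift_1d(points, bandwidth, iters=50):
    """Flat-kernel mean shift in 1-D; the number of clusters follows from the bandwidth."""
    modes = list(points)
    for _ in range(iters):
        shifted = []
        for m in modes:
            # Shift each candidate mode to the mean of the points inside its window.
            window = [p for p in points if abs(p - m) <= bandwidth]
            shifted.append(sum(window) / len(window) if window else m)
        modes = shifted
    # Collapse candidate modes that ended up within one bandwidth of each other.
    merged = []
    for m in sorted(modes):
        if not merged or m - merged[-1] > bandwidth:
            merged.append(m)
    return merged

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    print(kmeans_1d(data, centers=[0.0, 6.0]))
    print(mean_shift_1d(data, bandwidth=1.0))
```

In this sketch k-means needs the number of clusters up front (through the starting centers), while mean shift needs only a bandwidth and visits every point for every candidate mode on each pass, which hints at why time complexity is one of the comparison factors.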


Subpopulation identification for single-cell RNA-sequencing data using functional data analysis

10.1101/760413 ◽

2019 ◽

Author(s):

Kyungmin Ahn ◽

Hironobu Fujiwara

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Single Cell ◽

Gene Expression Data ◽

Functional Data Analysis ◽

Functional Data ◽

Clustering Algorithms ◽

Expression Data ◽

Clustering Methods ◽

Single Cell Rna Sequencing

Background: In single-cell RNA-sequencing (scRNA-seq) data analysis, a number of statistical tools from multivariate data analysis (MDA) have been developed to help analyze gene expression data. The MDA approach typically examines genes as discrete genomic units and ignores the dependency between the data components. In this paper, we propose a functional data analysis (FDA) approach to scRNA-seq data in which each cell is treated as a single function. To avoid a large number of dropouts (values at or close to zero) and reduce the high dimensionality of the data, we first perform a principal component analysis (PCA) and assign the PCs to be the amplitude of the function. We then use the index of the PCs directly from PCA for the phase components. This approach allows us to apply FDA clustering methods to scRNA-seq data analysis. Results: To demonstrate the robustness of our method, we apply several existing FDA clustering algorithms to the gene expression data to improve the accuracy of cell-type classification relative to conventional MDA clustering methods. The FDA clustering algorithms achieve superior accuracy on simulated data as well as on real data such as human and mouse scRNA-seq data. Conclusions: This new statistical technique enhances classification performance and ultimately improves the understanding of stochastic biological processes. The framework provides an essentially different analytical approach to scRNA-seq data, which can complement conventional MDA methods. It can be especially effective when current MDA methods cannot detect or uncover the hidden functional nature of gene expression dynamics.


Clustering with Scikit-Learn in Python

The Programming Historian ◽

10.46430/phen0094 ◽

2021 ◽

Author(s):

Thomas Jurczyk

Keyword(s):

Data Analysis ◽

Exploratory Data Analysis ◽

Clustering Algorithms ◽

Use Cases ◽

Use Case ◽

Greco Roman ◽

Textual Data ◽

Exploratory Data ◽

Second Use

This tutorial demonstrates how to apply clustering algorithms with Python to a dataset with two concrete use cases. The first example uses clustering to identify meaningful groups of Greco-Roman authors based on their publications and their reception. The second use case applies clustering algorithms to textual data in order to discover thematic groups. After finishing this tutorial, you will be able to use clustering in Python with Scikit-learn applied to your own data, adding an invaluable method to your toolbox for exploratory data analysis.


LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis

10.1101/2021.12.24.474111 ◽

2021 ◽

Author(s):

Ezgi Ozkurt ◽

Joachim Fritscher ◽

Nicola Soranzo ◽

Duncan Y.K. Ng ◽

Robert P. Davey ◽

...

Keyword(s):

Data Analysis ◽

Clustering Algorithms ◽

Amplicon Sequencing ◽

Sequencing Analysis ◽

Alpha And Beta Diversity ◽

High Data ◽

Data Usage ◽

Long Read ◽

Cost Efficient ◽

User Friendly

Background: Amplicon sequencing is an established and cost-efficient method for profiling microbiomes. However, many available tools to process this data require both bioinformatics skills and high computational power to process big datasets. Furthermore, there are only a few tools that allow for long-read amplicon data analysis. To bridge this gap, we developed the LotuS2 (Less OTU Scripts 2) pipeline, enabling user-friendly, resource-friendly, and versatile analysis of raw amplicon sequences. Results: In LotuS2, six different sequence clustering algorithms as well as extensive pre- and post-processing options allow for flexible data analysis by both experts, where parameters can be fully adjusted, and novices, where defaults are provided for different scenarios. We benchmarked three independent gut and soil datasets, where LotuS2 was on average 29 times faster compared to other pipelines, yet could better reproduce the alpha- and beta-diversity of technical replicate samples. Further benchmarking a mock community with known taxa composition showed that, compared to the other pipelines, LotuS2 recovered a higher fraction of correctly identified genera and species (98% and 57%, respectively). At ASV/OTU level, precision and F-score were highest for LotuS2, as was the fraction of correctly reconstructed 16S sequences. Conclusion: LotuS2 is a lightweight and user-friendly pipeline that is fast, precise and streamlined. High data usage rates and reliability enable high-throughput microbiome analysis in minutes. Availability: LotuS2 is available from GitHub, conda or via a Galaxy web interface, documented at http://lotus2.earlham.ac.uk/.


Data Analysis Using Representation Theory and Clustering Algorithms

WSEAS TRANSACTIONS ON COMPUTERS ◽

10.37394/23205.2020.19.38 ◽

2021 ◽

Vol 19 ◽

pp. 310-320

Author(s):

Suboh Alkhushayni ◽

Taeyoung Choi ◽

Du’a Alzaleq

Keyword(s):

Data Analysis ◽

Random Forest ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Categorical Variables ◽

Common Disease ◽

Agglomerative Hierarchical Clustering ◽

Data Set

This work aims to expand knowledge in the area of data analysis through both persistent homology and representations of directed graphs. Specifically, we examined how homology cluster groups can be analyzed using agglomerative hierarchical clustering algorithms and methods. Additionally, the Wine data, which is available in R, was analyzed using several clustering algorithms, namely Hierarchical Clustering, K-Means Clustering, and PAM Clustering. The goal of the analysis was to find out which clustering method is appropriate for a given numerical data set. By testing the data, we sought the agglomerative hierarchical clustering method that would be optimal among three alternatives: K-Means, PAM, and Random Forest methods. By comparing each model's accuracy value with cultivar coefficients, we concluded that K-Means methods are the most helpful when working with numerical variables. On the other hand, PAM clustering and Gower with random forest are the most beneficial approaches when working with categorical variables. With proper analysis, all these tests can determine the optimal number of clusters for a given data set. The method can be applied in several areas, such as clinical and business settings. For example, in a clinical setting, patients can be grouped by common disease, required therapy, and other factors. In business, clusters can be formed based on marginal profit, marginal cost, or other economic indicators.
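
One common way to score cluster assignments against known class labels, such as the cultivar of each wine, is cluster purity. The abstract does not specify how its accuracy values were computed, so the Python sketch below is only an assumed stand-in; the function name `purity` and the toy labels are hypothetical.

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of items whose true label is the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    # For each cluster, count the items that carry its most frequent true label.
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(true_labels)

if __name__ == "__main__":
    cultivars = ["a", "a", "b", "b"]          # toy stand-in for known classes
    print(purity([0, 0, 1, 1], cultivars))    # clusters match the classes
    print(purity([0, 0, 0, 1], cultivars))    # one item lands in the wrong cluster
```

Because purity ignores how cluster ids are numbered, the same function can score the output of K-Means, PAM, or hierarchical clustering without relabelling.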

