scalable clustering Latest Research Papers

Selecting Clustering Algorithms for IBD Mapping

10.1101/2021.08.11.456036 ◽

2021 ◽

Author(s):

Ruhollah Shemirani ◽

Gillian M Belbin ◽

Keith Burghardt ◽

Kristina Lerman ◽

Christy L Avery ◽

...

Keyword(s):

Statistical Power ◽

Large Scale ◽

Clustering Algorithms ◽

False Negative ◽

Chromosome 1 ◽

Detection Methods ◽

Scalable Clustering ◽

Markov Clustering ◽

Cluster Properties ◽

Greedy Methods

Background: Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results: We simulated 3.4 million clusters across 850 experiments with varying cluster counts, false-positive, and false-negative rates. Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most of the graphs, compared to greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing 3 populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study with ~51,000 members and 2 million shared segments on Chromosome 1, resulting in the extraction of ~39 million local IBD clusters across three different populations in PAGE. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions: Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach, while reducing runtime by 3 orders of magnitude, making it computationally tractable in modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.
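For readers unfamiliar with Markov Clustering, the following is a minimal sketch of the textbook MCL iteration (alternating expansion and inflation on a column-stochastic matrix). It is not the authors' implementation; the function name `markov_clustering`, its parameters, and the toy graph are assumptions made for illustration, and a dense NumPy matrix would not scale to biobank-sized IBD graphs.

```python
import numpy as np

def markov_clustering(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy dense MCL: returns clusters as sorted tuples of node indices."""
    # Add self-loops, then normalize each column to sum to 1 (random-walk matrix).
    m = np.asarray(adjacency, dtype=float) + np.eye(len(adjacency))
    m /= m.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        prev = m
        # Expansion: take a matrix power, i.e. simulate longer random walks.
        m = np.linalg.matrix_power(m, expansion)
        # Inflation: element-wise power followed by column re-normalization,
        # which strengthens strong flows and weakens weak ones.
        m = m ** inflation
        m /= m.sum(axis=0, keepdims=True)
        if np.allclose(m, prev, atol=tol):
            break
    # Each row with non-zero mass lists the members of one cluster;
    # a set removes rows that describe the same cluster.
    clusters = set()
    for row in m:
        members = tuple(int(i) for i in np.flatnonzero(row > tol))
        if members:
            clusters.add(members)
    return sorted(clusters)

# Hypothetical toy graph: two disconnected triangles (nodes 0-2 and 3-5).
toy_graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_clustering(toy_graph))  # [(0, 1, 2), (3, 4, 5)]
```

In this sketch the inflation exponent controls granularity: larger values tend to produce more, smaller clusters.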

Scalable Clustering with Supervised Linkage Methods

10.1101/2021.08.01.454697 ◽

2021 ◽

Author(s):

James Anibal ◽

Alexandre Day ◽

Erol Bahadiroglu ◽

Liam O'Neill ◽

Long Phan ◽

...

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Biomedical Sciences ◽

New Approach ◽

Scalable Clustering ◽

Linkage Methods ◽

Density Clustering ◽

Cell Data ◽

Different Levels

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, with considerably better computational efficiency than existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single-cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/

Scalable clustering of segmented trajectories within a continuous time framework: application to maritime traffic data

Machine Learning ◽

10.1007/s10994-021-06004-8 ◽

2021 ◽

Author(s):

Pierre Gloaguen ◽

Laetitia Chapel ◽

Chloé Friguet ◽

Romain Tavenard

Keyword(s):

Continuous Time ◽

Traffic Data ◽

Scalable Clustering ◽

Maritime Traffic

Effective and Scalable Clustering on Massive Attributed Graphs

Proceedings of the Web Conference 2021 ◽

10.1145/3442381.3449875 ◽

2021 ◽

Author(s):

Renchi Yang ◽

Jieming Shi ◽

Yin Yang ◽

Keke Huang ◽

Shiqi Zhang ◽

...

Keyword(s):

Scalable Clustering ◽

Attributed Graphs

EMR: Scalable Clustering of Big HR Data using Evolutionary MapReduce

Companion Proceedings of the Web Conference 2021 ◽

10.1145/3442442.3453543 ◽

2021 ◽

Author(s):

Mahdi Bohlouli ◽

Zhonghua He

Keyword(s):

Scalable Clustering

HierCC: A multi-level clustering scheme for population assignments based on core genome MLST

Bioinformatics ◽

10.1093/bioinformatics/btab234 ◽

2021 ◽

Author(s):

Zhemin Zhou ◽

Jane Charlesworth ◽

Mark Achtman

Keyword(s):

Disease Surveillance ◽

Large Scale ◽

Core Genome ◽

Supplementary Information ◽

Source Codes ◽

Scalable Clustering ◽

Optimal Thresholds ◽

Population Structures ◽

Multi Level ◽

Level Cluster

Abstract. Motivation: Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present pHierCC, a pipeline that defines a scalable clustering scheme, HierCC, based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >530,000 genomes from Salmonella, Escherichia/Shigella, Streptococcus, Clostridioides, Vibrio and Yersinia. Availability and implementation: https://enterobase.warwick.ac.uk/ and Source codes and instructions: https://github.com/zheminzhou/pHierCC Supplementary information: Supplementary data are available at Bioinformatics online.
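To illustrate what multi-level cluster assignments can look like, here is a small sketch that groups allelic profiles by single-linkage clustering at several distance thresholds. This is an assumption-laden toy, not the pHierCC code: the function names, the pairwise O(n²) comparison, and the example profiles are all hypothetical, and the abstract does not specify the linkage rule used here.

```python
from itertools import combinations

def allelic_distance(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def multilevel_clusters(profiles, thresholds):
    """For each threshold, label every profile with the smallest index in its
    single-linkage cluster (profiles are linked if distance <= threshold)."""
    n = len(profiles)
    levels = {}
    for t in thresholds:
        parent = list(range(n))  # union-find forest, one tree per cluster

        def find(i):
            # Follow parent links to the root, compressing the path on the way.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in combinations(range(n), 2):
            if allelic_distance(profiles[i], profiles[j]) <= t:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller so labels stay minimal.
                    parent[max(ri, rj)] = min(ri, rj)
        levels[t] = [find(i) for i in range(n)]
    return levels

# Hypothetical 4-locus profiles for four genomes.
profiles = [
    [1, 1, 1, 1],
    [1, 1, 1, 2],
    [1, 1, 2, 2],
    [5, 6, 7, 8],
]
for threshold, labels in multilevel_clusters(profiles, [0, 1, 4]).items():
    print(threshold, labels)
```

Each genome thus receives one label per threshold, so coarse and fine groupings can be read off the same table.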

HFLBSC: Heuristic and Fuzzy based Load Balanced, Scalable Clustering Algorithm for Wireless Sensor Network

10.21203/rs.3.rs-306786/v1 ◽

2021 ◽

Author(s):

Priti Maratha ◽

Kapil Gupta

Keyword(s):

Energy Consumption ◽

Clustering Algorithm ◽

Power Transmission ◽

Residual Energy ◽

Sensor Nodes ◽

Wireless Sensor ◽

Scalable Clustering ◽

Main Challenge ◽

Union Curve ◽

Load Balanced

Abstract. In spite of the severe limitations on the resources of the sensor nodes such as memory, computational power, transmission range and battery, the application areas of Wireless Sensor Networks (WSNs) are increasing day by day. The main challenge in WSNs is energy consumption, which becomes significant when a large number of nodes are deployed. Although clustering is one of the solutions to this problem, it suffers from severe energy consumption due to the non-uniform selection of cluster heads (CHs) and frequent re-clustering. In this paper, we propose a heuristic and fuzzy based load balanced, scalable clustering algorithm for WSNs called HFLBSC. In this algorithm, we segregate the network into a layered structure using the area under the intersection over union curve. We select the CHs by considering residual energy and a distance threshold. We curb frequent re-clustering by relying on a decision made with the help of fuzzy logic. Our proposed scheme extends the network lifetime. Statistical analysis and simulation results confirm the superiority of the proposed work in comparison to its competitor protocol.
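The abstract does not give the exact selection rule, so the following is only a hypothetical sketch of how cluster heads might be chosen from residual energy and a distance threshold: nodes are visited in order of decreasing residual energy and accepted if they lie at least `min_distance` from every head chosen so far. The greedy rule, the function name, and the node fields are assumptions, not the HFLBSC procedure.

```python
import math

def select_cluster_heads(nodes, min_distance):
    """Greedy illustration: prefer high residual energy, keep heads spread out."""
    heads = []
    # Visit nodes from highest to lowest residual energy.
    for node in sorted(nodes, key=lambda n: n["energy"], reverse=True):
        # Accept the node only if it is far enough from all accepted heads.
        if all(math.dist(node["pos"], h["pos"]) >= min_distance for h in heads):
            heads.append(node)
    return [h["id"] for h in heads]

# Hypothetical sensor nodes: id, (x, y) position, residual energy.
nodes = [
    {"id": "a", "pos": (0.0, 0.0), "energy": 0.9},
    {"id": "b", "pos": (1.0, 0.0), "energy": 0.8},
    {"id": "c", "pos": (10.0, 0.0), "energy": 0.5},
    {"id": "d", "pos": (10.0, 1.0), "energy": 0.7},
]
print(select_cluster_heads(nodes, min_distance=5.0))  # ['a', 'd']
```

Spacing the heads apart is one simple way to avoid the non-uniform CH placement the abstract identifies as a source of wasted energy.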

Scalable Clustering Algorithms for Big data: A Review

IEEE Access ◽

10.1109/access.2021.3084057 ◽

2021 ◽

pp. 1-1

Author(s):

Mahmoud A. Mahdi ◽

Khalid M. Hosny ◽

Ibrahim Elhenawy

Keyword(s):

Big Data ◽

Clustering Algorithms ◽

Scalable Clustering

Share-a-cab: Scalable Clustering Taxi Group Ride Stand from Huge Geolocation data

IEEE Access ◽

10.1109/access.2021.3050299 ◽

2021 ◽

pp. 1-1

Author(s):

Wenbo Zhang ◽

Satish V. Ukkusuri

Keyword(s):

Scalable Clustering

HierCC: A multi-level clustering scheme for population assignments based on core genome MLST

10.1101/2020.11.25.397539 ◽

2020 ◽

Author(s):

Zhemin Zhou ◽

Jane Charlesworth ◽

Mark Achtman

Keyword(s):

Disease Surveillance ◽

Large Scale ◽

Core Genome ◽

Supplementary Information ◽

Source Codes ◽

Link Type ◽

Scalable Clustering ◽

Population Structures ◽

Multi Level ◽

Level Cluster

Abstract. Motivation: Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present HierCC, a scalable clustering scheme based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >400,000 genomes from Salmonella, Escherichia, Yersinia and Clostridioides. Availability and implementation: http://enterobase.warwick.ac.uk/ and Source codes: https://github.com/zheminzhou/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
