scalable clustering
Recently Published Documents


Total documents: 91 (five years: 23)

H-index: 14 (five years: 2)

2021 ◽  
Author(s):  
Ruhollah Shemirani ◽  
Gillian M Belbin ◽  
Keith Burghardt ◽  
Kristina Lerman ◽  
Christy L Avery ◽  
...  

Background: Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and used it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results: We simulated 3.4 million clusters across 850 experiments with varying cluster counts, false-positive rates, and false-negative rates. The Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most of the graphs, compared to greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing 3 populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study with ~51,000 members and 2 million shared segments on Chromosome 1, resulting in the extraction of ~39 million local IBD clusters across the three populations. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions: Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach while reducing runtime by 3 orders of magnitude, making it computationally tractable for modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.
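For readers unfamiliar with Markov Clustering, the following is a minimal pure-Python sketch of the textbook MCL loop (expansion by matrix squaring, inflation by elementwise powering, column re-normalization). It is not the authors' implementation, and the function names, the inflation value of 2.0, the tolerance, and the toy graph are all illustrative assumptions; a dense-list version like this is only practical for very small graphs.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for square list-of-lists matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Textbook Markov Clustering on a small undirected graph.

    adjacency: square 0/1 (or weighted) list-of-lists matrix.
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(adjacency)
    # Add self-loops so every column has mass, then normalize.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)                      # expansion: flow spreads
        inflated = normalize_columns(                # inflation: strong flow wins
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries mass is an attractor; the columns it
    # reaches form one cluster. Duplicate rows collapse via the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph (an assumption for illustration): two triangles joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = [[0] * 6 for _ in range(6)]
for u, v in edges:
    graph[u][v] = graph[v][u] = 1
print(mcl(graph))
```

Raising the inflation parameter generally yields smaller, tighter clusters, which is the main knob to tune when comparing MCL against other community detection methods.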


2021 ◽  
Author(s):  
James Anibal ◽  
Alexandre Day ◽  
Erol Bahadiroglu ◽  
Liam O'Neill ◽  
Long Phan ◽  
...  

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single-cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/


Author(s):  
Renchi Yang ◽  
Jieming Shi ◽  
Yin Yang ◽  
Keke Huang ◽  
Shiqi Zhang ◽  
...  

Author(s):  
Zhemin Zhou ◽  
Jane Charlesworth ◽  
Mark Achtman

Abstract. Motivation: Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present pHierCC, a pipeline that defines a scalable clustering scheme, HierCC, based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >530,000 genomes from Salmonella, Escherichia/Shigella, Streptococcus, Clostridioides, Vibrio and Yersinia. Availability: Implementation: https://enterobase.warwick.ac.uk/ and source codes and instructions: https://github.com/zheminzhou/pHierCC. Supplementary information: Supplementary data are available at Bioinformatics online.
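To make the idea of multi-level cluster assignments concrete, here is a small Python sketch that groups allelic profiles at several distance thresholds. It is not the pHierCC code. It assumes single-linkage grouping (the abstract does not name the linkage rule), integer allele calls with 0 meaning "missing", and hypothetical function names and toy profiles chosen only for illustration.

```python
def allelic_distance(a, b):
    """Number of loci where both profiles have a call and the calls differ.

    Assumption: allele 0 encodes a missing call and is ignored.
    """
    return sum(1 for x, y in zip(a, b) if x and y and x != y)


def single_linkage(profiles, threshold):
    """Label each profile with the smallest index in its cluster.

    Two profiles end up in the same cluster when a chain of pairwise
    distances <= threshold connects them (union-find over all pairs).
    """
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(profiles)):
        for j in range(i):
            if allelic_distance(profiles[i], profiles[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so labels are stable.
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(profiles))]


def multi_level(profiles, thresholds):
    """One label vector per threshold, keyed by the threshold value."""
    return {t: single_linkage(profiles, t) for t in thresholds}


# Hypothetical four-locus profiles for four genomes.
profiles = [
    [1, 1, 1, 1],
    [1, 1, 1, 2],
    [1, 1, 2, 2],
    [5, 5, 5, 5],
]
levels = multi_level(profiles, thresholds=[0, 1, 4])
print(levels)
# {0: [0, 1, 2, 3], 1: [0, 0, 0, 3], 4: [0, 0, 0, 0]}
```

Labelling each cluster by its smallest member index keeps the labels of existing genomes unchanged when later genomes are appended to the list and join an existing cluster, which is one simple way to approximate the incremental behaviour the abstract describes; a newly added genome that bridges two existing clusters would still merge them.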


2021 ◽  
Author(s):  
Priti Maratha ◽  
Kapil Gupta

Abstract: Despite the severe resource limitations of sensor nodes (memory, computational power, transmission range and battery), the application areas of Wireless Sensor Networks (WSNs) are growing steadily. The main challenge in WSNs is energy consumption, which becomes significant when a large number of nodes are deployed. Although clustering is one solution to this problem, it suffers from severe energy consumption due to the non-uniform selection of cluster heads (CHs) and frequent re-clustering. In this paper, we propose a heuristic and fuzzy-based, load-balanced, scalable clustering algorithm for WSNs called HFLBSC. In this algorithm, we segregate the network into a layered structure using the area under the intersection over union curve. We select the CHs by considering residual energy and a distance threshold. We suppress frequent re-clustering by using decisions made with the help of fuzzy logic. The proposed scheme is able to extend the network lifetime. Statistical analysis and simulation results confirm the superiority of the proposed work over its competitor protocol.
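The cluster-head selection step, which considers residual energy and a distance threshold, can be illustrated with a simple greedy sketch in Python. This is not HFLBSC: it omits the layering and the fuzzy-logic re-clustering decision, and the function names, the `min_energy` and `min_separation` parameters, and the node coordinates are assumptions made up for the example.

```python
import math


def select_cluster_heads(nodes, min_energy, min_separation):
    """Greedy cluster-head (CH) selection.

    nodes: dict mapping node id -> (x, y, residual_energy).
    A node becomes a CH if its residual energy is at least min_energy and
    it lies at least min_separation away from every CH chosen so far.
    Higher-energy nodes are considered first.
    """
    heads = []
    by_energy = sorted(nodes.items(), key=lambda kv: -kv[1][2])
    for node_id, (x, y, energy) in by_energy:
        if energy < min_energy:
            continue
        if all(math.dist((x, y), nodes[h][:2]) >= min_separation for h in heads):
            heads.append(node_id)
    return heads


def assign_members(nodes, heads):
    """Attach every non-CH node to its nearest cluster head."""
    return {
        node_id: min(heads, key=lambda h: math.dist(nodes[node_id][:2], nodes[h][:2]))
        for node_id in nodes
        if node_id not in heads
    }


# Hypothetical deployment: (x, y, residual energy as a fraction of full charge).
nodes = {
    "a": (0.0, 0.0, 0.9),
    "b": (1.0, 0.0, 0.8),
    "c": (10.0, 0.0, 0.7),
    "d": (11.0, 0.0, 0.2),
}
heads = select_cluster_heads(nodes, min_energy=0.5, min_separation=5.0)
members = assign_members(nodes, heads)
print(heads)    # ['a', 'c']
print(members)  # {'b': 'a', 'd': 'c'}
```

The separation constraint spreads CHs across the field, which addresses the non-uniform CH selection the abstract identifies as a source of wasted energy.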


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Mahmoud A. Mahdi ◽  
Khalid M. Hosny ◽  
Ibrahim Elhenawy

2020 ◽  
Author(s):  
Zhemin Zhou ◽  
Jane Charlesworth ◽  
Mark Achtman

Abstract. Motivation: Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present HierCC, a scalable clustering scheme based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >400,000 genomes from Salmonella, Escherichia, Yersinia and Clostridioides. Availability: Implementation: http://enterobase.warwick.ac.uk/ and source codes: https://github.com/zheminzhou/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

