HierCC: A multi-level clustering scheme for population assignments based on core genome MLST

Abstract Motivation Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present pHierCC, a pipeline that defines a scalable clustering scheme, HierCC, based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >530,000 genomes from Salmonella, Escherichia/Shigella, Streptococcus, Clostridioides, Vibrio and Yersinia. Availability Implementation: https://enterobase.warwick.ac.uk/ and Source codes and instructions: https://github.com/zheminzhou/pHierCC Supplementary information Supplementary data are available at Bioinformatics online.
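The multi-level assignment described in this abstract can be illustrated with a minimal sketch. It assumes single-linkage clustering over pairwise allelic distances between cgMLST profiles, which the abstract itself does not specify. The function names, the use of 0 to mark a missing allele, the toy profiles and the thresholds are all hypothetical and are not taken from pHierCC.

```python
from itertools import combinations


def allelic_distance(a, b):
    """Count loci where both profiles have a called allele (non-zero) that differs."""
    return sum(1 for x, y in zip(a, b) if x and y and x != y)


def single_linkage(profiles, threshold):
    """Label each profile with the lowest index reachable through a chain of
    pairwise distances that are all <= threshold (union-find)."""
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(profiles)), 2):
        if allelic_distance(profiles[i], profiles[j]) <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the lower index as the root so labels stay stable.
                parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(profiles))]


def multi_level(profiles, thresholds):
    """One cluster labelling per threshold (illustrative, not the pHierCC API)."""
    return {t: single_linkage(profiles, t) for t in thresholds}


# Toy profiles over five loci (hypothetical data).
profiles = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 2],
    [1, 1, 2, 2, 2],
    [3, 3, 3, 3, 3],
]
levels = multi_level(profiles, thresholds=[0, 1, 2, 5])
```

In this toy case the clusters are nested: every cluster at a stricter threshold sits inside one cluster at a looser threshold, which is what makes the levels hierarchical. The sketch recomputes all pairs on each call, so it does not show the incremental assignment that the abstract mentions.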

HierCC: A multi-level clustering scheme for population assignments based on core genome MLST

10.1101/2020.11.25.397539 ◽

2020 ◽

Author(s):

Zhemin Zhou ◽

Jane Charlesworth ◽

Mark Achtman

Keyword(s):

Disease Surveillance ◽

Large Scale ◽

Core Genome ◽

Supplementary Information ◽

Source Codes ◽

Link Type ◽

Scalable Clustering ◽

Population Structures ◽

Multi Level ◽

Level Cluster

Abstract Motivation Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present HierCC, a scalable clustering scheme based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >400,000 genomes from Salmonella, Escherichia, Yersinia and Clostridioides. Availability Implementation: http://enterobase.warwick.ac.uk/ and Source codes: https://github.com/zheminzhou/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Multi-level clustering protocol for load-balanced and scalable clustering in large-scale wireless sensor networks

The Journal of Supercomputing ◽

10.1007/s11227-018-2727-5 ◽

2018 ◽

Vol 75 (7) ◽

pp. 3712-3739 ◽

Cited By ~ 2

Author(s):

Harmanpreet Singh ◽

Damanpreet Singh

Keyword(s):

Wireless Sensor Networks ◽

Sensor Networks ◽

Large Scale ◽

Wireless Sensor ◽

Scalable Clustering ◽

Clustering Protocol ◽

Multi Level ◽

Load Balanced

ADFinder: accurate detection of programmed DNA elimination using NGS high-throughput sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa226 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3632-3636 ◽

Cited By ~ 2

Author(s):

Weibo Zheng ◽

Jing Chen ◽

Thomas G Doak ◽

Weibo Song ◽

Ying Yan

Keyword(s):

High Throughput ◽

Large Scale ◽

High Throughput Sequencing ◽

Supplementary Information ◽

Sequencing Data ◽

Source Codes ◽

High Throughput Sequencing Data ◽

Dna Elimination ◽

Multiple Alternative ◽

Dna Splicing

Abstract Motivation Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms ranging from unicellular ciliates to multicellular nematodes. However, software specific for the detection of DNA splicing events is scarce. In this paper, we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs using high-throughput sequencing data. ADFinder can predict PDEs with relatively low sequencing coverage, detect multiple alternative splicing forms in the same genomic location and calculate the frequency for each splicing event. This software will facilitate research of PDEs and all down-stream analyses. Results By analyzing genome-wide DNA splicing events in two micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we prove that ADFinder is effective in predicting large scale PDEs. Availability and implementation The source codes and manual of ADFinder are available in our GitHub website: https://github.com/weibozheng/ADFinder. Supplementary information Supplementary data are available at Bioinformatics online.

Large-scale entity representation learning for biomedical relationship extraction

Bioinformatics ◽

10.1093/bioinformatics/btaa674 ◽

2020 ◽

Author(s):

Mario Sänger ◽

Ulf Leser

Keyword(s):

Large Scale ◽

Research Question ◽

Relation Extraction ◽

Representation Learning ◽

Supplementary Information ◽

Source Codes ◽

Prior Art ◽

Relationship Extraction ◽

Learning Techniques ◽

Recent Representation

Abstract Motivation The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing works focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient to judge upon the biological correctness of a relation, as experimental evidence might be weak or only valid in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore always take the complete literature into account to take a reliable decision upon a relationship. It is an open research question how to do this effectively in an automatic manner. Results We propose two novel relation extraction approaches which use recent representation learning techniques to create comprehensive models of biomedical entities or entity-pairs, respectively. These representations are learned by considering all publications from PubMed mentioning an entity or a pair. They are used as input for a neural network for classifying relations globally, i.e. the derived predictions are corpus-based, not sentence- or article based as in prior art. Experiments on the extraction of mutation–disease, drug–disease and drug–drug relationships show that the learned embeddings indeed capture semantic information of the entities under study and outperform traditional methods by 4–29% regarding F1 score. Availability and implementation Source codes are available at: https://github.com/mariosaenger/bio-re-with-entity-embeddings. Supplementary information Supplementary data are available at Bioinformatics online.

The three paradoxes of the energy transition - Assessing sustainability of large-scale solar photovoltaic through multi-level and multi-scalar perspective in Rwanda

Journal of Cleaner Production ◽

10.1016/j.jclepro.2020.125519 ◽

2021 ◽

Vol 288 ◽

pp. 125519

Author(s):

Carole Brunet ◽

Oumarou Savadogo ◽

Pierre Baptiste ◽

Michel A. Bouchard ◽

Céline Cholez ◽

...

Keyword(s):

Large Scale ◽

Energy Transition ◽

Solar Photovoltaic ◽

Multi Level

A new multi-level algorithm for balanced partition problem on large scale directed graphs

Advances in Aerodynamics ◽

10.1186/s42774-021-00074-x ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Xianyue Li ◽

Yufei Pang ◽

Chenxia Zhao ◽

Yang Liu ◽

Qingzhen Dong

Keyword(s):

Large Scale ◽

Directed Graphs ◽

Vlsi Design ◽

Graph Partition ◽

Partition Problem ◽

Multi Level ◽

The Stability ◽

Recursive Partition ◽

Balanced Partition ◽

Partition Method

Abstract Graph partition is a classical combinatorial optimization and graph theory problem, and it has a lot of applications, such as scientific computing, VLSI design and clustering etc. In this paper, we study the partition problem on large scale directed graphs under a new objective function, a new instance of graph partition problem. We firstly propose the modeling of this problem, then design an algorithm based on multi-level strategy and recursive partition method, and finally do a lot of simulation experiments. The experimental results verify the stability of our algorithm and show that our algorithm has the same good performance as METIS. In addition, our algorithm is better than METIS on unbalanced ratio.

TreeMerge: a new method for improving the scalability of species tree estimation methods

Bioinformatics ◽

10.1093/bioinformatics/btz344 ◽

2019 ◽

Vol 35 (14) ◽

pp. i417-i426 ◽

Cited By ~ 7

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Large Scale ◽

Species Tree ◽

New Method ◽

Divide And Conquer ◽

Supplementary Information ◽

Estimation Methods ◽

Running Time ◽

Tree Estimation ◽

Computationally Intensive ◽

A Minor

Abstract Motivation At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework, only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). Supplementary information Supplementary data are available at Bioinformatics online.

Large scale gene regulatory network inference with a multi-level strategy

Molecular BioSystems ◽

10.1039/c5mb00560d ◽

2016 ◽

Vol 12 (2) ◽

pp. 588-597 ◽

Cited By ~ 14

Author(s):

Jun Wu ◽

Xiaodong Zhao ◽

Zongli Lin ◽

Zhifeng Shao

Keyword(s):

Gene Regulatory Network ◽

Regulatory Network ◽

Large Scale ◽

Network Inference ◽

Biological Processes ◽

Molecular Processes ◽

Gene Regulatory Network Inference ◽

Cell Functions ◽

Multi Level ◽

Gene Regulatory

Transcriptional regulation is a basis of many crucial molecular processes and an accurate inference of the gene regulatory network is a helpful and essential task to understand cell functions and gain insights into biological processes of interest in systems biology.

EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences

Bioinformatics ◽

10.1093/bioinformatics/btab025 ◽

2021 ◽

Author(s):

Ting-Hsuan Wang ◽

Cheng-Ching Huang ◽

Jui-Hung Hung

Keyword(s):

Open Source Software ◽

Large Scale ◽

A Priori ◽

Supplementary Information ◽

Supplementary Data ◽

Comparable Accuracy ◽

Meta Analyses ◽

Next Generation Sequencing Ngs ◽

Adapter Trimming ◽

Generation Sequencing

Abstract Motivation Cross-sample comparisons or large-scale meta-analyses based on the next generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e. adapter trimming). While modern adapter trimmers require users to provide candidate adapter sequences for each sample, which are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA), large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming. Results Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading to accelerate its speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of adapter sequences, can reach comparable accuracy and higher throughput than that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated in any sequence analysis pipelines in all scales. Availability and implementation EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information Supplementary data are available at Bioinformatics online.

Structure-aware protein–protein interaction site prediction using deep graph convolutional network

Bioinformatics ◽

10.1093/bioinformatics/btab643 ◽

2021 ◽

Author(s):

Qianmu Yuan ◽

Jianwen Chen ◽

Huiying Zhao ◽

Yaoqi Zhou ◽

Yuedong Yang

Keyword(s):

Protein Interactions ◽

Spatial Information ◽

Screening Tools ◽

Supplementary Information ◽

Protein Protein Interactions ◽

Convolutional Network ◽

Source Codes ◽

Site Prediction ◽

Protein Protein Interaction ◽

Mapping Techniques

Abstract Motivation Protein–protein interactions (PPI) play crucial roles in many biological processes, and identifying PPI sites is an important step for mechanistic understanding of diseases and design of novel drugs. Since experimental approaches for PPI site identification are expensive and time-consuming, many computational methods have been developed as screening tools. However, these methods are mostly based on neighbored features in sequence, and thus limited to capture spatial information. Results We propose a deep graph-based framework deep Graph convolutional network for Protein–Protein-Interacting Site prediction (GraphPPIS) for PPI site prediction, where the PPI site prediction problem was converted into a graph node classification task and solved by deep learning using the initial residual and identity mapping techniques. We showed that a deeper architecture (up to eight layers) allows significant performance improvement over other sequence-based and structure-based methods by more than 12.5% and 10.5% on AUPRC and MCC, respectively. Further analyses indicated that the predicted interacting sites by GraphPPIS are more spatially clustered and closer to the native ones even when false-positive predictions are made. The results highlight the importance of capturing spatially neighboring residues for interacting site prediction. Availability and implementation The datasets, the pre-computed features, and the source codes along with the pre-trained models of GraphPPIS are available at https://github.com/biomed-AI/GraphPPIS. The GraphPPIS web server is freely available at https://biomed.nscc-gz.cn/apps/GraphPPIS. Supplementary information Supplementary data are available at Bioinformatics online.
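The "initial residual and identity mapping" techniques named in this abstract are commonly associated with the GCNII layer update, sketched below in NumPy. This is a generic illustration of that update, not the GraphPPIS implementation. The function names, the symmetric normalization with self-loops, and the alpha and beta values are assumptions made for the example.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetrically normalize an adjacency matrix after adding self-loops."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcnii_layer(h, h0, p, w, alpha, beta):
    """One graph convolution step with the two techniques.

    Initial residual: mix the propagated features p @ h with the first-layer
    features h0, weighted by alpha.
    Identity mapping: mix the identity with the weight matrix w, weighted by beta.
    """
    support = (1.0 - alpha) * (p @ h) + alpha * h0
    out = (1.0 - beta) * support + beta * (support @ w)
    return np.maximum(out, 0.0)  # ReLU


# Toy graph: 3 residues in a chain, 4 hidden features (hypothetical sizes).
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
p = normalized_adjacency(adj)
h0 = rng.normal(size=(3, 4))
w = rng.normal(size=(4, 4))

h = h0
for _ in range(8):  # the abstract mentions up to eight layers
    h = gcnii_layer(h, h0, p, w, alpha=0.1, beta=0.5)
```

Because each layer mixes in h0 and keeps part of the identity, stacked layers retain some of the input signal. That is the usual motivation for these two techniques when building deeper graph networks. A real model would use a separate learned weight matrix per layer, whereas this sketch reuses one random matrix for brevity.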
