scholarly journals HierCC: A multi-level clustering scheme for population assignments based on core genome MLST

Author(s):  
Zhemin Zhou ◽  
Jane Charlesworth ◽  
Mark Achtman

Abstract Motivation Routine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present pHierCC, a pipeline that defines a scalable clustering scheme, HierCC, based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >530,000 genomes from Salmonella, Escherichia/Shigella, Streptococcus, Clostridioides, Vibrio and Yersinia. Availability Implementation: https://enterobase.warwick.ac.uk/ and Source codes and instructions: https://github.com/zheminzhou/pHierCC Supplementary information Supplementary data are available at Bioinformatics online.

2020 ◽  
Author(s):  
Zhemin Zhou ◽  
Jane Charlesworth ◽  
Mark Achtman

AbstractMotivationRoutine infectious disease surveillance is increasingly based on large-scale whole genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present HierCC, a scalable clustering scheme based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018, and has since genotyped >400,000 genomes from Salmonella, Escherichia, Yersinia and Clostridioides.AvailabilityImplementation: http://enterobase.warwick.ac.uk/ and Source codes: https://github.com/zheminzhou/[email protected] informationSupplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (12) ◽  
pp. 3632-3636 ◽  
Author(s):  
Weibo Zheng ◽  
Jing Chen ◽  
Thomas G Doak ◽  
Weibo Song ◽  
Ying Yan

Abstract Motivation Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms ranging from unicellular ciliates to multicellular nematodes. However, software specific for the detection of DNA splicing events is scarce. In this paper, we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs using high-throughput sequencing data. ADFinder can predict PDEs with relatively low sequencing coverage, detect multiple alternative splicing forms in the same genomic location and calculate the frequency for each splicing event. This software will facilitate research of PDEs and all down-stream analyses. Results By analyzing genome-wide DNA splicing events in two micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we prove that ADFinder is effective in predicting large scale PDEs. Availability and implementation The source codes and manual of ADFinder are available in our GitHub website: https://github.com/weibozheng/ADFinder. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Mario Sänger ◽  
Ulf Leser

Abstract Motivation The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing works focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient to judge upon the biological correctness of a relation, as experimental evidence might be weak or only valid in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore always take the complete literature into account to take a reliable decision upon a relationship. It is an open research question how to do this effectively in an automatic manner. Results We propose two novel relation extraction approaches which use recent representation learning techniques to create comprehensive models of biomedical entities or entity-pairs, respectively. These representations are learned by considering all publications from PubMed mentioning an entity or a pair. They are used as input for a neural network for classifying relations globally, i.e. the derived predictions are corpus-based, not sentence- or article based as in prior art. Experiments on the extraction of mutation–disease, drug–disease and drug–drug relationships show that the learned embeddings indeed capture semantic information of the entities under study and outperform traditional methods by 4–29% regarding F1 score. Availability and implementation Source codes are available at: https://github.com/mariosaenger/bio-re-with-entity-embeddings. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 288 ◽  
pp. 125519
Author(s):  
Carole Brunet ◽  
Oumarou Savadogo ◽  
Pierre Baptiste ◽  
Michel A. Bouchard ◽  
Céline Cholez ◽  
...  

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Xianyue Li ◽  
Yufei Pang ◽  
Chenxia Zhao ◽  
Yang Liu ◽  
Qingzhen Dong

AbstractGraph partition is a classical combinatorial optimization and graph theory problem, and it has a lot of applications, such as scientific computing, VLSI design and clustering etc. In this paper, we study the partition problem on large scale directed graphs under a new objective function, a new instance of graph partition problem. We firstly propose the modeling of this problem, then design an algorithm based on multi-level strategy and recursive partition method, and finally do a lot of simulation experiments. The experimental results verify the stability of our algorithm and show that our algorithm has the same good performance as METIS. In addition, our algorithm is better than METIS on unbalanced ratio.


2019 ◽  
Vol 35 (14) ◽  
pp. i417-i426 ◽  
Author(s):  
Erin K Molloy ◽  
Tandy Warnow

Abstract Motivation At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n5) running time for datasets with n species. Results Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework—only O(n2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). Supplementary information Supplementary data are available at Bioinformatics online.


2016 ◽  
Vol 12 (2) ◽  
pp. 588-597 ◽  
Author(s):  
Jun Wu ◽  
Xiaodong Zhao ◽  
Zongli Lin ◽  
Zhifeng Shao

Transcriptional regulation is a basis of many crucial molecular processes and an accurate inference of the gene regulatory network is a helpful and essential task to understand cell functions and gain insights into biological processes of interest in systems biology.


Author(s):  
Ting-Hsuan Wang ◽  
Cheng-Ching Huang ◽  
Jui-Hung Hung

Abstract Motivation Cross-sample comparisons or large-scale meta-analyses based on the next generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e. adapter trimming). While modern adapter trimmers require users to provide candidate adapter sequences for each sample, which are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA), large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming. Results Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading to accelerate its speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of adapter sequences, can reach comparable accuracy and higher throughput than that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated in any sequence analysis pipelines in all scales. Availability and implementation EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Qianmu Yuan ◽  
Jianwen Chen ◽  
Huiying Zhao ◽  
Yaoqi Zhou ◽  
Yuedong Yang

Abstract Motivation Protein–protein interactions (PPI) play crucial roles in many biological processes, and identifying PPI sites is an important step for mechanistic understanding of diseases and design of novel drugs. Since experimental approaches for PPI site identification are expensive and time-consuming, many computational methods have been developed as screening tools. However, these methods are mostly based on neighbored features in sequence, and thus limited to capture spatial information. Results We propose a deep graph-based framework deep Graph convolutional network for Protein–Protein-Interacting Site prediction (GraphPPIS) for PPI site prediction, where the PPI site prediction problem was converted into a graph node classification task and solved by deep learning using the initial residual and identity mapping techniques. We showed that a deeper architecture (up to eight layers) allows significant performance improvement over other sequence-based and structure-based methods by more than 12.5% and 10.5% on AUPRC and MCC, respectively. Further analyses indicated that the predicted interacting sites by GraphPPIS are more spatially clustered and closer to the native ones even when false-positive predictions are made. The results highlight the importance of capturing spatially neighboring residues for interacting site prediction. Availability and implementation The datasets, the pre-computed features, and the source codes along with the pre-trained models of GraphPPIS are available at https://github.com/biomed-AI/GraphPPIS. The GraphPPIS web server is freely available at https://biomed.nscc-gz.cn/apps/GraphPPIS. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document