Bacmeta: simulation for genomic evolution in bacterial metapopulations

2017
Author(s):
Aleksi Sipola
Pekka Marttinen
Jukka Corander

Abstract. The advent of genomic data from densely sampled bacterial populations has created a need for flexible simulators by which models and hypotheses can be efficiently investigated in the light of empirical observations. Bacmeta provides fast stochastic simulation of neutral evolution within a large collection of interconnected bacterial populations with a completely adjustable connectivity network. Stochastic events of mutations, recombinations, insertions/deletions, migrations and microepidemics can be simulated in discrete non-overlapping generations with a Wright-Fisher model that operates on explicit sequence data of any desired genome length. Each model component, including locus, bacterial strain, population, and ultimately the whole metapopulation, is efficiently simulated using C++ objects, and detailed metadata from each level of the simulation can be acquired. The software can be executed in a cluster environment using simple textual input files, enabling, e.g., large-scale simulations and likelihood-free inference. Bacmeta is implemented in C++ for Linux, Mac and Windows. It is available at https://bitbucket.org/aleksisipola/bacmeta under the BSD 3-clause [email protected],[email protected] Supplementary information: Supplementary data are available online at bioRxiv.
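The Wright-Fisher scheme described in the abstract above (discrete, non-overlapping generations acting on explicit sequences) can be illustrated with a minimal Python sketch. This is not Bacmeta's code or interface: the function names, the parameters and the restriction to a single population with point mutations only are assumptions made for illustration. Recombination, insertions/deletions, migration and microepidemics are left out.

```python
import random

ALPHABET = "ACGT"


def wright_fisher_step(population, mutation_rate, rng):
    """Advance one non-overlapping generation: resample parents, then mutate."""
    n = len(population)
    # Neutral resampling: every offspring copies a parent chosen uniformly at random.
    offspring = [population[rng.randrange(n)] for _ in range(n)]
    mutated = []
    for seq in offspring:
        bases = list(seq)
        for i, base in enumerate(bases):
            if rng.random() < mutation_rate:
                # Point mutation: substitute a different nucleotide at this site.
                bases[i] = rng.choice([b for b in ALPHABET if b != base])
        mutated.append("".join(bases))
    return mutated


def simulate(pop_size, genome_length, generations, mutation_rate, seed=0):
    """Evolve a clonal population (hypothetical helper, single population only)."""
    rng = random.Random(seed)
    ancestor = "".join(rng.choice(ALPHABET) for _ in range(genome_length))
    population = [ancestor] * pop_size
    for _ in range(generations):
        population = wright_fisher_step(population, mutation_rate, rng)
    return population


if __name__ == "__main__":
    final = simulate(pop_size=50, genome_length=40, generations=100, mutation_rate=1e-3)
    print(len(set(final)), "distinct sequences in the final generation")
```

A metapopulation in this picture would hold several such populations and move individuals between them according to a connectivity matrix after each generation; that extension is omitted here.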


2018
Author(s):
Lucas Czech
Alexandros Stamatakis

Abstract. Motivation: In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do, however, face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results. Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence data sets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. Implementation: Freely available under GPLv3 at http://github.com/lczech/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.



2017
Author(s):
Louis Gauthier
Rémicia Di Franco
Adrian W.R. Serohijos

Abstract. Motivation: Simulating protein evolution with realistic constraints from population genetics is essential in addressing problems in molecular evolution, from understanding the forces shaping the evolutionary landscape to the clinical challenges of antibiotic resistance, viral evolution and cancer. Results: To address this need, we present SodaPop, a new forward-time simulator of large asexual populations aimed at studying their structure, dynamics and the distribution of fitness effects with flexible assumptions on the fitness landscape. SodaPop integrates biochemical and biophysical properties in a cell-based, object-oriented framework and provides an efficient, open-source toolkit for performing large-scale simulations of protein evolution. Availability and implementation: Source code and binaries are freely available at https://github.com/louisgt/SodaPop under the GNU GPLv3 license. The software is implemented in C++ and supported on Linux, Mac OS/X and [email protected] Supplementary information: Supplementary information is available on the GitHub project page.



2020
Vol 36 (12)
pp. 3874-3876
Author(s):
Sergio Arredondo-Alonso
Martin Bootsma
Yaïr Hein
Malbert R C Rogers
Jukka Corander
...

Abstract. Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and network partitioning based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data. Availability and implementation: Gplas is written in R and Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/gplas.git. Supplementary information: Supplementary data are available at Bioinformatics online.



2017
Author(s):
Mirco Michel
David Menéndez Hurtado
Karolis Uziela
Arne Elofsson

Abstract. Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known. Results: We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that can be reliably identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Of these, 415 have not been reported before. Availability: Datasets as well as models of all the 558 Pfam families are available at http://c3.pcons.net/. All programs used here are freely [email protected] Supplementary information: No supplementary data.



2020
Author(s):
Yang Young Lu
Jiaxing Bai
Yiwen Wang
Ying Wang
Fengzhu Sun

Abstract. Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2- to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to that of six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
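The memory problem that motivates the abstract above can be made concrete with a short Python sketch of plain k-mer frequency comparison. It is not CRAFT's method (CRAFT learns a compact embedding instead), and the function names are hypothetical. It shows only the baseline: a frequency profile per sequence, a cosine distance between profiles, and the 4^k size of the dense vector over the DNA alphabet.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Return the relative frequency of every k-mer that occurs in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_distance(f, g):
    """Cosine distance between two sparse k-mer frequency profiles."""
    norm_f = math.sqrt(sum(v * v for v in f.values()))
    norm_g = math.sqrt(sum(v * v for v in g.values()))
    if norm_f == 0.0 or norm_g == 0.0:
        raise ValueError("a profile is empty; the sequence is shorter than k")
    dot = sum(v * g.get(kmer, 0.0) for kmer, v in f.items())
    return 1.0 - dot / (norm_f * norm_g)


def dense_vector_size(k):
    """Number of entries in a dense k-mer vector over the alphabet {A, C, G, T}."""
    return 4 ** k


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTTGCA", 3)
    b = kmer_frequencies("ACGTTCGTAGCA", 3)
    print("cosine distance:", cosine_distance(a, b))
    print("dense entries for k = 12:", dense_vector_size(12))
```

Because the dense vector grows as 4^k, storing one per archived genome becomes costly for large k, which is the storage pressure a learned low-dimensional embedding is meant to relieve.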



2017
Author(s):
Caroline Ross
Bilal Nizami
Michael Glenister
Olivier Sheik Amamuddy
Ali Rana Atilgan
...

Abstract. Summary: MODE-TASK, a novel software suite, comprises Principal Component Analysis, Multidimensional Scaling, and t-Distributed Stochastic Neighbor Embedding techniques using molecular dynamics trajectories. MODE-TASK also includes a Normal Mode Analysis tool based on the Anisotropic Network Model so as to provide a variety of ways to analyse and compare large-scale motions of protein complexes for which long MD simulations are prohibitive. Availability and implementation: MODE-TASK has been open-sourced, and is available for download from https://github.com/RUBi-ZA/MODE-TASK, implemented in Python and C++. Supplementary information: Documentation available at http://mode-task.readthedocs.io.



2021
Author(s):
Yang Young Lu
Yiwen Wang
Fang Zhang
Jiaxing Bai
Ying Wang

Abstract. Motivation: Understanding the phylogenetic relationship among organisms is key in contemporary evolutionary study, and sequence analysis is the workhorse towards this goal. Conventional approaches to sequence analysis are based on sequence alignment, which is neither scalable to large-scale datasets due to computational inefficiency nor adaptive to next-generation sequencing (NGS) data. Alignment-free approaches are typically used as computationally effective alternatives, yet they still suffer from high memory consumption. A desirable sequence comparison method at large scale requires succinctly organized sequence data management, as well as prompt sequence retrieval given a never-before-seen sequence as query. Results: In this paper, we propose a novel approach, referred to as SAINT, for efficient and accurate alignment-free sequence comparison. Compared to existing alignment-free sequence comparison methods, SAINT offers advantages in two aspects: (1) SAINT is a weakly-supervised learning method where the embedding function is learned automatically from easily acquired data; (2) SAINT utilizes a non-linear deep learning-based model which potentially better captures the complicated relationship among genome sequences. We have applied SAINT to real-world datasets to demonstrate its empirical utility, both qualitatively and quantitatively. Considering the extensive applicability of alignment-free sequence comparison methods, we expect SAINT to motivate a more extensive set of applications in sequence comparison at large scale. Availability: The open source, Apache licensed, Python-implemented code will be available upon acceptance. Supplementary information: Supplementary data are available at Bioinformatics online.



2017
Author(s):
Bo Wang
Daniele Ramazzotti
Luca De Sano
Junjie Zhu
Emma Pierson
...

Abstract. Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Availability and implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.



2017
Author(s):  
Halil Kilicoglu

Abstract. An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted, due to problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the end result of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part towards enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload, and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can add checks and balances that promote responsible research practices and can provide significant benefits for the biomedical research enterprise. Supplementary information: Supplementary material is available at bioRxiv.



2020
Vol 36 (12)
pp. 3841-3848
Author(s):  
Michael Gruenstaeudl

Abstract. Motivation: The submission of annotated sequence data to public sequence databases constitutes a central pillar in biological research. The surge of novel DNA sequences awaiting database submission due to the application of next-generation sequencing has increased the need for software tools that facilitate bulk submissions. This need has yet to be met with the concurrent development of tools to automate the preparatory work preceding such submissions. Results: The author introduces annonex2embl, a Python package that automates the preparation of complete sequence flatfiles for large-scale sequence submissions to the European Nucleotide Archive. The tool enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles. Among other features, the software automatically accounts for length differences among the input sequences while maintaining correct annotations, automatically interlaces metadata to each record and displays a design suitable for easy integration into bioinformatic workflows. As proof of its utility, annonex2embl is employed in preparing a dataset of more than 1500 fungal DNA sequences for database submission. Availability and implementation: annonex2embl is freely available via the Python package index at http://pypi.python.org/pypi/annonex2embl. Supplementary information: Supplementary data are available at Bioinformatics online.
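The abstract above mentions accounting for length differences among aligned sequences while keeping annotations correct. The idea can be sketched in Python as removing gap characters from one aligned sequence and shifting feature coordinates accordingly. This is an illustrative assumption about the general technique, not annonex2embl's actual code or API. The function name and the feature tuple layout (name, start, end, with 0-based half-open alignment columns) are hypothetical.

```python
def degap_with_annotations(aligned_seq, features, gap="-"):
    """Remove gaps from an aligned sequence and remap feature coordinates.

    features is a list of (name, start, end) tuples given as 0-based,
    half-open alignment columns. Features that cover only gap columns
    in this sequence are dropped.
    """
    # prefix[i] counts the non-gap characters before alignment column i.
    prefix = [0]
    for ch in aligned_seq:
        prefix.append(prefix[-1] + (1 if ch != gap else 0))

    ungapped = aligned_seq.replace(gap, "")
    remapped = []
    for name, start, end in features:
        new_start, new_end = prefix[start], prefix[end]
        if new_end > new_start:
            remapped.append((name, new_start, new_end))
    return ungapped, remapped


if __name__ == "__main__":
    seq, feats = degap_with_annotations("AC--GTA", [("gene", 2, 6), ("spacer", 2, 4)])
    for name, start, end in feats:
        print(name, start, end, seq[start:end])
```

Each sequence in the alignment is processed independently, so the same alignment-level annotation yields record-specific coordinates for sequences of different ungapped lengths.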


