ZGA: a flexible pipeline for read processing, de novo assembly and annotation of prokaryotic genomes

2021 ◽  
Author(s):  
A.A. Korzhenkov

Abstract: Whole genome sequencing (WGS) has become a routine method and can be applied to a wide spectrum of scientific problems. Although genome sequencing itself is increasingly available, genome assembly and annotation can be a challenge for an inexperienced researcher. To solve this problem, a bioinformatic pipeline was developed to guide a user from raw sequencing reads to an annotated bacterial or archaeal genome ready for deposition to any INSDC database (NCBI, ENA or DDBJ). The pipeline is fully automated and does not require an internet connection after installation, which prevents data leakage and premature publication of genome sequences. The source code of the pipeline is freely available at https://github.com/laxeye/zga/. The software may be installed from popular repositories: Anaconda Cloud (https://anaconda.org/bioconda/zga/) and PyPI (https://pypi.org/project/zga/).

2016 ◽  
Vol 4 (6) ◽  
Author(s):  
Elena Geiser ◽  
Florian Ludwig ◽  
Thiemo Zambanini ◽  
Nick Wierckx ◽  
Lars M. Blank

Some smut fungi of the family Ustilaginaceae produce itaconate from glucose. De novo genome sequencing of nine itaconate-producing Ustilaginaceae revealed genome sizes between 19 and 25 Mbp. Comparison to the itaconate cluster of U. maydis MB215 revealed all essential genes for itaconate production, which contributes to metabolic engineering for improving itaconate production.


2018 ◽  
Author(s):  
Roc Reguant ◽  
Yevgeniy Antipin ◽  
Rob Sheridan ◽  
Augustin Luna ◽  
Chris Sander

Abstract
Summary: AlignmentViewer is a multiple sequence alignment viewer for protein families with flexible visualization, analysis tools and links to protein family databases. It is directly accessible in web browsers without the need for software installation, as it is implemented in JavaScript, and does not require an internet connection to function. It can handle protein families with tens of thousands of sequences and is particularly suitable for evolutionary coupling analysis, facilitating the computation of protein 3D structures and the detection of functionally constrained interactions.
Availability and Implementation: AlignmentViewer is open source software under the MIT license. The viewer is at http://alignmentviewer.org and the source code, documentation and issue tracking, for co-development, are at https://github.com/dfci/[email protected], reaches all authors


2018 ◽  
Author(s):  
Kim Lee Ng ◽  
Thor Bech Johannesen ◽  
Mark Østerlund ◽  
Kristoffer Kiil ◽  
Paal Skytt Andersen ◽  
...  

Abstract: Whole-genome sequencing is becoming the method of choice but provides redundant data for many tasks. ReadFilter (https://github.com/ssi-dk/serum_readfilter) is offered as a way to improve the run time of these tasks by rapidly filtering reads against user-specified sequences, so that only a small fraction of the original reads is processed while accuracy is maintained. This can noticeably reduce mapping time and substantially reduce de novo assembly time.
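The idea of filtering reads against user-specified sequences can be illustrated with a toy k-mer matcher. This is only a sketch under stated assumptions: the abstract does not say how ReadFilter matches reads, so the shared k-mer criterion, the function names `kmers` and `filter_reads`, and the parameters `k` and `min_hits` are all hypothetical and chosen for illustration.

```python
from typing import Iterable, Iterator, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(
    reads: Iterable[Tuple[str, str]],
    targets: Iterable[str],
    k: int = 5,
    min_hits: int = 1,
) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs that share at least min_hits k-mers
    with any target sequence. Assumption: exact k-mer sharing stands in
    for whatever matching the real tool performs."""
    target_kmers: Set[str] = set()
    for target in targets:
        target_kmers |= kmers(target, k)
    for name, seq in reads:
        if len(kmers(seq, k) & target_kmers) >= min_hits:
            yield name, seq


# Hypothetical toy data: one read overlaps the target, one does not.
targets = ["ACGTACGTAC"]
reads = [("r1", "TTACGTACGG"), ("r2", "GGGGGGGGGG")]
kept = list(filter_reads(reads, targets, k=5))
print([name for name, _ in kept])
```

Only the retained reads would then be passed on to mapping or assembly, which is where the run-time saving described in the abstract would come from.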


2021 ◽  
Vol 7 (7) ◽  
Author(s):  
Casper Jamin ◽  
Sien De Koster ◽  
Stefanie van Koeveringe ◽  
Dieter De Coninck ◽  
Klaas Mensaert ◽  
...  

Whole-genome sequencing (WGS) is becoming the de facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. However, interoperability for WGS of bacterial outbreaks is poorly understood. We hypothesized that harmonization of WGS for outbreak surveillance is achievable through the use of identical protocols for both data generation and data analysis. A set of 30 bacterial isolates, comprising various species belonging to the family Enterobacteriaceae and the genus Enterococcus, were selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data were analysed by one centre using BioNumerics (6.7.3) for (i) genotyping origins of replication and antimicrobial resistance genes, (ii) core-genome multi-locus sequence typing (cgMLST) for Escherichia coli and Klebsiella pneumoniae and (iii) whole-genome multi-locus sequence typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% were achieved for genotyping. Based on cgMLST, a discrepant allele was called in only 2/27 and 3/15 comparisons between two genomes, for E. coli and K. pneumoniae, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0 to 11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonizing data processing in the surveillance of bacterial outbreaks. In summary, multi-centre WGS for bacterial surveillance is achievable, but only if protocols are harmonized.
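Counting discrepant alleles between two typing profiles of the same isolate can be sketched as follows. This is a minimal illustration, not the BioNumerics procedure: the profile layout (locus name mapped to an allele number), the function name `discrepant_alleles`, and the rule that loci missing in either profile are ignored are all assumptions.

```python
from typing import Dict, Optional

Profile = Dict[str, Optional[int]]


def discrepant_alleles(profile_a: Profile, profile_b: Profile) -> int:
    """Count loci called in both profiles whose allele numbers differ.
    Assumption: a locus that is absent or None in either profile is
    skipped rather than counted as a difference."""
    count = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue
        if allele_a != allele_b:
            count += 1
    return count


# Hypothetical profiles for one isolate sequenced at two centres.
centre_a = {"locus1": 3, "locus2": 7, "locus3": 12, "locus4": None}
centre_b = {"locus1": 3, "locus2": 8, "locus3": 12, "locus4": 5}
n_diff = discrepant_alleles(centre_a, centre_b)
print(n_diff)
```

How missing loci are treated changes the count, so any real comparison would need that rule fixed in the shared protocol, in line with the harmonization argument of the abstract.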


2021 ◽  
Author(s):  
Phuoc Truong Nguyen ◽  
Ilya Plyusnin ◽  
Tarja Sironen ◽  
Olli Vapalahti ◽  
Ravi Kant ◽  
...  

Abstract
Background: SARS-CoV-2 related research has increased in importance worldwide since December 2019. Several new variants of SARS-CoV-2 have emerged globally, of which the most notable and concerning currently are the UK variant B.1.1.7, the South African variant B.1.351 and the Brazilian variant P.1. Detecting and monitoring novel variants is essential in SARS-CoV-2 surveillance. While there are several tools for assembling virus genomes and performing lineage analyses to investigate SARS-CoV-2, each is limited to performing a single function or a few functions separately.
Results: Due to the lack of publicly available pipelines that could perform fast reference-based assemblies on raw SARS-CoV-2 sequences in addition to identifying lineages to detect variants of concern, we have developed an open source bioinformatic pipeline called HaVoC (Helsinki university Analyzer for Variants Of Concern). HaVoC can reference assemble raw sequence reads and assign the corresponding lineages to SARS-CoV-2 sequences.
Conclusions: HaVoC is a pipeline utilizing several bioinformatic tools to perform multiple necessary analyses for investigating genetic variance among SARS-CoV-2 samples. The pipeline is particularly useful for those who need a more accessible and fast tool to detect and monitor the spread of SARS-CoV-2 variants of concern during local outbreaks. HaVoC is currently being used in Finland for monitoring the spread of SARS-CoV-2 variants. HaVoC user manual and source code are available at https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc, respectively.


2019 ◽  
Author(s):  
Jullien M. Flynn ◽  
Robert Hubley ◽  
Clément Goubert ◽  
Jeb Rosen ◽  
Andrew G. Clark ◽  
...  

Abstract: The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences.
RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).
Significance: Genome sequences are being produced for more and more eukaryotic species. The bulk of these genomes is composed of parasitic, self-mobilizing transposable elements (TEs) that play important roles in organismal evolution. Thus there is a pressing need for developing software that can accurately identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package for the curation of reference TE libraries which can be applied to any eukaryotic species. Through several major improvements over the previous version, RepeatModeler2 is able to produce libraries that recapitulate the known composition of three model species with some of the most complex TE landscapes. Thus RepeatModeler2 will greatly enhance the discovery and annotation of TEs in genome sequences.


2020 ◽  
Vol 70 (11) ◽  
pp. 5958-5963
Author(s):  
Yuh Morimoto ◽  
Mari Tohya ◽  
Zulipiya Aibibula ◽  
Tadashi Baba ◽  
Hiroyuki Daida ◽  
...  

The taxonomic classification of Pseudomonas species has been revised and updated several times. This study utilized average nucleotide identity (ANI) and digital DNA–DNA hybridization (dDDH) cutoff values of 95% and 70%, respectively, to re-identify the species of strains deposited in GenBank as P. aeruginosa, P. fluorescens and P. putida. Of the 264 deposited P. aeruginosa strains, 259 were correctly identified as P. aeruginosa, but the remaining five were not. All 28 deposited P. fluorescens strains had been incorrectly identified as P. fluorescens. Four of these strains were re-identified, including two as P. kilonensis and one each as P. aeruginosa and P. brassicacearum, but the remaining 24 could not be re-identified. Similarly, all 35 deposited P. putida strains had been incorrectly identified as P. putida. Nineteen of these strains were re-identified, including 12 as P. alloputida, four as P. asiatica and one each as P. juntendi, P. monteilii and P. mosselii. These results strongly suggest that Pseudomonas bacteria should be identified using ANI and dDDH analyses based on whole genome sequencing when Pseudomonas species are initially deposited in GenBank/DDBJ/EMBL databases.
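The cutoff rule described above can be written as a small check. This is a sketch under assumptions: the abstract gives only the cutoff values (95% ANI, 70% dDDH), so treating the cutoffs as inclusive and requiring both to be met is an illustrative choice, and the function name `same_species` and the example values are hypothetical.

```python
ANI_CUTOFF = 95.0   # percent, from the abstract
DDDH_CUTOFF = 70.0  # percent, from the abstract


def same_species(ani: float, dddh: float) -> bool:
    """Return True when a strain meets both cutoffs against a type strain.
    Assumption: cutoffs are inclusive and both must be satisfied."""
    return ani >= ANI_CUTOFF and dddh >= DDDH_CUTOFF


# Hypothetical comparisons of deposited strains against a type strain.
comparisons = {
    "strain_A": (98.7, 89.2),  # clearly the same species
    "strain_B": (93.1, 51.4),  # below both cutoffs
    "strain_C": (95.4, 64.0),  # ANI passes, dDDH does not
}
calls = {name: same_species(ani, dddh) for name, (ani, dddh) in comparisons.items()}
print(calls)
```

A strain that fails the check against every available type strain would correspond to the "could not be re-identified" cases reported in the abstract.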


2016 ◽  
Author(s):  
Julien Delafontaine ◽  
Alexandre Masselot ◽  
Robin Liechti ◽  
Dmitry Kuznetsov ◽  
Ioannis Xenarios ◽  
...  

Abstract
Summary: Varapp is an open-source web application to filter variants from large sets of exome data stored in a relational database. Varapp offers a reactive graphical user interface, very fast data processing, security and facility to save, reproduce and share results. Typically, a few seconds suffice to apply non-trivial filters to a set of half a million variants and extract a handful of potential clinically relevant targets. Varapp implements different scenarios for Mendelian diseases (dominant, recessive, de novo, X-linked, and compound heterozygous), and allows searching for variants in genes or chromosomal regions of interest.
Availability: The application is made of a JavaScript front-end and a Python back-end. Its source code is hosted at https://github.com/varapp. A demo version is available at https://varapp-demo.vital-it.ch. The full documentation can be found at https://varapp-demo.vital-it.ch/docs.
Contact: [email protected]


2021 ◽  
Vol 10 (16) ◽  
Author(s):  
O. Francino ◽  
D. Pérez ◽  
J. Viñes ◽  
R. Fonticoba ◽  
S. Madroñero ◽  
...  

ABSTRACT We have de novo assembled and polished 61 Staphylococcus pseudintermedius genome sequences with Nanopore-only long reads. Completeness was 99.25%. The average genome size was 2.70 Mbp, comprising 2,506 coding sequences, 19 complete rRNAs, 56 to 59 tRNAs, and 4 noncoding RNAs (ncRNAs), as well as CRISPR arrays.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4588 ◽  
Author(s):  
Märt Roosaare ◽  
Mikk Puustusmaa ◽  
Märt Möls ◽  
Mihkel Vaher ◽  
Maido Remm

Background: Plasmids play an important role in the dissemination of antibiotic resistance, making their detection an important task. Using whole genome sequencing (WGS), it is possible to capture both bacterial and plasmid sequence data, but short read lengths make plasmid detection a complex problem.
Results: We developed a tool named PlasmidSeeker that enables the detection of plasmids from bacterial WGS data without read assembly. The PlasmidSeeker algorithm is based on k-mers and uses k-mer abundance to distinguish between plasmid and bacterial sequences. We tested the performance of PlasmidSeeker on a set of simulated and real bacterial WGS samples, resulting in 100% sensitivity and 99.98% specificity.
Conclusion: PlasmidSeeker enables quick detection of known plasmids and complements existing tools that assemble plasmids de novo. The PlasmidSeeker source code is stored on GitHub: https://github.com/bioinfo-ut/PlasmidSeeker.
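How k-mer abundance can separate plasmid from chromosomal sequence may be illustrated with a toy example. This is not the PlasmidSeeker algorithm: the abstract does not give its details, so the median-abundance comparison, the function names `count_kmers` and `median_abundance`, and the tiny sequences are assumptions. Reverse complements and sequencing errors are ignored for brevity.

```python
from collections import Counter
from statistics import median
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in the reads, without assembly."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def median_abundance(reference: str, read_kmers: Counter, k: int) -> float:
    """Median read-derived count over the k-mers of a reference sequence.
    K-mers absent from the reads contribute a count of 0."""
    values = [read_kmers[reference[i:i + k]]
              for i in range(len(reference) - k + 1)]
    return median(values) if values else 0


# Hypothetical toy sequences: the plasmid is present at three copies
# per chromosome copy in the simulated read set.
chromosome = "ACGTTGCA"
plasmid = "GGATCCTA"
reads = [chromosome] + [plasmid] * 3

k = 4
read_kmers = count_kmers(reads, k)
chrom_cov = median_abundance(chromosome, read_kmers, k)
plasmid_cov = median_abundance(plasmid, read_kmers, k)
print(plasmid_cov / chrom_cov)
```

In this toy setting a known plasmid whose k-mers are abundant relative to the chromosomal baseline looks like a separate, multi-copy replicon, whereas a median near zero would suggest the plasmid is absent from the sample.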

