RaGOO: fast and accurate reference-guided scaffolding of draft genomes

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Michael Alonge ◽  
Sebastian Soyk ◽  
Srividya Ramakrishnan ◽  
Xingang Wang ◽  
Sara Goodwin ◽  
...  

Abstract We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.

2019 ◽  
Author(s):  
Michael Alonge ◽  
Sebastian Soyk ◽  
Srividya Ramakrishnan ◽  
Xingang Wang ◽  
Sara Goodwin ◽  
...  

Abstract
Background: As the number of new genome assemblies continues to grow, there is increasing demand for methods to coalesce contigs from draft assemblies into pseudomolecules. Most current methods use genetic maps, optical maps, chromatin conformation (Hi-C), or other long-range linking data; however, these data are expensive, and analysis methods often fail to accurately order and orient a high percentage of assembly contigs. Other approaches utilize alignments to a reference genome for ordering and orienting; however, these tools rely on slow aligners and are not robust to repetitive contigs.
Results: We present RaGOO, an open-source reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in just minutes. With the pseudomolecules constructed, RaGOO identifies structural variants, including those spanning sequencing gaps that are not reported by alternative methods. We show that RaGOO accurately orders and orients contigs into nearly complete chromosomes based on de novo assemblies of Oxford Nanopore long-read sequencing from three wild and domesticated tomato genotypes, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open-source with an MIT license at https://github.com/malonge/RaGOO.
Conclusions: We demonstrate that with a highly contiguous assembly and a structurally accurate reference genome, reference-guided scaffolding with RaGOO outperforms error-prone reference-free methods and enables rapid pan-genome analysis.
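The idea of ordering and orienting contigs from reference alignments can be illustrated with a minimal Python sketch. This is not RaGOO's implementation: the `Aln` record, the `order_and_orient` function, and the specific rules (assign each contig to the chromosome it covers most, orient by majority strand, order by the start of the longest alignment) are assumptions chosen for illustration, and the alignments are assumed to have been parsed already from an aligner such as Minimap2.

```python
from collections import defaultdict
from typing import NamedTuple


class Aln(NamedTuple):
    """One contig-to-reference alignment (hypothetical, simplified record)."""
    contig: str
    ref_chrom: str
    ref_start: int
    ref_end: int
    strand: str  # "+" or "-"


def order_and_orient(alignments):
    """Return {chromosome: [(contig, strand), ...]} in reference order."""
    by_contig = defaultdict(list)
    for aln in alignments:
        by_contig[aln.contig].append(aln)

    placements = defaultdict(list)
    for contig, alns in by_contig.items():
        # Assign the contig to the chromosome with the most aligned bases.
        coverage = defaultdict(int)
        for aln in alns:
            coverage[aln.ref_chrom] += aln.ref_end - aln.ref_start
        chrom = max(coverage, key=coverage.get)
        on_chrom = [aln for aln in alns if aln.ref_chrom == chrom]

        # Orient by the strand that carries the most aligned bases.
        strand_bases = defaultdict(int)
        for aln in on_chrom:
            strand_bases[aln.strand] += aln.ref_end - aln.ref_start
        strand = max(strand_bases, key=strand_bases.get)

        # Position the contig by the start of its longest alignment.
        longest = max(on_chrom, key=lambda a: a.ref_end - a.ref_start)
        placements[chrom].append((longest.ref_start, contig, strand))

    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(items)]
        for chrom, items in placements.items()
    }


example = [
    Aln("c1", "chr1", 5000, 9000, "-"),
    Aln("c1", "chr2", 0, 500, "+"),
    Aln("c2", "chr1", 100, 4000, "+"),
    Aln("c2", "chr1", 4000, 4300, "-"),
    Aln("c3", "chr2", 1000, 3000, "+"),
]
layout = order_and_orient(example)
```

A real tool would also need to filter repetitive or low-quality alignments and insert gap padding between contigs; those steps are omitted here.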


2016 ◽  
Author(s):  
Andrew J. Page ◽  
Nishadi De Silva ◽  
Martin Hunt ◽  
Michael A. Quail ◽  
Julian Parkhill ◽  
...  

ABSTRACT
The rapidly reducing cost of bacterial genome sequencing has led to its routine use in large-scale microbial analysis. Though mapping approaches can be used to find differences relative to the reference, many bacteria are subject to constant evolutionary pressures resulting in events such as the loss and gain of mobile genetic elements, horizontal gene transfer through recombination and genomic rearrangements. De novo assembly is the reconstruction of the underlying genome sequence, an essential step to understanding bacterial genome diversity. Here we present a high-throughput bacterial assembly and improvement pipeline that has been used to generate nearly 20,000 draft genome assemblies in public databases. We demonstrate its performance on a public data set of 9,404 genomes. We find that all the genes used in MLST schema are present in 99.6% of assembled genomes. When tested on low, neutral and high GC organisms, more than 94% of genes were present and completely intact. The pipeline has proven to be scalable and robust with a wide variety of datasets without requiring human intervention. All of the software is available on GitHub under the GNU GPL open source license.
DATA SUMMARY
The assembly pipeline software is available from GitHub under the GNU GPL open source license (https://github.com/sanger-pathogens/vr-codebase).
The assembly improvement software is available from GitHub under the GNU GPL open source license (https://github.com/sanger-pathogens/assembly_improvement).
Accession numbers for 9,404 assemblies are provided in the supplementary material.
The Bordetella pertussis sample has sample accession ERS1058649, sequencing reads accession ERR1274624 and assembly accessions FJMX01000001-FJMX01000249.
The Salmonella enterica subsp. enterica serovar Pullorum sample has sample accession ERS1058652, sequencing reads accession ERR1274625 and assembly accessions FJMV01000001-FJMV01000026.
The Staphylococcus aureus sample has sample accession ERS1058648, sequencing reads accession ERR1274626 and assembly accessions FJMW01000001-FJMW01000040.
We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
IMPACT STATEMENT
The pipeline described in this paper has been used to assemble and annotate 30% of all bacterial genome assemblies in GenBank (18,080 out of 59,536, accessed 16/2/16). The automated generation of de novo assemblies is a critical step to explore bacterial genome diversity. MLST genes are found in 99.6% of cases, making it at least as good as existing typing methods. In the test genomes we present, more than 94% of genes are correctly assembled into intact reading frames.


2014 ◽  
Vol 32 (10) ◽  
pp. 1045-1052 ◽  
Author(s):  
Ying-hui Li ◽  
Guangyu Zhou ◽  
Jianxin Ma ◽  
Wenkai Jiang ◽  
Long-guo Jin ◽  
...  

Author(s):  
Arang Rhie ◽  
Brian P. Walenz ◽  
Sergey Koren ◽  
Adam M. Phillippy

Abstract
Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
Availability of data and material
Project name: Merqury
Project home page: https://github.com/marbl/merqury, https://github.com/marbl/meryl
Archived version: https://github.com/marbl/merqury/releases/tag/v1.0
Operating system(s): Platform independent
Programming language: C++, Java, Perl
Other requirements: gcc 4.8 or higher, java 1.6 or higher
License: Public domain (see https://github.com/marbl/merqury/blob/master/README.license)
Any restrictions to use by non-academics: No restrictions applied
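The k-mer set comparison described above can be sketched in a few lines of Python. This is a toy illustration, not Merqury's code: the function names are hypothetical, k-mer counts and the filtering of erroneous read k-mers are ignored, and the quality value assumes a simple model in which a k-mer is correct only if all k of its bases are correct.

```python
from math import inf, log10

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Set of canonical k-mers: the smaller of each k-mer and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in "ACGT" for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def kmer_report(assembly_seqs, read_seqs, k):
    """Compare assembly k-mers with read k-mers using plain set operations."""
    asm = set()
    for seq in assembly_seqs:
        asm |= canonical_kmers(seq, k)
    reads = set()
    for seq in read_seqs:
        reads |= canonical_kmers(seq, k)

    shared = asm & reads    # assembly k-mers supported by the reads
    asm_only = asm - reads  # assembly k-mers absent from the reads (likely errors)

    # Completeness: fraction of read k-mers recovered in the assembly.
    completeness = len(shared) / len(reads) if reads else 0.0

    # Assumed model: P(base correct) = (shared / assembly k-mers) ** (1 / k).
    p_correct = (len(shared) / len(asm)) ** (1 / k) if asm else 0.0
    error = 1 - p_correct
    qv = inf if error == 0 else -10 * log10(error)

    return {"completeness": completeness, "asm_only": len(asm_only), "qv": qv}


perfect = kmer_report(["ACGTACGTAA"], ["ACGTACGTAA"], k=3)
flawed = kmer_report(["AAAAC"], ["AAAAG"], k=3)
```

On real data the k-mer sets are far too large for in-memory Python sets, which is why dedicated k-mer counters are used in practice.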


Author(s):  
R. Zhang ◽  
M. Mirdita ◽  
E. Levy Karin ◽  
C. Norroy ◽  
C. Galiez ◽  
...  

Summary
SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes.
Availability and implementation
SpacePHARER is available as open-source (GPLv3), user-friendly command-line software for Linux and macOS at spacepharer.soedinglab.org.


2015 ◽  
Author(s):  
Alejandro Hernandez Wences ◽  
Michael Schatz

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.


2020 ◽  
Vol 6 (12) ◽  
Author(s):  
Lin Zhao ◽  
Hongyou Chen ◽  
Xavier Didelot ◽  
Zhenpeng Li ◽  
Yinghui Li ◽  
...  

Vibrio parahaemolyticus is an important cause of foodborne gastroenteritis globally. Thermostable direct haemolysin (TDH) and the TDH-related haemolysin are the two key virulence factors in V. parahaemolyticus. Vibrio pathogenicity islands harbour the genes encoding these two haemolysins. The serotyping of V. parahaemolyticus is based on the combination of O and K antigens. Frequent recombination has been observed in V. parahaemolyticus, including in the genomic regions encoding the O and K antigens. V. parahaemolyticus serotype O4:K12 has caused gastroenteritis outbreaks in the USA and Spain. Recently, outbreaks caused by this serotype of V. parahaemolyticus have been reported in China. However, the relationships among this serotype of V. parahaemolyticus strains isolated in different regions have not been addressed. Here, we investigated the genome variation of the V. parahaemolyticus serotype O4:K12 using the whole-genome sequences of 29 isolates. We determined five distinct lineages in this strain collection. We observed frequent recombination among different lineages. In contrast, little recombination was observed within each individual lineage. We showed that the lineage of this serotype of V. parahaemolyticus isolated in America was different from those isolated in Asia and identified genes that exclusively existed in the strains isolated in America. Pan-genome analysis showed that strain-specific and cluster-specific genes were mostly located in the genomic islands. Pan-genome analysis also showed that the vast majority of the accessory genes in the O4:K12 serotype of V. parahaemolyticus were acquired from within the genus Vibrio. Hence, we have shown that multiple distinct lineages exist in V. parahaemolyticus serotype O4:K12 and have provided more evidence about the gene segregation found in V. parahaemolyticus isolated in different continents.


2021 ◽  
Author(s):  
Bo Wang ◽  
Yinping Jiao ◽  
Kapeel Chougule ◽  
Andrew Olson ◽  
Jian Huang ◽  
...  

ABSTRACT
Sorghum bicolor, one of the most important grass crops around the world, harbors a high degree of genetic diversity. We constructed chromosome-level genome assemblies for two important sorghum inbred lines, Tx2783 and RTx436. The final high-quality reference assemblies consist of 19 and 18 scaffolds, respectively, with contig N50 values of 25.6 and 20.3 Mb. Genes were annotated using evidence-based and de novo gene predictors, and RAMPAGE data demonstrate that transcription start sites were effectively captured. Together with other public sorghum genomes, BTx623, RTx430, and Rio, extensive structural variations (SVs) of various sizes were characterized using Tx2783 as a reference. Genome-wide scanning for disease resistance (R) genes revealed high levels of diversity among these five sorghum accessions. To characterize sugarcane aphid (SCA) resistance in Tx2783, we mapped the resistance region on chromosome 6 using a recombinant inbred line (RIL) population and found an SV of 191 kb containing a cluster of R genes in Tx2783. Using Tx2783 as a backbone, along with the SVs, we constructed a pan-genome to support alignment of resequencing data from 62 sorghum accessions, and then identified core and dispensable genes using this population. This study provides the first overview of the extent of genomic structural variations and R genes in the sorghum population, and reveals potential targets for breeding of SCA resistance.
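For readers unfamiliar with the contig N50 statistic quoted above, the following Python sketch shows the standard definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative and do not come from the sorghum assemblies.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


toy_contigs = [2, 3, 4, 5, 6]  # total 20; 6 + 5 = 11 covers half
toy_n50 = n50(toy_contigs)
```

A higher N50 means that most of the assembly sits in long contigs, which is why it is commonly reported as a contiguity measure.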


2017 ◽  
Author(s):  
Daniel Valenzuela ◽  
Veli Mäkinen

Abstract
Recently the topic of computational pan-genomics has gained increasing attention, and particularly the problem of moving from a single-reference paradigm to a pan-genomic one. Perhaps the simplest way to represent a pan-genome is as a set of sequences. While indexing highly repetitive collections has been intensively studied in the computer science community, the research has focused on efficient indexing and exact pattern matching, making most solutions not yet suitable to be used in bioinformatic analysis pipelines.
Results: We present CHIC, a short-read aligner that indexes very large and repetitive references using a hybrid technique that combines Lempel-Ziv compression with Burrows-Wheeler read aligners.
Availability: Our tool is open source and available online at https://gitlab.com/dvalenzu/CHIC


2019 ◽  
Author(s):  
Shujun Ou ◽  
Weija Su ◽  
Yi Liao ◽  
Kapeel Chougule ◽  
Doreen Ware ◽  
...  

Abstract
Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and allow for annotation of TEs. There are numerous methods for each class of elements with unknown relative performance metrics. We benchmarked existing programs based on a curated library of rice TEs. Using the most robust programs, we created a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a condensed TE library for annotations of structurally intact and fragmented elements. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.

