MaGuS: a tool for map-guided scaffolding and quality assessment of genome assemblies

2015 ◽  
Author(s):  
Mohammed-Amin Madoui ◽  
Carole Dossat ◽  
Leo d'Agata ◽  
Edwin van der Vossen ◽  
Jan van Oeveren ◽  
...  

Background: Scaffolding is a crucial step in the genome assembly process. Current methods based on large fragment paired-end reads or long reads allow an increase in continuity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and allowing scaffold ordering and anchoring. Results: We present MaGuS (map-guided scaffolding), a modular tool that uses a draft genome assembly, a genome map, and high-throughput paired-end sequencing data to estimate the quality and to enhance the continuity of an assembly. We generated several assemblies of the Arabidopsis genome using different scaffolding programs and applied MaGuS to select the best assembly using quality metrics. Then, we used MaGuS to perform map-guided scaffolding to increase continuity by creating new scaffold links in low-coverage and highly repetitive regions where other commonly used scaffolding methods lack consistency. Conclusions: MaGuS is a powerful reference-free evaluator of assembly quality and a map-guided scaffolder that is freely available at https://github.com/institut-de-genomique/MaGuS. Its use can be extended to other high-throughput sequencing data (e.g., long-read data) and also to other map data (e.g., genetic maps) to improve the quality and the continuity of large and complex genome assemblies.

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Kyle Fletcher ◽  
Lin Zhang ◽  
Juliana Gil ◽  
Rongkui Han ◽  
Keri Cavanaugh ◽  
...  

Our assembly-free linkage analysis pipeline (AFLAP) identifies segregating markers as k-mers in the raw reads without using a reference genome assembly for calling variants and provides genotype tables for the construction of unbiased, high-density genetic maps without a genome assembly. AFLAP is validated and contrasted to a conventional workflow using simulated data. AFLAP is applied to whole genome sequencing and genotype-by-sequencing data of F1, F2, and recombinant inbred populations of two different plant species, producing genetic maps that are concordant with genome assemblies. The AFLAP-based genetic map for Bremia lactucae enables the production of a chromosome-scale genome assembly.
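
The idea of treating k-mers as segregating markers can be illustrated with a minimal sketch. This is not AFLAP's implementation: the function names, the toy reads, and the choice to ignore reverse complements and k-mer abundance are all assumptions made for brevity. It keeps the k-mers found in one parent but not the other, then scores each progeny for presence (1) or absence (0) of every marker.

```python
def kmers(seq, k):
    """Return the set of k-mers in one sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_set(reads, k):
    """Union of the k-mers found in a collection of reads."""
    found = set()
    for read in reads:
        found |= kmers(read, k)
    return found


def genotype_table(parent_a, parent_b, progeny, k):
    """Score progeny for k-mers present in parent A and absent from parent B.

    Hypothetical helper: returns the sorted marker list and, per progeny,
    a list of 1/0 presence calls in the same order as the markers.
    """
    markers = sorted(kmer_set(parent_a, k) - kmer_set(parent_b, k))
    table = {}
    for name, reads in progeny.items():
        present = kmer_set(reads, k)
        table[name] = [1 if marker in present else 0 for marker in markers]
    return markers, table


# Toy data: the parents differ at a single base.
markers, table = genotype_table(
    parent_a=["ACGTAC"],
    parent_b=["ACGTTC"],
    progeny={"p1": ["CGTAC"], "p2": ["ACGTT"]},
    k=4,
)
```

A real pipeline would also need canonical k-mers, abundance thresholds to discard sequencing errors, and much longer k.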


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1839 ◽  
Author(s):  
Tom O. Delmont ◽  
A. Murat Eren

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.
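
Two of the per-scaffold signals mentioned above, sequence composition and k-mer frequencies, are simple to compute. The sketch below is an illustration only and does not reproduce the authors' workflow; the function names are hypothetical, and the choice of k = 4 (tetranucleotides) is an assumption.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq, k=4):
    """Relative frequency of each k-mer, skipping windows with ambiguous bases."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

Scaffolds from the same genome tend to share similar composition profiles and coverage, so vectors like these, combined with mean read coverage per scaffold, are typical inputs for clustering scaffolds into target and contaminant groups.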


2018 ◽  
Author(s):  
Timothy P. Bilton ◽  
Matthew R. Schofield ◽  
Michael A. Black ◽  
David Chagné ◽  
Phillip L. Wilcox ◽  
...  

Next generation sequencing is an efficient method that allows for substantially more markers than previous technologies, providing opportunities for building high density genetic linkage maps, which facilitate the development of non-model species’ genomic assemblies and the investigation of their genes. However, constructing genetic maps using data generated via high-throughput sequencing technology (e.g., genotyping-by-sequencing) is complicated by the presence of sequencing errors and genotyping errors resulting from missing parental alleles due to low sequencing depth. If unaccounted for, these errors lead to inflated genetic maps. In addition, map construction in many species is performed using full-sib family populations derived from the outcrossing of two individuals, where unknown parental phase and varying segregation types further complicate construction. We present a new methodology for modeling low coverage sequencing data in the construction of genetic linkage maps using full-sib populations of diploid species, implemented in a package called GUSMap. Our model is based on an extension of the Lander-Green hidden Markov model that accounts for errors present in sequencing data. Results show that GUSMap was able to give accurate estimates of the recombination fractions and overall map distance, while most existing mapping packages produced inflated genetic maps in the presence of errors. Our results demonstrate the feasibility of using low coverage sequencing data to produce genetic maps without requiring extensive filtering of potentially erroneous genotypes, provided that the associated errors are correctly accounted for in the model.
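
Why low depth produces genotyping errors can be shown with a short calculation. The sketch below is not GUSMap's model: it assumes a diploid heterozygote whose two alleles are sampled with equal probability, no sequencing error, and hypothetical function names. The Haldane mapping function is included only as one common way to turn a recombination fraction into a map distance, to show how overestimated recombination fractions inflate a map.

```python
import math


def prob_het_seen_as_hom(depth):
    """Chance that all reads at a heterozygous site carry the same allele.

    Each read samples one of the two alleles with probability 1/2, so the
    probability of seeing only one allele in `depth` reads is 2 * (1/2)**depth.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5 ** depth


def haldane_cm(r):
    """Map distance in centimorgans for a recombination fraction r (0 <= r < 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# At depth 2, half of all heterozygotes look homozygous; at depth 4, one in eight.
for depth in (1, 2, 4, 8):
    print(depth, prob_het_seen_as_hom(depth))
```

Miscalled genotypes look like extra recombination events between neighbouring markers, which is why a model that ignores them tends to report longer maps.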


2015 ◽  
Author(s):  
John Davey ◽  
Mathieu Chouteau ◽  
Sarah L. Barker ◽  
Luana Maroja ◽  
Simon W. Baxter ◽  
...  

The Heliconius butterflies are a widely studied adaptive radiation of 46 species spread across Central and South America, several of which are known to hybridise in the wild. Here, we present a substantially improved assembly of the Heliconius melpomene genome, developed using novel methods that should be applicable to improving other genome assemblies produced using short read sequencing. Firstly, we whole genome sequenced a pedigree to produce a linkage map incorporating 99% of the genome. Secondly, we incorporated haplotype scaffolds extensively to produce a more complete haploid version of the draft genome. Thirdly, we incorporated ~20x coverage of Pacific Biosciences sequencing and scaffolded the haploid genome using an assembly of this long read sequence. These improvements result in a genome of 795 scaffolds, 275 Mb in length, with an N50 length of 2.1 Mb, an N50 number of 34, and with 99% of the genome placed and 84% anchored on chromosomes. We use the new genome assembly to confirm that the Heliconius genome underwent 10 chromosome fusions since the split with its sister genus Eueides, over a period of about 6 million years.
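
The two contiguity statistics quoted above are the scaffold length at which half of the assembly is covered (2.1 Mb) and the number of scaffolds needed to reach that point (34). A minimal sketch of how both are computed follows, using the common naming in which N50 is the length and L50 is the count; the function name and the toy lengths are illustrative.

```python
def n50_l50(lengths):
    """Return (N50 length, L50 count) for a list of scaffold lengths.

    Scaffolds are taken from longest to shortest until their cumulative
    length reaches half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy assembly of 30 units: the two longest scaffolds (10 + 8) pass the halfway mark.
print(n50_l50([2, 4, 6, 8, 10]))
```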


Genetics ◽  
2018 ◽  
Vol 209 (1) ◽  
pp. 65-76 ◽  
Author(s):  
Timothy P. Bilton ◽  
Matthew R. Schofield ◽  
Michael A. Black ◽  
David Chagné ◽  
Phillip L. Wilcox ◽  
...  

2016 ◽  
Author(s):  
Tom O Delmont ◽  
A. Murat Eren

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini using approaches routinely employed by microbial ecologists who reconstruct bacterial and archaeal genomes from metagenomic data. We created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.


GigaScience ◽  
2020 ◽  
Vol 9 (3) ◽  
Author(s):  
Shuang Jiang ◽  
Haishan An ◽  
Fangjie Xu ◽  
Xueying Zhang

Background: The loquat (Eriobotrya japonica) is a species of flowering plant in the family Rosaceae that is widely cultivated in Asian, European, and African countries. It blossoms in the winter and ripens in the early summer. The genome of loquat has to date not been published, which limits the study of molecular biology in this cultivated species. Here, we used the third-generation sequencing technology of Nanopore and Hi-C technology to sequence the genome of Eriobotrya. Findings: We generated 100.10 Gb of long reads using Oxford Nanopore sequencing technologies. Three types of Illumina high-throughput sequencing data, including genome short reads (47.42 Gb), transcriptome short reads (11.06 Gb), and Hi-C short reads (67.25 Gb), were also generated to help construct the loquat genome. All data were assembled into a 760.1-Mb genome assembly. The contigs were mapped to chromosomes by using Hi-C technology based on the contacts between contigs, and then a genome was assembled exhibiting 17 chromosomes and a scaffold N50 length of 39.7 Mb. A total of 45,743 protein-coding genes were annotated in the Eriobotrya genome, and we investigated the phylogenetic relationships between the Eriobotrya and 6 other Rosaceae species. Eriobotrya shows a close relationship with Malus and Pyrus, with the divergence time of Eriobotrya and Malus being 6.76 million years ago. Furthermore, chromosome rearrangement was found in Eriobotrya and Malus. Conclusions: We constructed the first high-quality chromosome-level Eriobotrya genome using Illumina, Nanopore, and Hi-C technologies. This work provides a valuable reference genome for molecular studies of the loquat and provides new insight into chromosome evolution in this species.


Author(s):  
Zhijun Tong ◽  
Sanjie Jiang ◽  
Weiming He ◽  
Xuejun Chen ◽  
Lixin Yin ◽  
...  

Backcrossing is a powerful tool for plant breeding. Improved marker-assisted backcrossing aims to transfer targeted genes or quantitative trait loci (QTLs) of interest from a donor parent into a recurrent parent. In this study, a tobacco BC4F3 population was generated using Y3 and K326 as hybrid parents and YF1-1 as F1 parents. High-throughput sequencing data of 381 pedigree populations were used to construct high-density genetic maps containing 24,142 high-quality single nucleotide polymorphism (SNP) markers with an average genetic distance of 0.59 cM. A genome module analysis was then performed for all the offspring. A total of forty-three candidate QTLs for six agronomic traits were identified. This study provides original biomarkers for tobacco breeding and offers clues for prospective backcrossing applications in other plants.


2020 ◽  
Author(s):  
Markus Hiltunen ◽  
Martin Ryberg ◽  
Hanna Johannesson

10X Genomics Chromium linked reads contain information that can be used to link sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences together with a gap between them, not considering potential contig overlaps. Such overlaps can be particularly prominent in genome drafts assembled from long-read sequencing data where an overlap-layout-consensus (OLC) algorithm has been used. Ignoring overlapping contig ends may result in genes and other features being incomplete or fragmented in the resulting scaffolds. We developed the application ARBitR to generate scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs. We tested the performance of ARBitR on three published and simulated datasets and compared it to the previously published tools ARCS and ARKS. The results revealed that ARBitR performed similarly in terms of contiguity statistics, and the advantage of the overlapping step was revealed by fewer long and short variants in ARBitR-produced scaffolds, in addition to a higher proportion of completely assembled LTR retrotransposons. We expect ARBitR to have broad applicability in genome assembly projects that utilize 10X Chromium linked reads. Availability and implementation: ARBitR is written and implemented in Python3 for Unix-like operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License. Supplementary information is available online.
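
The difference between joining two linked contigs with a gap and merging them at an overlap can be sketched in a few lines. This is an illustration of the general idea only, not ARBitR's code: it assumes exact suffix-prefix matches, whereas a real OLC step has to align ends that contain sequencing errors. The function names and the `min_overlap` and `gap_size` defaults are hypothetical.

```python
def suffix_prefix_overlap(left, right, min_overlap=3):
    """Length of the longest exact suffix of `left` that is a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def join_contigs(left, right, min_overlap=3, gap_size=10):
    """Merge two linked contigs at their overlap, or pad with N if none is found."""
    size = suffix_prefix_overlap(left, right, min_overlap)
    if size:
        return left + right[size:]          # overlap resolved, no gap introduced
    return left + "N" * gap_size + right    # fall back to a gapped join


print(join_contigs("ACGTTGCA", "TGCAGGT"))        # contigs share the end "TGCA"
print(join_contigs("AAAA", "CCCC", gap_size=2))   # no overlap, so a gap is inserted
```

Merging at the overlap avoids duplicating the shared bases on either side of a run of N, which is how a gapped join can leave a feature spanning the junction fragmented.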

