Quinoa genome assembly employing genomic variation for guided scaffolding

Key message: We propose to use the natural variation between individuals of a population for genome assembly scaffolding. In today’s genome projects, multiple accessions are sequenced, leading to variant catalogs. Using such information to improve genome assemblies is attractive both in cost and scientifically, because the value of an assembly increases with its contiguity. We conclude that haplotype information is a valuable resource to group and order contigs toward the generation of pseudomolecules.

Abstract: Quinoa (Chenopodium quinoa) has been under cultivation in Latin America for more than 7500 years. Recently, quinoa has gained increasing attention due to its stress resistance and its nutritional value. We generated a novel quinoa genome assembly for the Bolivian accession CHEN125 using PacBio long-read sequencing data (assembly size 1.32 Gbp, initial N50 size 608 kbp). Next, we re-sequenced 50 quinoa accessions from Peru and Bolivia. This set of accessions differed at 4.4 million single-nucleotide variant (SNV) positions compared to CHEN125 (1.4 million SNV positions on average per accession). We show how to exploit variation in accessions that are distantly related to establish a genome-wide ordered set of contigs for guided scaffolding of a reference assembly. The method is based on detecting shared haplotypes and their expected continuity throughout the genome (i.e., the effect of linkage disequilibrium), as an extension of what is expected in mapping populations where only a few haplotypes are present. We test the approach using Arabidopsis thaliana data from different populations. After applying the method to our CHEN125 quinoa assembly, we validated the results with mate-pairs, genetic markers, and another quinoa assembly originating from a Chilean cultivar. We show consistency between these information sources and the haplotype-based relations as determined by us and obtain an improved assembly with an N50 size of 1079 kbp and ordered contig groups of up to 39.7 Mbp. We conclude that haplotype information in distantly related individuals of the same species is a valuable resource to group and order contigs according to their adjacency in the genome toward the generation of pseudomolecules.
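The N50 sizes quoted in this abstract follow the usual contiguity definition: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly size. A minimal sketch of that calculation is given here for orientation; the function name `n50_l50` and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50: length of the contig at which the cumulative sum, taken from the
         longest contig downward, first reaches half of the total length.
    L50: number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


# Toy contig lengths (hypothetical, arbitrary units).
contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(contigs)
print(n50, l50)
```

Joining contigs into longer scaffolds moves sequence into fewer, longer pieces, which is why scaffolding raises N50 without changing the total assembly size.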


AFLAP: assembly-free linkage analysis pipeline using k-mers from genome sequencing data

Genome Biology ◽

10.1186/s13059-021-02326-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Kyle Fletcher ◽

Lin Zhang ◽

Juliana Gil ◽

Rongkui Han ◽

Keri Cavanaugh ◽

...

Keyword(s):

Linkage Analysis ◽

Genome Sequencing ◽

Genome Assembly ◽

Simulated Data ◽

Genetic Maps ◽

Sequencing Data ◽

Analysis Pipeline ◽

A Genome ◽

Genotype By Sequencing ◽

Genome Assemblies

Abstract: Our assembly-free linkage analysis pipeline (AFLAP) identifies segregating markers as k-mers in the raw reads without using a reference genome assembly for calling variants and provides genotype tables for the construction of unbiased, high-density genetic maps without a genome assembly. AFLAP is validated and contrasted to a conventional workflow using simulated data. AFLAP is applied to whole genome sequencing and genotype-by-sequencing data of F1, F2, and recombinant inbred populations of two different plant species, producing genetic maps that are concordant with genome assemblies. The AFLAP-based genetic map for Bremia lactucae enables the production of a chromosome-scale genome assembly.
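The general idea of using k-mers as segregating markers can be illustrated with a toy sketch: k-mers seen in one parent but not the other are treated as candidate markers, and each progeny is scored for their presence or absence. This is not AFLAP's implementation. All function names are hypothetical, and the sketch ignores reverse complements, sequencing errors, and the coverage filtering a real pipeline would need.

```python
def kmers(seq, k):
    """All k-mers of a single sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def read_kmers(reads, k):
    """Union of k-mers over a collection of reads."""
    found = set()
    for read in reads:
        found |= kmers(read, k)
    return found


def parent_specific_markers(parent_a_reads, parent_b_reads, k):
    """k-mers present in parent A and absent from parent B (toy criterion)."""
    return read_kmers(parent_a_reads, k) - read_kmers(parent_b_reads, k)


def genotype_table(markers, progeny_reads, k):
    """Presence (1) or absence (0) of each marker in each progeny."""
    table = {}
    for name, reads in progeny_reads.items():
        present = read_kmers(reads, k)
        table[name] = {m: int(m in present) for m in sorted(markers)}
    return table


# Toy data (hypothetical reads, k chosen small for readability).
markers = parent_specific_markers(["ACGTAC"], ["ACGTTC"], k=4)
table = genotype_table(markers, {"p1": ["CGTACG"], "p2": ["ACGTTC"]}, k=4)
print(table)
```

A presence/absence table of this shape is the kind of input that linkage-mapping software can group and order without any reference assembly.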


Dense and accurate whole-chromosome haplotyping of individual genomes

10.1101/126136 ◽

2017 ◽

Cited By ~ 1

Author(s):

David Porubsky ◽

Shilpa Garg ◽

Ashley D. Sanders ◽

Jan O. Korbel ◽

Victor Guryev ◽

...

Keyword(s):

Target Genes ◽

Chromosome Length ◽

Single Individual ◽

Sequencing Data ◽

Individual Genome ◽

Sequencing Technologies ◽

Biological Phenomena ◽

Genome Wide ◽

A Genome ◽

Long Read

Abstract: The diploid nature of the genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. Many important biological phenomena such as compound heterozygosity and epistatic effects between enhancers and target genes, however, can only be studied when haplotype-resolved genomes are available. This lack of haplotype-level analyses can be explained by a dearth of methods to produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. Our experiments provide comprehensive guidance on favorable combinations of Strand-seq libraries and sequencing coverages to obtain complete and genome-wide haplotypes of a single individual genome (NA12878) at manageable costs. We were able to reliably assign > 95% of alleles to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different sequencing technologies represents an attractive solution to chart the unique genetic variation of diploid genomes.


Major improvements to the Heliconius melpomene genome assembly used to confirm 10 chromosome fusion events in 6 million years of butterfly evolution

10.1101/029199 ◽

2015 ◽

Author(s):

John Davey ◽

Mathieu Chouteau ◽

Sarah L. Barker ◽

Luana Maroja ◽

Simon W. Baxter ◽

...

Keyword(s):

Genome Assembly ◽

Draft Genome ◽

Chromosome Fusion ◽

Short Read Sequencing ◽

A Genome ◽

Heliconius Melpomene ◽

Long Read ◽

In The Wild ◽

Chromosome Fusions ◽

Genome Assemblies

The Heliconius butterflies are a widely studied adaptive radiation of 46 species spread across Central and South America, several of which are known to hybridise in the wild. Here, we present a substantially improved assembly of the Heliconius melpomene genome, developed using novel methods that should be applicable to improving other genome assemblies produced using short read sequencing. Firstly, we whole genome sequenced a pedigree to produce a linkage map incorporating 99% of the genome. Secondly, we incorporated haplotype scaffolds extensively to produce a more complete haploid version of the draft genome. Thirdly, we incorporated ~20x coverage of Pacific Biosciences sequencing and scaffolded the haploid genome using an assembly of this long read sequence. These improvements result in a genome of 795 scaffolds, 275 Mb in length, with an N50 length of 2.1 Mb, an N50 number of 34, and with 99% of the genome placed and 84% anchored on chromosomes. We use the new genome assembly to confirm that the Heliconius genome underwent 10 chromosome fusions since the split with its sister genus Eueides, over a period of about 6 million years.


DNAModAnnot: a R toolbox for DNA modification filtering and annotation

Bioinformatics ◽

10.1093/bioinformatics/btab032 ◽

2021 ◽

Author(s):

Alexis Hardy ◽

Mélody Matelot ◽

Amandine Touzeau ◽

Christophe Klopp ◽

Céline Lopez-Roques ◽

...

Keyword(s):

Global Analysis ◽

R Package ◽

Supplementary Information ◽

Dna Modification ◽

Paramecium Tetraurelia ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Dna Modifications ◽

Long Read

Abstract Motivation: Long-read sequencing technologies can be employed to detect and map DNA modifications at the nucleotide resolution on a genome-wide scale. However, published software packages neglect the integration of genomic annotation and comprehensive filtering when analyzing patterns of modified bases detected using Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) data. Here, we present DNAModAnnot, an R package designed for the global analysis of DNA modification patterns using adapted filtering and visualization tools. Results: We tested our package using PacBio sequencing data to analyze patterns of 6-methyladenine (6mA) in the ciliate Paramecium tetraurelia, in which high 6mA amounts were previously reported. We found the genome-wide distribution of 6mA in Paramecium tetraurelia to be similar to that of other ciliates. We also performed 5-methylcytosine (5mC) analysis in human lymphoblastoid cells using ONT data and confirmed previously known patterns of 5mC. DNAModAnnot provides a toolbox for the genome-wide analysis of different DNA modifications using PacBio and ONT long-read sequencing data. Availability: DNAModAnnot is distributed as an R package available via GitHub (https://github.com/AlexisHardy/DNAModAnnot). Supplementary information: Supplementary data are available at Bioinformatics online.


Pathways to polar adaptation in fishes revealed by long-read sequencing

10.1101/2021.11.12.468413 ◽

2021 ◽

Author(s):

Scott Hotaling ◽

Thomas Desvignes ◽

John S. Sproul ◽

Luana S.F. Lins ◽

Joanna L Kelley

Keyword(s):

Genome Wide ◽

A Genome ◽

Cebidichthys Violaceus ◽

Long Read ◽

Wide Perspective ◽

Element Activity ◽

Genomic Regions ◽

Genome Assemblies ◽

Response To Environmental Change ◽

Globally Distributed

Long-read sequencing is driving a new reality for genome science where highly contiguous assemblies can be produced efficiently with modest resources. Genome assemblies from long-read sequencing are particularly exciting for understanding the evolution of complex genomic regions that are often difficult to assemble. In this study, we leveraged long-read sequencing to generate a high-quality genome assembly for an Antarctic eelpout, Ophthalmolycus amberensis, the first for the globally distributed family Zoarcidae. We used this assembly to understand how O. amberensis has adapted to the harsh Southern Ocean and compared it to another group of Antarctic fishes: the notothenioids. We showed that from a genome-wide perspective, selection has largely acted on different targets in eelpouts relative to notothenioids. However, we did find some overlap; in both groups, selection has acted on genes involved in membrane structure and DNA repair. We found evidence for historical shifts of transposable element activity in O. amberensis and other polar fishes, perhaps reflecting a response to environmental change. We were specifically interested in the evolution of two complex genomic regions known to underlie key adaptations to polar seas: hemoglobin and antifreeze proteins (AFPs). We observed unique evolution of the hemoglobin MN cluster in eelpouts and related fishes in the suborder Zoarcoidei relative to other teleosts. For AFPs, we identified the first species in the suborder with no evidence of afpIII sequences (Cebidichthys violaceus), potentially reflecting a lineage-specific loss of this gene cluster. Beyond polar fishes, our results highlight the power of long-read sequencing to understand genome evolution.


Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population

BMC Medical Genomics ◽

10.1186/s12920-020-00853-3 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Satomi Mitsuhashi ◽

Martin C. Frith ◽

Naomichi Matsumoto

Keyword(s):

General Population ◽

Tandem Repeats ◽

Repeat Unit ◽

Mendelian Disease ◽

Length Variation ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Wide Survey

Abstract Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5′UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with near-by GWAS SNP genotypes. Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long read sequencing data. This method, especially if used in genome-wide association studies, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.
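To make the notion of a tandem repeat copy number change concrete, the following toy sketch counts the longest uninterrupted run of a repeat unit in a sequence and compares a read against a reference. It uses exact string matching only and is not the method used in the study, which works from long-read alignments. The function names and sequences are illustrative assumptions.

```python
def max_tandem_copies(seq, unit):
    """Largest number of back-to-back exact copies of `unit` found in `seq`."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    n = len(unit)
    best = 0
    for start in range(len(seq) - n + 1):
        copies = 0
        pos = start
        # Extend the run while the next window equals the repeat unit.
        while seq[pos:pos + n] == unit:
            copies += 1
            pos += n
        best = max(best, copies)
    return best


def copy_number_change(reference, read, unit):
    """Copies in the read minus copies in the reference (positive = expansion)."""
    return max_tandem_copies(read, unit) - max_tandem_copies(reference, unit)


# Toy sequences (hypothetical): a CAG repeat flanked by short unique sequence.
reference = "TT" + "CAG" * 5 + "GG"
read = "TT" + "CAG" * 8 + "GG"
print(copy_number_change(reference, read, "CAG"))
```

Collecting such differences across many individuals at one locus gives the per-locus length variation that the abstract compares between disease-associated repeats and other repeats.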


An Improved Genome Assembly of the European Aspen Populus tremula

10.1101/805614 ◽

2019 ◽

Cited By ~ 1

Author(s):

Bastian Schiffthaler ◽

Nicolas Delhomme ◽

Carolina Bernhardsson ◽

Jerry Jenkins ◽

Stefan Jansson ◽

...

Keyword(s):

Genome Assembly ◽

Association Studies ◽

Genomic Variation ◽

Populus Tremula ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Short Read ◽

Genome Wide ◽

European Aspen ◽

Long Read

Abstract: The genome assembly of the European aspen Populus tremula proved difficult for a short-read based strategy due to high genomic variation. As a consequence, the fragmented sequence is impeding studies that benefit from highly contiguous data, particularly genome-wide association studies (GWAS) and comparative genomics. Here we present an updated assembly based on long-read sequences, optical mapping and genetic mapping. This assembly - henceforth referred to as Potra V2 - is assembled into 19 contiguous chromosomes, which provides a powerful tool for future association studies. The genome sequence and any feature files are available from the PopGenIE resource.


MaGuS: a tool for map-guided scaffolding and quality assessment of genome assemblies

10.1101/032045 ◽

2015 ◽

Author(s):

Mohammed-Amin Madoui ◽

Carole Dossat ◽

Leo d'Agata ◽

Edwin van der Vossen ◽

Jan van Oeveren ◽

...

Keyword(s):

High Throughput ◽

Genome Assembly ◽

High Throughput Sequencing ◽

Draft Genome ◽

Genetic Maps ◽

Sequencing Data ◽

A Genome ◽

Genome Map ◽

Genome Assemblies ◽

Complex Genome

Background: Scaffolding is a crucial step in the genome assembly process. Current methods based on large fragment paired-end reads or long reads allow an increase in continuity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and allowing scaffold ordering and anchoring. Results: We present MaGuS (map-guided scaffolding), a modular tool that uses a draft genome assembly, a genome map, and high-throughput paired-end sequencing data to estimate the quality and to enhance the continuity of an assembly. We generated several assemblies of the Arabidopsis genome using different scaffolding programs and applied MaGuS to select the best assembly using quality metrics. Then, we used MaGuS to perform map-guided scaffolding to increase continuity by creating new scaffold links in low-covered and highly repetitive regions where other commonly used scaffolding methods lack consistency. Conclusions: MaGuS is a powerful reference-free evaluator of assembly quality and a map-guided scaffolder that is freely available at https://github.com/institut-de-genomique/MaGuS. Its use can be extended to other high-throughput sequencing data (e.g., long-read data) and also to other map data (e.g., genetic maps) to improve the quality and the continuity of large and complex genome assemblies.


ARBitR: An overlap-aware genome assembly scaffolder for linked reads

10.1101/2020.04.29.065847 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Ltr Retrotransposons ◽

Sequencing Data ◽

Long Read ◽

Genome Assemblies ◽

General Public License

Abstract: 10X Genomics Chromium linked reads contain information that can be used to link sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences together with a gap between them, not considering potential contig overlaps. Such overlaps can be particularly prominent in genome drafts assembled from long-read sequencing data where an overlap-layout-consensus (OLC) algorithm has been used. Ignoring overlapping contig ends may result in genes and other features being incomplete or fragmented in the resulting scaffolds. We developed the application ARBitR to generate scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs. We tested the performance of ARBitR on three published and simulated datasets and compared it to the previously published tools ARCS and ARKS. ARBitR performed similarly in terms of contiguity statistics, and the advantage of the overlap step was shown by fewer long and short variants in ARBitR-produced scaffolds, in addition to a higher proportion of completely assembled LTR retrotransposons. We expect ARBitR to have broad applicability in genome assembly projects that utilize 10X Chromium linked reads. Availability and implementation: ARBitR is written and implemented in Python3 for Unix-like operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License. Supplementary information: available online.
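The difference between gap-based joining and overlap-aware joining can be shown with a small sketch: if the end of one contig exactly matches the start of the next, the two are merged across the shared sequence; otherwise they are joined with a run of N characters. This is a simplified illustration under the assumption of exact matches, not ARBitR's implementation. The function names, the minimum overlap, and the gap length are arbitrary choices for the example.

```python
def suffix_prefix_overlap(left, right, min_overlap=3):
    """Length of the longest exact overlap between the end of `left`
    and the start of `right`, or 0 if it is shorter than `min_overlap`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def join_contigs(left, right, min_overlap=3, gap="N" * 10):
    """Merge two linked contigs across an overlap if one exists;
    otherwise fall back to joining them with a gap of unknown bases."""
    size = suffix_prefix_overlap(left, right, min_overlap)
    if size:
        # Keep the shared sequence once instead of duplicating it.
        return left + right[size:]
    return left + gap + right


# Toy contigs (hypothetical): the first pair shares the 4-base end "TGCA".
print(join_contigs("ACGTTGCA", "TGCAGGT"))
print(join_contigs("AAAA", "CCCC"))
```

In the first call the overlapping bases appear only once in the result, whereas a gap-only scaffolder would keep both copies separated by Ns, which is the situation the abstract describes as fragmenting genes and other features.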


Genome-wide Survey of Tandem Repeats by Nanopore Sequencing Shows that Disease-associated Repeats are More Polymorphic in the General Population

10.21203/rs.3.rs-79348/v1 ◽

2020 ◽

Author(s):

Satomi Mitsuhashi ◽

Martin C Frith ◽

Naomichi Matsumoto

Keyword(s):

General Population ◽

Tandem Repeats ◽

Repeat Unit ◽

Mendelian Disease ◽

Length Variation ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Wide Survey

Abstract Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5'UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with near-by GWAS SNP genotypes. Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long read sequencing data. This method, especially if used in genome-wide association studies (GWAS), may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.
