AFLAP: assembly-free linkage analysis pipeline using k-mers from genome sequencing data

Abstract: Our assembly-free linkage analysis pipeline (AFLAP) identifies segregating markers as k-mers in the raw reads without using a reference genome assembly for calling variants and provides genotype tables for the construction of unbiased, high-density genetic maps without a genome assembly. AFLAP is validated and contrasted to a conventional workflow using simulated data. AFLAP is applied to whole genome sequencing and genotype-by-sequencing data of F1, F2, and recombinant inbred populations of two different plant species, producing genetic maps that are concordant with genome assemblies. The AFLAP-based genetic map for Bremia lactucae enables the production of a chromosome-scale genome assembly.

AFLAP: Assembly-Free Linkage Analysis Pipeline using k-mers from whole genome sequencing data

10.1101/2020.09.14.296525, 2020

Author(s): Kyle Fletcher, Lin Zhang, Juliana Gil, Rongkui Han, Keri Cavanaugh, ...

Keyword(s): Linkage Analysis, Whole Genome Sequencing, Genome Sequencing, Genetic Map, Genotyping By Sequencing, Genetic Maps, Whole Genome, Sequencing Data, Analysis Pipeline, Genome Assemblies

Abstract. Background: Genetic maps are an important resource for validation of genome assemblies, trait discovery, and breeding. Next generation sequencing has enabled production of high-density genetic maps constructed with 10,000s of markers. Most current approaches require a genome assembly to identify markers. Our Assembly Free Linkage Analysis Pipeline (AFLAP) removes this requirement by using uniquely segregating k-mers as markers to rapidly construct a genotype table and perform subsequent linkage analysis. This avoids potential biases including preferential read alignment and variant calling. Results: The performance of AFLAP was determined in simulations and contrasted to a conventional workflow. We tested AFLAP using 100 F2 individuals of Arabidopsis thaliana, sequenced to low coverage. Genetic maps generated using k-mers contained over 130,000 markers that were concordant with the genomic assembly. The utility of AFLAP was then demonstrated by generating an accurate genetic map using genotyping-by-sequencing data of 235 recombinant inbred lines of Lactuca spp. AFLAP was then applied to 83 F1 individuals of the oomycete Bremia lactucae, sequenced to >5x coverage. The genetic map contained over 90,000 markers ordered in 19 large linkage groups. This genetic map was used to fragment, order, orient, and scaffold the genome, resulting in a much-improved reference assembly. Conclusions: AFLAP can be used to generate high density linkage maps and improve genome assemblies of any organism when a mapping population is available using whole genome sequencing or genotyping-by-sequencing data. Genetic maps produced for B. lactucae were accurately aligned to the genome and guided significant improvements of the reference assembly.
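
The idea of using parent-specific k-mers as presence/absence markers can be illustrated with a toy sketch. This is not AFLAP's implementation: the function names (`canonical`, `kmer_set`, `genotype_table`), the tiny k of 5, and the invented reads are all assumptions made for illustration, and a real pipeline would also filter k-mers by coverage and test segregation ratios.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_set(reads, k, min_count=1):
    """Collect canonical k-mers seen at least min_count times in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return {kmer for kmer, count in counts.items() if count >= min_count}


def genotype_table(parent_a_reads, parent_b_reads, progeny_reads, k):
    """Score each progeny for presence (1) or absence (0) of k-mers unique to parent A."""
    markers = sorted(kmer_set(parent_a_reads, k) - kmer_set(parent_b_reads, k))
    table = {}
    for name, reads in progeny_reads.items():
        present = kmer_set(reads, k)
        table[name] = [1 if marker in present else 0 for marker in markers]
    return markers, table


# Hypothetical toy data: the two parents differ by a single base.
parent_a = ["GATTACAGGC"]
parent_b = ["GATTGCAGGC"]
progeny = {"p1": ["GATTACAGGC"], "p2": ["GATTGCAGGC"]}

markers, table = genotype_table(parent_a, parent_b, progeny, k=5)
for name, calls in table.items():
    print(name, calls)
```

Each row of the resulting table is a marker-by-individual genotype vector, which is the kind of input a linkage-mapping program would then group and order.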

MaGuS: a tool for map-guided scaffolding and quality assessment of genome assemblies

10.1101/032045, 2015

Author(s): Mohammed-Amin Madoui, Carole Dossat, Leo d'Agata, Edwin van der Vossen, Jan van Oeveren, ...

Keyword(s): High Throughput, Genome Assembly, High Throughput Sequencing, Draft Genome, Genetic Maps, Sequencing Data, A Genome, Genome Map, Genome Assemblies, Complex Genome

Background: Scaffolding is a crucial step in the genome assembly process. Current methods based on large fragment paired-end reads or long reads allow an increase in continuity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and allowing scaffold ordering and anchoring. Results: We present MaGuS (map-guided scaffolding), a modular tool that uses a draft genome assembly, a genome map, and high-throughput paired-end sequencing data to estimate the quality and to enhance the continuity of an assembly. We generated several assemblies of the Arabidopsis genome using different scaffolding programs and applied MaGuS to select the best assembly using quality metrics. Then, we used MaGuS to perform map-guided scaffolding to increase continuity by creating new scaffold links in low-covered and highly repetitive regions where other commonly used scaffolding methods lack consistency. Conclusions: MaGuS is a powerful reference-free evaluator of assembly quality and a map-guided scaffolder that is freely available at https://github.com/institut-de-genomique/MaGuS. Its use can be extended to other high-throughput sequencing data (e.g., long-read data) and also to other map data (e.g., genetic maps) to improve the quality and the continuity of large and complex genome assemblies.

Quinoa genome assembly employing genomic variation for guided scaffolding

Theoretical and Applied Genetics, 10.1007/s00122-021-03915-x, 2021

Author(s): Alexandrina Bodrug-Schepers, Nancy Stralis-Pavese, Hermann Buerstmayr, Juliane C. Dohm, Heinz Himmelbauer

Keyword(s): Genome Assembly, Chenopodium Quinoa, Genomic Variation, Valuable Resource, Sequencing Data, Genome Wide, A Genome, Long Read, Genome Assemblies, Haplotype Information

Key message: We propose to use the natural variation between individuals of a population for genome assembly scaffolding. In today’s genome projects, multiple accessions get sequenced, leading to variant catalogs. Using such information to improve genome assemblies is attractive both cost-wise as well as scientifically, because the value of an assembly increases with its contiguity. We conclude that haplotype information is a valuable resource to group and order contigs toward the generation of pseudomolecules.

Abstract: Quinoa (Chenopodium quinoa) has been under cultivation in Latin America for more than 7500 years. Recently, quinoa has gained increasing attention due to its stress resistance and its nutritional value. We generated a novel quinoa genome assembly for the Bolivian accession CHEN125 using PacBio long-read sequencing data (assembly size 1.32 Gbp, initial N50 size 608 kbp). Next, we re-sequenced 50 quinoa accessions from Peru and Bolivia. This set of accessions differed at 4.4 million single-nucleotide variant (SNV) positions compared to CHEN125 (1.4 million SNV positions on average per accession). We show how to exploit variation in accessions that are distantly related to establish a genome-wide ordered set of contigs for guided scaffolding of a reference assembly. The method is based on detecting shared haplotypes and their expected continuity throughout the genome (i.e., the effect of linkage disequilibrium), as an extension of what is expected in mapping populations where only a few haplotypes are present. We test the approach using Arabidopsis thaliana data from different populations. After applying the method on our CHEN125 quinoa assembly we validated the results with mate-pairs, genetic markers, and another quinoa assembly originating from a Chilean cultivar. We show consistency between these information sources and the haplotype-based relations as determined by us and obtain an improved assembly with an N50 size of 1079 kbp and ordered contig groups of up to 39.7 Mbp. We conclude that haplotype information in distantly related individuals of the same species is a valuable resource to group and order contigs according to their adjacency in the genome toward the generation of pseudomolecules.
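
The contiguity figures quoted above are N50 values. As a reminder of what that statistic measures, here is a minimal sketch of the standard N50 definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the contig lengths are invented for illustration and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units).
example_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_contigs))
```

Joining contigs into longer scaffolds moves more of the total length into fewer, larger sequences, which is why a successful scaffolding step raises the N50.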

Local Ancestry Prediction with PyLAE

10.1101/2020.11.13.380105, 2020

Author(s): Alexander Smetanin, Nikita Moshkov, Tatiana V. Tatarinova

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Computational Efficiency, Source Code, High Density, Whole Genome Sequencing Data, Whole Genome, Sequencing Data, Local Ancestry, A Genome

Abstract. Summary: We developed PyLAE, a new tool for determining local ancestry along a genome using whole-genome sequencing data or high-density genotyping experiments. PyLAE can process an arbitrarily large number of ancestral populations (with or without an informative prior). Since PyLAE does not involve estimation of many parameters, it can process thousands of genomes within a day. Computational efficiency, straightforward presentation of results, and ease of installation make PyLAE a useful tool to study admixed populations. Availability and implementation: The source code and installation manual are available at https://github.com/smetam/pylae.

Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352, 2015

Author(s): Alejandro Hernandez Wences, Michael Schatz

Keyword(s): Open Source, Genome Assembly, De Novo, A Genome, Genome Assemblies, Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.

Challenging a bioinformatic tool’s ability to detect microbial contaminants using in silico whole genome sequencing data

PeerJ, 10.7717/peerj.3729, 2017, Vol 5, pp. e3729, Cited By ~ 6

Author(s): Nathan D. Olson, Justin M. Zook, Jayne B. Morrow, Nancy J. Lin

Keyword(s): Binary Mixtures, Genome Sequencing, Pathogen Detection, A Priori, High Sensitivity, False Positives, Sequencing Data, Materials Used, The Individual, Genome Assemblies

High sensitivity methods such as next generation sequencing and polymerase chain reaction (PCR) are adversely impacted by organismal and DNA contaminants. Current methods for detecting contaminants in microbial materials (genomic DNA and cultures) are not sensitive enough and require either a known or culturable contaminant. Whole genome sequencing (WGS) is a promising approach for detecting contaminants due to its sensitivity and lack of need for a priori assumptions about the contaminant. Prior to applying WGS, we must first understand its limitations for detecting contaminants and potential for false positives. Herein we demonstrate and characterize a WGS-based approach to detect organismal contaminants using an existing metagenomic taxonomic classification algorithm. Simulated WGS datasets from ten genera as individuals and binary mixtures of eight organisms at varying ratios were analyzed to evaluate the role of contaminant concentration and taxonomy on detection. For the individual genomes the false positive contaminants reported depended on the genus, with Staphylococcus, Escherichia, and Shigella having the highest proportion of false positives. For nearly all binary mixtures the contaminant was detected in the in-silico datasets at the equivalent of 1 in 1,000 cells, though F. tularensis was not detected in any of the simulated contaminant mixtures and Y. pestis was only detected at the equivalent of one in 10 cells. Once a WGS method for detecting contaminants is characterized, it can be applied to evaluate microbial material purity, in efforts to ensure that contaminants are characterized in microbial materials used to validate pathogen detection assays, generate genome assemblies for database submission, and benchmark sequencing methods.

Simulating the Dynamics of Targeted Capture Sequencing with CapSim

10.1101/134510, 2017, Cited By ~ 1

Author(s): Minh Duc Cao, Devika Ganesamoorthy, Lachlan J.M. Coin

Keyword(s): Statistical Power, Simulated Data, Targeted Sequencing, Design Parameters, Probe Design, Analysis Pipeline, A Genome, Targeted Capture, Sequencing Platforms, Sequencing Process

Abstract. Motivation: Targeted sequencing using capture probes has become increasingly popular in clinical applications due to its scalability and cost-effectiveness. The approach also allows for higher sequencing coverage of the targeted regions resulting in better analysis statistical power. However, because of the dynamics of the hybridisation process, it is difficult to evaluate the efficiency of the probe design prior to the experiments which are time consuming and costly. Results: We developed CapSim, a software package for simulation of targeted sequencing. Given a genome sequence and a set of probes, CapSim simulates the fragmentation, the dynamics of probe hybridisation, and the sequencing of the captured fragments on Illumina and PacBio sequencing platforms. The simulated data can be used for evaluating the performance of the analysis pipeline, as well as the efficiency of the probe design. Parameters of the various stages in the sequencing process can also be evaluated in order to optimise the efficacy of the experiments. Availability: CapSim is publicly available under BSD license at https://github.com/mdcao/capsim.

Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing

Communications Biology, 10.1038/s42003-021-02559-3, 2021, Vol 4 (1)

Author(s): Caroline Belser, Franc-Christophe Baurens, Benjamin Noel, Guillaume Martin, Corinne Cruaud, ...

Keyword(s): Musa Acuminata, Genetic Maps, Nanopore Sequencing, Genome Coverage, Long Reads, Oxford Nanopore, A Genome, Long Read, Genome Assemblies, First Time

Abstract: Long-read technologies hold the promise of more complete genome assemblies that are easier to produce. Coupled with long-range technologies, they can reveal the architecture of complex regions, like centromeres or rDNA clusters. These technologies also make it possible to know the complete organization of chromosomes, which previously remained difficult to resolve even when using genetic maps. However, generating a gapless and telomere-to-telomere assembly is still not trivial, and requires a combination of several technologies and the choice of suitable software. Here, we report a chromosome-scale assembly of a banana genome (Musa acuminata) generated using Oxford Nanopore long reads. We generated a genome coverage of 177X from a single PromethION flowcell, including nearly 17X from reads longer than 75 kbp. Of the 11 chromosomes, 5 were entirely reconstructed in a single contig from telomere to telomere, revealing for the first time the content of complex regions like centromeres or clusters of paralogous genes.

Comparison of MiSeq, MinION, and hybrid genome sequencing for analysis of Campylobacter jejuni

Scientific Reports, 10.1038/s41598-021-84956-6, 2021, Vol 11 (1)

Author(s): Jason M. Neal-McKinney, Kun C. Liu, Christopher M. Lock, Wen-Hsin Wu, Jinxin Hu

Keyword(s): Genome Sequencing, Sequence Data, Bacterial Genome, Illumina MiSeq, tRNA Genes, Sequencing Data, Data Types, Field Isolates, Hybrid Genome, Genome Assemblies

Abstract: The sequencing, assembly, and analysis of bacterial genomes is central to tracking and characterizing foodborne pathogens. The bulk of bacterial genome sequencing at the US Food and Drug Administration is performed using short-read Illumina MiSeq technology, resulting in highly accurate but fragmented genomic sequences. The MinION sequencer from Oxford Nanopore is an evolving technology that produces long-read sequencing data with low equipment cost. The goal of this study was to compare Campylobacter genome assemblies generated from MiSeq and MinION data independently, as well as hybrid genome assemblies combining both data types. Two reference strains and two field isolates of C. jejuni were sequenced using MiSeq and MinION, and the sequence data were assembled using the software programs SPAdes and Canu, respectively. Hybrid genome assembly was performed using the program Unicycler. Comparison of the C. jejuni 81-176 and RM1221 genome assemblies to the PacBio reference genomes revealed that the SPAdes assemblies had the most accurate nucleotide identity, while the hybrid assemblies were the most contiguous. Assemblies generated only from MinION data using Canu were the least accurate, containing many indels and substitutions that affected downstream analyses. The hybrid sequencing approach was the most useful for detecting plasmids, large genome rearrangements, and repetitive elements such as rRNA and tRNA genes. The full genomes of both C. jejuni field isolates were completed and circularized using hybrid sequencing, and a plasmid was detected in one isolate. Continued development of nanopore sequencing technologies will likely enhance the accuracy of hybrid genome assemblies and enable public health laboratories to routinely generate complete circularized bacterial genome sequences.

Local ancestry prediction with PyLAE

PeerJ, 10.7717/peerj.12502, 2021, Vol 9, pp. e12502

Author(s): Nikita Moshkov, Aleksandr Smetanin, Tatiana V. Tatarinova

Keyword(s): Genome Sequencing, Gold Standard, Source Code, Genomic Data, Whole Genome Sequencing Data, Whole Genome, Sequencing Data, Local Ancestry, 1000 Genomes, A Genome

Summary: We developed PyLAE, a new tool for determining local ancestry along a genome using whole-genome sequencing data or high-density genotyping experiments. PyLAE can process an arbitrarily large number of ancestral populations (with or without an informative prior). Since PyLAE does not involve estimating many parameters, it can process thousands of genomes within a day. PyLAE can run on phased or unphased genomic data. We have shown how PyLAE can be applied to the identification of differentially enriched pathways between populations. The local ancestry approach results in higher enrichment scores compared to whole-genome approaches. We benchmarked PyLAE using the 1000 Genomes dataset, comparing the aggregated predictions with the global admixture results and the current gold standard program RFMix. Computational efficiency, minimal requirements for data pre-processing, straightforward presentation of results, and ease of installation make PyLAE a valuable tool to study admixed populations. Availability and implementation: The source code and installation manual are available at https://github.com/smetam/pylae.
