A Transcriptome Post-Scaffolding Method for Assembling High Quality Contigs

With the rapid development of high-throughput sequencing technologies, new transcriptomes can be sequenced at low cost and with high coverage. Sequence assembly approaches have been modified to meet the requirements of de novo transcriptome assembly, which has complications not found in traditional genome assembly, such as variation in coverage for each candidate mRNA and alternative splicing. As a consequence, de novo assembly strategies tend to generate a large number of redundant contigs due to sequence variations, which adversely affects downstream analysis and experiments. In this work we propose TransPS, a transcriptome post-scaffolding method, to generate high-quality, nonredundant de novo transcriptomes. TransPS shows promising results on the test transcriptome datasets, where redundancy is reduced by more than 50% and, at the same time, coverage is improved considerably. The web server and source code are available.


Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird-of-paradise

10.1101/2019.12.19.882399 ◽

2019 ◽

Cited By ~ 5

Author(s):

Valentina Peona ◽

Mozes P.K. Blom ◽

Luohao Xu ◽

Reto Burri ◽

Shawn Sullivan ◽

...

Keyword(s):

Dark Matter ◽

Genome Assembly ◽

Sex Chromosome ◽

De Novo ◽

Model Organism ◽

Technology Choice ◽

High Quality ◽

Sequencing Technologies ◽

Downstream Analysis ◽

Genome Assemblies

Abstract Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies have opened up a whole new world of genomic biodiversity. Although these technologies generate high-quality genome assemblies, there are still genomic regions difficult to assemble, like repetitive elements and GC-rich regions (genomic “dark matter”). In this study, we compare the efficiency of currently used sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter starting from the same sample. By adopting different de novo assembly strategies, we were able to compare each individual draft assembly to a curated multiplatform one and identify the nature of the previously missing dark matter with a particular focus on transposable elements, multi-copy MHC genes, and GC-rich regions. Thanks to this multiplatform approach, we demonstrate the feasibility of producing a high-quality chromosome-level assembly for a non-model organism (paradise crow) for which only suboptimal samples are available. Our approach was able to reconstruct complex chromosomes like the repeat-rich W sex chromosome and several GC-rich microchromosomes. Telomere-to-telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects around the completeness of both the coding and non-coding parts of the genomes.


LeafGo: Leaf to Genome, a quick workflow to produce high-quality De novo genomes with Third Generation Sequencing technology

10.1101/2021.01.25.428044 ◽

2021 ◽

Author(s):

Patrick Driguez ◽

Salim Bougouffa ◽

Karen Carty ◽

Alexander Putra ◽

Kamel Jabbari ◽

...

Keyword(s):

De Novo ◽

Rapid Development ◽

Plant Genome ◽

Plant Genomics ◽

High Quality ◽

High Molecular Weight Dna ◽

Tissue Samples ◽

Sequencing Technologies ◽

The Cost ◽

New Generation

Abstract Recent years have witnessed a rapid development of sequencing technologies. Fundamental differences and limitations among various platforms impact the time, cost and accuracy of sequencing whole genomes. Here we designed a complete de novo plant genome generation workflow that starts from plant tissue samples and produces high-quality draft genomes with relatively modest laboratory and bioinformatic resources within seven days. To optimize our workflow we selected different plant species, from which we extracted high molecular weight DNA to make PacBio and ONT libraries for sequencing on the Sequel I, Sequel II and GridION platforms. We assembled high-quality draft genomes of two Eucalyptus species, E. rudis and E. camaldulensis, to chromosome level without using additional scaffolding technologies. For the rapid production of de novo genome assemblies of plant species, we showed that our DNA extraction protocol, followed by PacBio high-fidelity sequencing and assembly with new-generation assemblers such as hifiasm, produces excellent results. Our findings will be a valuable benchmark for groups planning wet- and dry-lab plant genomics research and for high-throughput plant genomics initiatives.


High-Quality Assembly of an Individual of Yoruban Descent

10.1101/067447 ◽

2016 ◽

Cited By ~ 9

Author(s):

Karyn Meltz Steinberg ◽

Tina Graves Lindsay ◽

Valerie A. Schneider ◽

Mark J.P. Chaisson ◽

Chad Tomlinson ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Bac Library ◽

Segmental Duplications ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Genome Assemblies ◽

Complete Genomic

Abstract De novo assembly of human genomes is now a tractable effort due in part to advances in sequencing and mapping technologies. We use PacBio single-molecule, real-time (SMRT) sequencing and BioNano genomic maps to construct the first de novo assembly of NA19240, a Yoruban individual from Africa. This chromosome-scaffolded assembly of 3.08 Gb with a contig N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb represents one of the most contiguous high-quality human genomes. We utilize a BAC library derived from NA19240 DNA and novel haplotype-resolving sequencing technologies and algorithms to characterize regions of complex genomic architecture that are normally lost due to compression to a linear haploid assembly. Our results demonstrate that multiple technologies are still necessary for complete genomic representation, particularly in regions of highly identical segmental duplications. Additionally, we show that diploid assembly has utility in improving the quality of de novo human genome assemblies.
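
The contig and scaffold N50 values quoted in this abstract are a standard contiguity metric: the length L such that sequences of length L or longer together cover at least half of the total assembly. As a minimal sketch (the function name and the toy lengths are illustrative assumptions, not taken from the paper), it can be computed as follows:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: total length 290, half is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misjoined assembly can have a high N50.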


A comprehensive review of scaffolding methods in genome assembly

Briefings in Bioinformatics ◽

10.1093/bib/bbab033 ◽

2021 ◽

Author(s):

Junwei Luo ◽

Yawei Wei ◽

Mengna Lyu ◽

Zhengjiang Wu ◽

Xiaoyan Liu ◽

...

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Rapid Development ◽

Genomic Research ◽

Future Research ◽

Sequencing Data ◽

Sequencing Technologies ◽

Biological Studies ◽

Downstream Analysis

Abstract In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
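
To illustrate the ordering step described in this abstract, the sketch below chains contigs greedily from link counts. It is a deliberately simplified illustration, not any published scaffolder: the function name, the `links` input format (contig pair mapped to a count of supporting reads) and the toy data are assumptions, and contig orientation and gap sizes, which real scaffolders must also resolve, are ignored.

```python
def greedy_scaffold(contigs, links):
    """Order contigs into linear chains using the best-supported links.

    links maps a (contig_a, contig_b) pair to the number of reads that
    support placing the two contigs next to each other (hypothetical input).
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    neighbours = {c: [] for c in contigs}
    # Accept links from strongest to weakest support.
    for (a, b), _support in sorted(links.items(), key=lambda kv: -kv[1]):
        if len(neighbours[a]) >= 2 or len(neighbours[b]) >= 2:
            continue  # a contig has at most two neighbours in a chain
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # joining would close a cycle
        parent[root_a] = root_b
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Walk each chain starting from one of its end contigs.
    scaffolds, seen = [], set()
    for c in contigs:
        if c in seen or len(neighbours[c]) == 2:
            continue
        path, prev, cur = [c], None, c
        seen.add(c)
        while True:
            nxt = [n for n in neighbours[cur] if n != prev]
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            path.append(cur)
            seen.add(cur)
        scaffolds.append(path)
    return scaffolds


# Toy example: the weak c1-c3 link conflicts with the stronger chain.
contigs = ["c1", "c2", "c3", "c4"]
links = {("c1", "c2"): 10, ("c2", "c3"): 8, ("c1", "c3"): 2}
print(greedy_scaffold(contigs, links))  # [['c1', 'c2', 'c3'], ['c4']]
```

The greedy choice keeps the example short; it can be misled by repeats, which is one of the difficulties the review discusses.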


WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

10.1101/840447 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alex Di Genova ◽

Elena Buena-Atienza ◽

Stephan Ossowski ◽

Marie-France Sagot

Keyword(s):

De Novo ◽

Computational Cost ◽

Sequence Information ◽

Sequencing Data ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 16.67-62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing the highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan
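
The abstract does not define QV; assuming it is the conventional Phred-scaled consensus quality, QV = -10 · log10(p), where p is the per-base error probability, the reported range can be translated into error rates. The helper names below are illustrative assumptions:

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred-scaled QV."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# QV range quoted in the abstract, under the Phred-scale assumption.
for qv in (27.79, 33.61):
    p = qv_to_error_rate(qv)
    print(f"QV {qv}: about one error every {1 / p:,.0f} bases")
```

Under that assumption, the quoted range corresponds to roughly one consensus error per 600 bases at the low end and one per 2,300 bases at the high end.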


Higher quality de novo genome assemblies from degraded museum specimens: a linked-read approach to museomics

10.1101/716506 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jocelyn P. Colella ◽

Anna Tigano ◽

Matthew D. MacManes

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Deer Mouse ◽

Cost Effective ◽

Molecular Data ◽

Degraded Dna ◽

Museum Specimens ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies

Abstract High-throughput sequencing technologies are a proposed solution for accessing the molecular data in historic specimens. However, degraded DNA combined with the computational demands of short-read assemblies has posed significant laboratory and bioinformatics challenges. Linked-read or ‘synthetic long-read’ sequencing technologies, such as 10X Genomics, may provide a cost-effective alternative solution to assemble higher quality de novo genomes from degraded specimens. Here, we compare assembly quality (e.g., genome contiguity and completeness, presence of orthogroups) between four published genomes assembled from a single shotgun library and four deer mouse (Peromyscus spp.) genomes assembled using 10X Genomics technology. At a similar price-point, these approaches produce vastly different assemblies, with linked-read assemblies having overall higher quality, measured by larger N50 values and greater gene content. Although not without caveats, our results suggest that linked-read sequencing technologies may represent a viable option to build de novo genomes from historic museum specimens, which may prove particularly valuable for extinct, rare, or difficult to collect taxa.


De Novo SNP Discovery and Genotyping of Iranian Pimpinella Species Using ddRAD Sequencing

Agronomy ◽

10.3390/agronomy11071342 ◽

2021 ◽

Vol 11 (7) ◽

pp. 1342

Author(s):

Shaghayegh Mehravi ◽

Gholam Ali Ranjbar ◽

Ghader Mirzaghaderi ◽

Anita Alice Severn-Ellis ◽

Armin Scheben ◽

...

Keyword(s):

De Novo ◽

Genetic Relationships ◽

Nucleotide Polymorphisms ◽

High Quality ◽

Genomic Resources ◽

High Quality Snps ◽

The Family ◽

Double Digestion ◽

Flanking Sequences ◽

Downstream Analysis

The species of Pimpinella, one of the largest genera of the family Apiaceae, are traditionally cultivated for medicinal purposes. In this study, high-throughput double digest restriction-site associated DNA sequencing technology (ddRAD-seq) was used to identify single nucleotide polymorphisms (SNPs) in eight Pimpinella species from Iran. After double-digestion with the enzymes HpyCH4IV and HinfI, a total of 334,702,966 paired-end reads were de novo assembled into 1,270,791 loci with an average of 28.8 reads per locus. After stringent filtering, 2440 high-quality SNPs were identified for downstream analysis. Analysis of genetic relationships and population structure, based on these retained SNPs, indicated the presence of three major groups. Gene ontology and pathway analyses were performed by comparing SNP-associated flanking sequences against a public non-redundant database. Due to the lack of genomic resources in this genus, our present study is the first report to provide high-quality SNPs in Pimpinella based on a de novo analysis pipeline using ddRAD-seq. These data will enhance the molecular knowledge of the genus Pimpinella and will provide an important source of information for breeders and the research community to enhance breeding programs and support the management of Pimpinella genomic resources.


Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.


High-Quality Genome Assembly of Peronospora destructor, the Causal Agent of Onion Downy Mildew

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-10-19-0280-a ◽

2020 ◽

Vol 33 (5) ◽

pp. 718-720

Author(s):

Karthi Natesan ◽

Ji Yeon Park ◽

Cheol-Woo Kim ◽

Dong Suk Park ◽

Young-Seok Kwon ◽

...

Keyword(s):

Downy Mildew ◽

De Novo ◽

Gc Content ◽

Comparative Genomic ◽

High Quality ◽

Sequencing Platform ◽

Peronospora Destructor ◽

Genomic Studies ◽

Genome Assemblies ◽

High Quality Genome

Peronospora destructor is an obligate biotrophic oomycete that causes downy mildew on onion (Allium cepa). Onion is an important crop worldwide, but its production is affected by this pathogen. We sequenced the genome of P. destructor using the PacBio sequencing platform, and de novo assembly resulted in 74 contigs with a total contig size of 29.3 Mb and 48.48% GC content. Here, we report the first high-quality genome sequence of P. destructor and its comparison with the genome assemblies of other oomycetes. The genome is a very useful resource to serve as a reference for analysis of P. destructor isolates and for comparative genomic studies of the biotrophic oomycetes.
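
GC content, reported above as 48.48% for the P. destructor assembly, is the percentage of G and C bases among the called bases of the contigs. A minimal sketch follows; the function name and toy sequences are illustrative assumptions, and excluding ambiguous bases such as N from the denominator is one common convention, not necessarily the one used by the authors:

```python
def gc_content(sequences):
    """Return the GC percentage over all A/C/G/T bases in the sequences.

    Ambiguous bases (for example N) are left out of the denominator,
    which is an assumed convention for this sketch.
    """
    gc = 0
    called = 0
    for seq in sequences:
        s = seq.upper()
        gc_here = s.count("G") + s.count("C")
        gc += gc_here
        called += gc_here + s.count("A") + s.count("T")
    if called == 0:
        raise ValueError("no A/C/G/T bases found")
    return 100 * gc / called


# Two toy contigs: 6 of the 8 called bases are G or C.
print(gc_content(["ACGT", "GGCC"]))  # 75.0
```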


Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation: https://github.com/yuansliu/minirmd. Supplementary information: Supplementary data are available at Bioinformatics online.
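
The general idea of minimizer-based duplicate removal can be sketched as follows. This is not minirmd's implementation: the function names, the k values, the Hamming-distance threshold (substitutions only, no indels) and the toy reads are all assumptions made for illustration. Reads are bucketed by the lexicographically smallest k-mer of their canonical strand, compared only within a bucket, and the pass is repeated with a second k so that a mismatch falling inside one minimizer can still be caught.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick the lexicographically smaller strand as the representative."""
    return min(seq, reverse_complement(seq))


def minimizer(seq, k):
    """Return the lexicographically smallest k-mer of seq."""
    if len(seq) <= k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def one_round(reads, k, max_mismatches):
    """Keep the first read of each near-duplicate group for one value of k."""
    buckets = defaultdict(list)
    kept = []
    for read in reads:
        canon = canonical(read)
        bucket = buckets[minimizer(canon, k)]
        is_duplicate = any(
            len(rep) == len(canon) and hamming(rep, canon) <= max_mismatches
            for rep in bucket
        )
        if not is_duplicate:
            bucket.append(canon)
            kept.append(read)
    return kept


def remove_near_duplicates(reads, ks=(4, 6), max_mismatches=1):
    """Run one clustering round per minimizer length in ks."""
    for k in ks:
        reads = one_round(reads, k, max_mismatches)
    return reads


reads = [
    "AAACCCGGGT",  # original read
    "AAACCCGGGT",  # exact duplicate
    "AGCCGGGTTT",  # reverse complement of the first read with one mismatch
    "TTTTGGGGAA",  # unrelated read
]
print(remove_near_duplicates(reads))  # ['AAACCCGGGT', 'TTTTGGGGAA']
```

A known weakness of this shortcut is that a mismatch can change which strand is chosen as canonical, in which case a reverse-strand near-duplicate lands in a different bucket and is missed.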
