HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding

Abstract. Background: Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding. Results: Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the Thorny Skate (Amblyraja radiata; 2650 Mb). Conclusions: HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. The improvements for the mosquito’s largest three scaffolds, representing the number of chromosomes, were from 61 to 86%, and the improvement was even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.
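The hill-climbing idea described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not HapSolo's implementation: the cost formula in `busco_cost` is an assumed stand-in built from the three BUSCO counts the abstract names, and `toy_busco_counts` is a hypothetical model that replaces the real step of filtering contigs by alignment metrics and scoring the result with BUSCO.

```python
import random


def busco_cost(missing, duplicated, single):
    """Assumed stand-in cost: missing and duplicated BUSCOs relative to single-copy ones."""
    if single == 0:
        return float("inf")
    return (missing + duplicated) / single


def toy_busco_counts(threshold):
    """Hypothetical model of BUSCO counts as a function of one purging threshold in [0, 1].

    A low threshold purges too many contigs (more missing genes); a high one
    keeps secondary haplotigs (more duplicated genes).
    """
    total = 1000
    duplicated = int(200 * threshold)
    missing = int(300 * (1.0 - threshold) ** 2)
    single = total - duplicated - missing
    return missing, duplicated, single


def hill_climb(start, steps=2000, step_size=0.05, seed=0):
    """Propose random nearby thresholds and keep a proposal only if it lowers the cost."""
    rng = random.Random(seed)
    best = start
    best_cost = busco_cost(*toy_busco_counts(best))
    for _ in range(steps):
        candidate = min(1.0, max(0.0, best + rng.uniform(-step_size, step_size)))
        candidate_cost = busco_cost(*toy_busco_counts(candidate))
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


if __name__ == "__main__":
    threshold, cost = hill_climb(start=0.9)
    print(f"threshold={threshold:.3f} cost={cost:.4f}")
```

Because a proposal is accepted only when it strictly lowers the cost, the returned cost can never exceed the cost at the starting threshold. In practice several random starting points would be tried, since plain hill climbing can stall in a local minimum.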


HapSolo: An optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding

10.1101/2020.06.29.178848 ◽

2020 ◽

Author(s):

Edwin A. Solares ◽

Yuan Tao ◽

Anthony D. Long ◽

Brandon S. Gaut

Keyword(s):

Cost Function ◽

Anopheles Funestus ◽

Hill Climbing ◽

Optimization Approach ◽

Sequencing Technology ◽

Genome Data ◽

A Genome ◽

Long Read ◽

Downstream Analysis ◽

The Cost

Abstract. Background: Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding. Results: Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the Thorny Skate (Amblyraja radiata; 2,650 Mb). Conclusions: HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. The improvements for mosquito scaffolding were similar to those for Chardonnay (from 61% to 86%), and even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on another published method for identifying secondary contigs, with generally superior results for HapSolo.
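The abstract reports gains in contig N50 after secondary contigs are removed. As a reminder of how that metric behaves, here is a minimal sketch of the standard N50 definition; the contig lengths in the usage example are invented purely for illustration and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    unreduced = [100, 50, 50, 40, 40, 20]  # made-up lengths, including redundant contigs
    reduced = [100, 50, 40]                # the same toy assembly after purging
    print(n50(unreduced), n50(reduced))
```

In this toy case the total length drops from 300 to 190 and the N50 rises from 50 to 100, because the longest contig now accounts for more than half of a smaller assembly.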


PacBio sequencing output increased through uniform and directional fivefold concatenation

Scientific Reports ◽

10.1038/s41598-021-96829-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nisha Kanwar ◽

Celia Blanco ◽

Irene A. Chen ◽

Burckhard Seelig

Keyword(s):

Improved Method ◽

Sequencing Technology ◽

Pacbio Sequencing ◽

Limited Sample ◽

Analysis Pipeline ◽

Medium Length ◽

Engineering Study ◽

Long Read ◽

The Cost ◽

User Friendly

Abstract: Advances in sequencing technology have allowed researchers to sequence DNA with greater ease and at decreasing costs. Main developments have focused on either sequencing many short sequences or fewer large sequences. Methods for sequencing mid-sized sequences of 600–5,000 bp are currently less efficient. For example, the PacBio Sequel I system yields ~100,000–300,000 reads with an accuracy per base pair of 90–99%. We sought to sequence several DNA populations of ~870 bp in length with a sequencing accuracy of 99% and to the greatest depth possible. We optimised a simple, robust method to concatenate genes of ~870 bp five times and then sequenced the resulting DNA of ~5,000 bp by PacBio SMRT long-read sequencing. Our method improved upon previously published concatenation attempts, leading to a greater sequencing depth, high-quality reads and limited sample preparation at little expense. We applied this efficient concatenation protocol to sequence nine DNA populations from a protein engineering study. The improved method is accompanied by a simple and user-friendly analysis pipeline, DeCatCounter, to sequence medium-length sequences efficiently at one-fifth of the cost.


QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽

2019 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

High Error Rate ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Abstract. Motivation: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability: https://github.com/joshidhaivat/QAlign.git
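The conversion step named in the abstract, turning a nucleotide read into a sequence of discretized current levels, might look roughly like the sketch below. Everything numeric here is an assumption made for illustration: the k-mer size `K`, the `BASE_WEIGHT` table and the `toy_current` function are invented and do not come from QAlign or from any real pore model, and equal-width binning is just one simple way to discretize.

```python
from itertools import product

K = 3  # toy k-mer size (assumption)
BASE_WEIGHT = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}  # made-up values, not a pore model


def toy_current(kmer):
    """Hypothetical 'mean current' of a k-mer: a position-weighted sum of made-up base values."""
    return sum(BASE_WEIGHT[base] * (i + 1) for i, base in enumerate(kmer))


def build_quantizer(levels):
    """Split the range of toy currents over all k-mers into `levels` equal-width bins."""
    currents = [toy_current("".join(p)) for p in product("ACGT", repeat=K)]
    lo, hi = min(currents), max(currents)
    width = (hi - lo) / levels

    def quantize(kmer):
        index = int((toy_current(kmer) - lo) / width)
        return min(index, levels - 1)  # the maximum current belongs to the top bin

    return quantize


def to_levels(read, levels=3):
    """Slide a window of K bases along the read and emit one discrete level per k-mer."""
    quantize = build_quantizer(levels)
    return [quantize(read[i:i + K]) for i in range(len(read) - K + 1)]


if __name__ == "__main__":
    print(to_levels("ACGTACGT"))
```

The output is a sequence over a small alphabet of levels rather than over A, C, G and T, which is the form that would then be handed to an ordinary sequence aligner, with the reference converted in the same way.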


Improved contiguity of the threespine stickleback genome using long-read sequencing

10.1101/2020.06.30.170787 ◽

2020 ◽

Cited By ~ 1

Author(s):

Shivangi Nath ◽

Daniel E. Shaw ◽

Michael A. White

Keyword(s):

Gasterosteus Aculeatus ◽

Genetic Model ◽

Threespine Stickleback ◽

Model Species ◽

Long Distance ◽

Sequencing Technologies ◽

A Genome ◽

Long Read ◽

The Cost ◽

Stickleback Genome

Abstract: While the cost and time for assembling a genome have drastically reduced, it still remains a challenge to assemble a highly contiguous genome. These challenges are rapidly being overcome by the integration of long-read sequencing technologies. Here, we use long sequencing reads to improve the contiguity of the threespine stickleback fish (Gasterosteus aculeatus) genome, a prominent genetic model species. Using Pacific Biosciences sequencing, we were able to fill over 76% of the gaps in the genome, improving contiguity over five-fold. Our approach was highly accurate, validated by 10X Genomics long-distance linked-reads. In addition to closing a majority of gaps, we were able to assemble segments of telomeres and centromeres throughout the genome. This highlights the power of using long sequencing reads to assemble highly repetitive and difficult-to-assemble regions of genomes. This latest genome build has been released through a newly designed community genome browser that aims to consolidate the growing number of genomics datasets available for the threespine stickleback fish.


Efficient iterative Hi-C scaffolder based on N-best neighbors

BMC Bioinformatics ◽

10.1186/s12859-021-04453-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dengfeng Guan ◽

Shane A. McCarthy ◽

Zemin Ning ◽

Guohua Wang ◽

Yadong Wang ◽

...

Keyword(s):

De Novo ◽

A Priori ◽

Sequencing Technology ◽

Current Standard ◽

A Genome ◽

Eukaryotic Species ◽

Long Read ◽

Reference Quality ◽

Comparable Accuracy ◽

Chromosomal Profile

Abstract. Background: Efficient and effective genome scaffolding tools are still in high demand for generating reference-quality assemblies. While long-read data alone is unlikely to create a chromosome-scale assembly for most eukaryotic species, the inexpensive Hi-C sequencing technology, capable of capturing the chromosomal profile of a genome, is now widely used to complete the task. However, the existing Hi-C based scaffolding tools either require an a priori chromosome number as input, or lack the ability to build highly continuous scaffolds. Results: We design and develop a novel Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. Subsequent to scaffolding, it identifies potential misjoins and breaks them to maintain scaffolding accuracy. Through our tests on three long-read based de novo assemblies from three different species, we demonstrate that pin_hic is more efficient than current standard state-of-the-art tools, and it can generate much more continuous scaffolds, while achieving a higher or comparable accuracy. Conclusions: Pin_hic is an efficient Hi-C based scaffolding tool, which can be useful for building chromosome-scale assemblies. As many sequencing projects have been launched in recent years, we believe pin_hic has the potential to be applied in these projects and make a meaningful contribution.


QAlign: aligning nanopore reads accurately using current-level modeling

Bioinformatics ◽

10.1093/bioinformatics/btaa875 ◽

2020 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

Supplementary Information ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Abstract. Motivation: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this article, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability and implementation: https://github.com/joshidhaivat/QAlign.git. Supplementary information: Supplementary data are available at Bioinformatics online.


Improved contiguity of the threespine stickleback genome using long-read sequencing

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab007 ◽

2021 ◽

Vol 11 (2) ◽

Author(s):

Shivangi Nath ◽

Daniel E Shaw ◽

Michael A White

Keyword(s):

Gasterosteus Aculeatus ◽

Genetic Model ◽

Threespine Stickleback ◽

Long Distance ◽

Sequencing Technologies ◽

Reference Genome Assembly ◽

A Genome ◽

Long Read ◽

The Cost ◽

Stickleback Genome

Abstract: While the cost and time for assembling a genome have drastically decreased, it still remains a challenge to assemble a highly contiguous genome. These challenges are rapidly being overcome by the integration of long-read sequencing technologies. Here, we use long-read sequencing to improve the contiguity of the threespine stickleback fish (Gasterosteus aculeatus) genome, a prominent genetic model species. Using Pacific Biosciences sequencing, we assembled a highly contiguous genome of a freshwater fish from Paxton Lake. Using contigs from this genome, we were able to fill over 76.7% of the gaps in the existing reference genome assembly, improving contiguity over fivefold. Our gap-filling approach was highly accurate, validated by 10X Genomics long-distance linked-reads. In addition to closing a majority of gaps, we were able to assemble segments of telomeres and centromeres throughout the genome. This highlights the power of using long sequencing reads to assemble highly repetitive and difficult-to-assemble regions of genomes. This latest genome build has been released through a newly designed community genome browser that aims to consolidate the growing number of genomics datasets available for the threespine stickleback fish.


System identification: Parameter and time-delay estimation for Wiener nonlinear systems with delayed input

Transactions of the Institute of Measurement and Control ◽

10.1177/0142331216674772 ◽

2016 ◽

Vol 40 (3) ◽

pp. 1035-1045 ◽

Cited By ~ 3

Author(s):

Asma Atitallah ◽

Saïda Bedoui ◽

Kamel Abderrahim

Keyword(s):

Time Delay ◽

System Identification ◽

Cost Function ◽

Delay System ◽

Time Delay Estimation ◽

Gradient Algorithm ◽

Delay Systems ◽

Optimization Approach ◽

Auxiliary Model ◽

The Cost

In this paper, a novel optimization approach to estimate the time delay and the parameters of Wiener time-delay systems is proposed. The proposed method consists of first defining a cost function and then selecting an appropriate algorithm to minimize it. However, any cost function used for Wiener time-delay system identification presents several difficulties in terms of nonlinearity and inaccessible measurements. The hierarchical approach, the rounding property and the auxiliary model approach are suggested as solutions to overcome these difficulties. These solutions allow us to transform the cost function to be minimized into two simple cost functions that are minimized using the conjugate gradient algorithm with different choices of its main parameters. Simulation results are presented to illustrate the performance of the proposed approach.


HEURISTIC OPTIMIZATION OF THE FOUNDATION OF A DYNAMICALLY STRESSED ROTATING MACHINE USING THE LATE ACCEPTANCE HILL CLIMBING (LAHC) ALGORITHM

DYNA INGENIERIA E INDUSTRIA ◽

10.6036/9762 ◽

2021 ◽

Vol 96 (5) ◽

pp. 498-504

Author(s):

JUAN LUIS TERRADEZ MARCO ◽

ANTONIO HOSPITALER PEREZ ◽

VICENTE ALBERO GABARDA

Keyword(s):

Cost Function ◽

Random Search ◽

Optimal Solution ◽

Hill Climbing ◽

Dynamic Loads ◽

Rotating Machine ◽

Permanent Working ◽

The Cost ◽

Late Acceptance Hill Climbing

This paper proposes the optimization of a foundation for a rotating machine under dynamic loads in transient and permanent working modes. The foundation depends on fixed parameters and 37 variables. Functional constraints are defined for the foundation. A cost function of the variables is defined and minimized to find the optimum. Of all the possible solutions, only those that satisfy the constraints and minimize the cost function are selected. The search for the optimal solution uses a single-point neighbourhood random search algorithm called Late Acceptance Hill Climbing (LAHC). It is an algorithm of the type called “Adaptive Memory Programming” (AMP) that accepts worse solutions in order to escape local minima and learns from the results of the search. The algorithm depends only on the length of the comparison vector and the stopping criterion. 350 experiments were run with different lengths of the comparison vector. The quality of the optimal solutions obtained with each vector length was analysed. The quality of each set of solutions was compared by fitting it to a three-parameter Weibull distribution.
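The acceptance rule that gives LAHC its name can be sketched as follows. This is a generic minimal version applied to a toy quadratic cost, not the paper's foundation model with its 37 variables and constraints; the function names, the fixed iteration budget used as the stopping criterion, and the toy cost are all assumptions for illustration.

```python
import random


def lahc(cost, neighbour, start, history_length=50, max_iters=5000, seed=0):
    """Late Acceptance Hill Climbing with a fixed iteration budget."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    history = [current_cost] * history_length  # the "comparison vector"
    for i in range(max_iters):
        candidate = neighbour(current, rng)
        candidate_cost = cost(candidate)
        slot = i % history_length
        # Accept if no worse than the cost stored history_length iterations ago,
        # or no worse than the current cost. The first test lets worse moves through.
        if candidate_cost <= history[slot] or candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        history[slot] = current_cost
    return best, best_cost


def toy_cost(x):
    """Toy objective: squared distance from the point (3, 3)."""
    return sum((xi - 3.0) ** 2 for xi in x)


def toy_neighbour(x, rng):
    """Perturb one randomly chosen coordinate by a small random amount."""
    y = list(x)
    k = rng.randrange(len(y))
    y[k] += rng.uniform(-0.5, 0.5)
    return tuple(y)


if __name__ == "__main__":
    solution, value = lahc(toy_cost, toy_neighbour, start=(10.0, -10.0))
    print(solution, value)
```

The only tuning parameters are `history_length` and the stopping rule, which matches the abstract's remark that the algorithm depends only on the length of the comparison vector and the stopping criterion. A length of 1 reduces the method to ordinary greedy hill climbing.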


Illuminating the Black Box of Genome Sequence Assembly

The American Biology Teacher ◽

10.1525/abt.2013.75.8.9 ◽

2013 ◽

Vol 75 (8) ◽

pp. 572-577 ◽

Cited By ~ 2

Author(s):

D. Leland Taylor ◽

A. Malcolm Campbell ◽

Laurie J. Heyer

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

Black Box ◽

Assembly Process ◽

Sequencing Technology ◽

Sequencing Technologies ◽

Genome Sequence Assembly ◽

A Genome ◽

The Cost ◽

Generation Sequencing

Next-generation sequencing technologies have greatly reduced the cost of sequencing genomes. With the current sequencing technology, a genome is broken into fragments and sequenced, producing millions of “reads.” A computer algorithm pieces these reads together in the genome assembly process. PHAST is a set of online modules (http://gcat.davidson.edu/phast) designed to teach advanced high school and college students the genome assembly process. PHAST allows users to assemble phage genomes in real time and includes tutorials detailing the complexities of genome assembly. With PHAST, students learn concepts behind genome assembly and understand how mathematics solves biological problems such as genome assembly.
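To make the idea of an algorithm that "pieces these reads together" concrete, here is a toy greedy overlap assembler of the kind often used in teaching. It is not PHAST's code and not how production assemblers work; the reads are assumed to be error-free and on the same strand, and the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (at least `min_len` long), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining pieces stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

With these three reads the overlaps are unambiguous and the result is the single contig `ATGCGTACGTTAGC`. Repeated sequence in a real genome breaks this greedy strategy, which is one of the complexities tutorials on assembly typically cover.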
