A de novo Full-Length mRNA Transcriptome Generated From Hybrid-Corrected PacBio Long-Reads Improves the Transcript Annotation and Identifies Thousands of Novel Splice Variants in Atlantic Salmon

2021 ◽  
Vol 12 ◽  
Author(s):  
Sigmund Ramberg ◽  
Bjørn Høyheim ◽  
Tone-Kari Knutsdatter Østbye ◽  
Rune Andreassen

Atlantic salmon (Salmo salar) is a major species produced in world aquaculture and an important vertebrate model organism for studying the process of rediploidization following whole genome duplication events (Ss4R, 80 mya). The current Salmo salar transcriptome is largely generated from genome sequence based in silico predictions supported by ESTs and short-read sequencing data. However, recent progress in long-read sequencing technologies now allows for full-length transcript sequencing from single RNA-molecules. This study provides a de novo full-length mRNA transcriptome from liver, head-kidney and gill materials. A pipeline was developed based on Iso-seq sequencing of long-reads on the PacBio platform (HQ reads) followed by error-correction of the HQ reads by short-reads from the Illumina platform. The pipeline successfully processed more than 1.5 million long-reads and more than 900 million short-reads into error-corrected HQ reads. A surprisingly high percentage (32%) represented expressed interspersed repeats, while the remaining were processed into 71 461 full-length mRNAs from 23 071 loci. Each transcript was supported by several single-molecule long-read sequences and at least three short-reads, assuring a high sequence accuracy. On average, each gene was represented by three isoforms. Comparisons to the current Atlantic salmon transcripts in the RefSeq database showed that the long-read transcriptome validated 25% of all known transcripts, while the remaining full-length transcripts were novel isoforms, but few were transcripts from novel genes. A comparison to the current genome assembly indicates that the long-read transcriptome may aid in improving transcript annotation as well as provide long-read linkage information useful for improving the genome assembly. 
More than 80% of transcripts were assigned GO terms, and thousands of transcripts were from genes or splice-variants expressed in an organ-specific manner, demonstrating that hybrid error-corrected long-read transcriptomes may be applied to study genes and splice-variants expressed in certain organs or conditions (e.g., challenge materials). In conclusion, this is the single largest contribution of full-length mRNAs in Atlantic salmon. The results will be of great value to salmon genomics research, and the pipeline outlined may be applied to generate additional de novo transcriptomes in Atlantic salmon or to similar projects in other species.

2021 ◽  
Author(s):  
Lauren Coombe ◽  
Janet X Li ◽  
Theodora Lo ◽  
Johnathan Wong ◽  
Vladimir Nikolic ◽  
...  

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.


2019 ◽  
Author(s):  
Aaron M. Wenger ◽  
Paul Peluso ◽  
William J. Rowell ◽  
Pi-Chuan Chang ◽  
Richard J. Hall ◽  
...  

Abstract The major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized existing tools to comprehensively detect variants, achieving precision and recall above 99.91% for SNVs, 95.98% for indels, and 95.99% for structural variants. We estimate that 2,434 discordances are correctable mistakes in the high-quality Genome in a Bottle benchmark. Nearly all (99.64%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of 99.998%. CCS reads match short reads for small variant detection, while enabling structural variant detection and de novo assembly at similar contiguity and markedly higher concordance than noisy long reads.
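Several abstracts on this page report assembly contiguity as an N50 value. As a reminder of what that statistic measures, the following is a minimal sketch of the usual definition: the length of the contig at which the running sum of contig lengths, sorted longest first, reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions, not taken from any of the cited tools.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: 300 bases in total, so half is 150.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 80 + 70 = 150 reaches half, so this prints 70
```

NGA50, used in the LongStitch abstracts, follows the same idea but is computed over aligned blocks against a reference genome length, which this sketch does not attempt.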


Author(s):  
Robert Vaser ◽  
Mile Šikić

We present new methods for the improvement of long-read de novo genome assembly incorporated into a straightforward tool called Raven (https://github.com/lbcb-sci/raven). Compared with other assemblers, Raven is one of the two fastest, reconstructs the sequenced genome in the fewest fragments, has better or comparable accuracy, and maintains similar performance across various genomes. Raven takes 500 CPU hours to assemble a 44x human genome dataset in only 259 fragments.


Author(s):  
Shien Huang ◽  
Xinyu He ◽  
Guohua Wang ◽  
Ergude Bao

Abstract Contigs assembled from the third-generation sequencing long reads are usually more complete than those assembled from the second-generation short reads. However, the current algorithms still have difficulty in assembling the long reads into the ideal complete and accurate genome, or the theoretical best result [1]. To improve the long-read contigs, and with more and more fully sequenced genomes available, it could still be possible to use the similar genome-assisted reassembly method [2], which was initially proposed for short reads and makes use of a genome closely related (similar genome) to the genome being sequenced (target genome). The method aligns the contigs and reads to the similar genome, and then extends and refines the aligned contigs with the aligned reads. Here, we introduce AlignGraph2, a similar genome-assisted reassembly pipeline for the PacBio long reads. The AlignGraph2 pipeline is the second version of our AlignGraph algorithm, completely redesigned; it accepts either error-prone or HiFi long reads and contains four novel algorithms: a similarity-aware alignment algorithm and an alignment filtration algorithm for alignment of the long reads and preassembled contigs to the similar genome, and a reassembly algorithm and a weight-adjusted consensus algorithm for extension and refinement of the preassembled contigs. In our performance tests on both error-prone and HiFi long reads, AlignGraph2 can align 5.7–27.2% more long reads and 7.3–56.0% more bases than some current alignment algorithms and is more efficient than or comparable to the others. For contigs assembled with various de novo algorithms and aligned to similar genomes (aligned contigs), AlignGraph2 can extend 8.7–94.7% of them (extendable contigs), and obtain contigs with a 7.0–249.6% larger N50 value and 5.2–87.7% fewer indels per 100 kbp (extended contigs). With genomes of decreased similarity, AlignGraph2 also has relatively stable performance.
The AlignGraph2 software can be downloaded for free from this site: https://github.com/huangs001/AlignGraph2.


2020 ◽  
Author(s):  
Mohamed Awad ◽  
Xiangchao Gan

Abstract High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we propose GALA (Gap-free long-read assembler), a chromosome-by-chromosome assembly method implemented through a multi-layer computer graph that identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates a gap-free assembly free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, a reference genome and even motif analyses, to generate gap-free chromosome-scale assemblies. We de novo assembled the C. elegans and A. thaliana genomes using combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated the new method’s applicability with a gap-free assembly of a human genome with the help of a reference genome. In addition, GALA showed promising performance for PacBio high-fidelity long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.


2019 ◽  
Author(s):  
Han Ming Gan ◽  
Christopher M. Austin

Abstract Background Vibrio parahaemolyticus MVP1 was isolated from a Malaysian aquaculture farm affected with shrimp acute hepatopancreatic necrosis disease (AHPND). Its genome was previously sequenced on the Illumina MiSeq platform and assembled de novo, producing a relatively fragmented assembly. Although the binary toxin genes linked to AHPND were identified in the MVP1 draft genome, they were localized on a very small contig, precluding proper analysis of the gene neighbourhood. Methods The genome of Vibrio parahaemolyticus MVP1 was sequenced on the Nanopore MinION device to obtain long reads that can span longer repeats and improve genome contiguity. De novo genome assembly was subsequently performed using a long-read only assembler (Flye) followed by genome polishing, as well as a hybrid assembler (Unicycler). Results Long-read only assembly produced three complete circular MVP1 contigs consisting of chromosome 1, chromosome 2 and the pVa plasmid that carries the pirABvp binary toxin genes. Polishing of the long-read assembly with Illumina short reads was necessary to remove indel errors. The complete assembly of the pVa plasmid could not be achieved using Illumina reads due to the presence of identical repetitive elements flanking the binary toxin genes, leading to multiple contigs, whereas these regions were fully spanned by the Nanopore long reads, resulting in a single contig. In addition, alignment of Illumina reads to the complete genome assembly indicated a sequencing bias, as read depth was lowest in low-GC genomic regions. Comparative genomic analysis revealed the presence of a gene cluster coding for additional insecticidal toxins in chromosome 2 of MVP1 that may further contribute to host pathogenesis pending functional validation. Scanning of all publicly available V. parahaemolyticus genomes revealed the presence of a single AinS-family quorum-sensing system in this species that can be targeted for future microbial management. Conclusions We generated the first chromosome-scale genome assembly of a Malaysian pirABvp-bearing V. parahaemolyticus isolate. Structural variations identified from comparative genomic analysis provide new insights into the genomic features of V. parahaemolyticus MVP1 that may be associated with host colonization and pathogenicity.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Kelsi A. Lindblad ◽  
Jananan S. Pathmanathan ◽  
Sandrine Moreira ◽  
John R. Bracht ◽  
Robert P. Sebra ◽  
...  

Abstract Background Whole-genome shotgun sequencing, which stitches together millions of short sequencing reads into a single genome, ushered in the era of modern genomics and led to a rapid expansion of the number of genome sequences available. Nevertheless, assembly of short reads remains difficult, resulting in fragmented genome sequences. Ultimately, only a sequencing technology capable of capturing complete chromosomes in a single run could resolve all ambiguities. Even “third generation” sequencing technologies produce reads far shorter than most eukaryotic chromosomes. However, the ciliate Oxytricha trifallax has a somatic genome with thousands of chromosomes averaging only 3.2 kbp, making it an ideal candidate for exploring the benefits of sequencing whole chromosomes without assembly. Results We used single-molecule real-time sequencing to capture thousands of complete chromosomes in single reads and to update the published Oxytricha trifallax JRB310 genome assembly. In this version, over 50% of the completed chromosomes with two telomeres derive from single reads. The improved assembly includes over 12,000 new chromosome isoforms, and demonstrates that somatic chromosomes derive from variable rearrangements between somatic segments encoded up to 191,000 base pairs away. However, while long reads reduce the need for assembly, a hybrid approach that supplements long-read sequencing with short reads for error correction produced the most complete and accurate assembly, overall. Conclusions This assembly provides the first example of complete eukaryotic chromosomes captured by single sequencing reads and demonstrates that traditional approaches to genome assembly can mask considerable structural variation.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lauren Coombe ◽  
Janet X. Li ◽  
Theodora Lo ◽  
Johnathan Wong ◽  
Vladimir Nikolic ◽  
...  

Abstract Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. 
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
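The abstract above says that ntLink joins contigs using lightweight minimizer mappings. The following is a minimal sketch of the general window-minimizer idea only, not ntLink's implementation: the function names, the parameters `k` and `w`, and the toy sequences are all assumptions made for illustration. Real tools typically order k-mers by a hash value; plain lexicographic order is used here as a simplification.

```python
def minimizers(seq, k, w):
    """Return the set of (k-mer, position) window minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is kept (ties go to the leftmost position).
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        chosen.add(min(kmers[start:start + w]))
    return chosen


def shared_minimizers(a, b, k=4, w=3):
    """Return the minimizer k-mers that sequences a and b have in common."""
    kmers_a = {kmer for kmer, _ in minimizers(a, k, w)}
    kmers_b = {kmer for kmer, _ in minimizers(b, k, w)}
    return kmers_a & kmers_b


# Hypothetical toy data: a "read" copied out of a "contig".
contig = "GATTACAGATTACCA"
read = contig[3:12]
print(sorted(shared_minimizers(contig, read)))
```

Because every full window of the read is also a window of the contig, the two sequences are guaranteed to share at least one minimizer. This is the property that lets a sparse sample of k-mers stand in for a full alignment when deciding which contigs a long read touches.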


2018 ◽  
Author(s):  
Florencia Diaz-Viraque ◽  
Sebastian Pita ◽  
Gonzalo Greif ◽  
Rita de Cassia Moreira de Souza ◽  
Gregorio Iraola ◽  
...  

Chagas disease was described by Carlos Chagas, who first identified the parasite Trypanosoma cruzi from a two-year-old girl called Berenice. Many T. cruzi sequencing projects based on short reads have demonstrated that genome assembly and downstream comparative analyses are extremely challenging in this species, given that half of its genome is composed of repetitive sequences. Here, we report de novo assemblies, annotation and comparative analyses of the Berenice strain using a combination of Illumina short reads and MinION long reads. Our work demonstrates that Nanopore sequencing improves T. cruzi assembly contiguity and increases the assembly size by ~16 Mb. Specifically, we found that assembly improvement also refines the completeness of coding regions for both single copy genes and repetitive transposable elements. Beyond its historical and epidemiological importance, Berenice constitutes a fundamental resource since it now represents the best-quality assembly available for TcII, a highly prevalent lineage causing human infections in South America. The availability of the Berenice genome expands the known genetic diversity of T. cruzi and facilitates more comprehensive evolutionary inferences. Our work represents the first report of Nanopore technology used to resolve complex protozoan genomes, supporting its subsequent application for improving trypanosomatid and other highly repetitive genomes.


2021 ◽  
Author(s):  
Xiao Luo ◽  
Xiongbin Kang ◽  
Alexander Schoenhuth

Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly thanks to advantages of read length. However, current long-read assemblers usually introduce disturbing biases or fail to capture the haplotype diversity of the diploid genome. Here, we present phasebook, a novel approach for reconstructing the haplotypes of diploid genomes from long reads de novo. Benchmarking experiments demonstrate that our method outperforms other approaches in terms of haplotype coverage by large margins, while preserving competitive performance or even achieving advantages in terms of all other aspects relevant for genome assembly.

