A chromosome-scale reference genome for Giardia intestinalis WB

Abstract Giardia intestinalis is a protist causing diarrhea in humans. The first G. intestinalis genome, from the WB isolate, was published more than ten years ago, and has been widely used as the reference genome for Giardia research. However, the genome is fragmented, thus hindering research at the chromosomal level. We re-sequenced the Giardia genome with PacBio long-read sequencing technology and obtained a new reference genome, which was assembled into near-complete chromosomes with only four internal gaps at long repeats. This new genome is not only more complete but also better annotated at both structural and functional levels, providing more details about gene families, gene organization and chromosomal structure. This near-complete reference genome will be a valuable resource for the Giardia community and protist research. It also showcases how a fragmented genome can be improved with long-read sequencing technology completed with optical maps.

QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽

2019 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

High Error Rate ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Abstract Motivation Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability https://github.com/joshidhaivat/QAlign.git
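
The conversion described in the abstract, from nucleotides to discretized current levels, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the k-mer length, the per-base weights, the thresholds and the function names are invented, and this is not QAlign's pore model or its code. It only shows the general shape of the idea: slide a k-mer window over the read, look up an expected current for each k-mer, and replace it with a coarse level.

```python
K = 3  # toy k-mer length (assumption; not taken from the paper)

# Hypothetical per-base contributions to the signal; NOT a real pore model.
BASE_WEIGHT = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
POSITION_WEIGHT = (1.0, 2.0, 1.0)  # made-up: the middle base dominates


def toy_current(kmer):
    """Return a made-up expected current for a k-mer (weighted average)."""
    total = sum(w * BASE_WEIGHT[b] for w, b in zip(POSITION_WEIGHT, kmer))
    return total / sum(POSITION_WEIGHT)


def quantize(seq, thresholds=(1.0, 2.0)):
    """Map every k-mer of seq to a discrete level 0..len(thresholds)."""
    levels = []
    for i in range(len(seq) - K + 1):
        current = toy_current(seq[i:i + K])
        # The level is the number of thresholds the current reaches.
        level = sum(current >= t for t in thresholds)
        levels.append(str(level))
    return "".join(levels)


print(quantize("ACGT"))  # two k-mers (ACG, CGT) -> two levels
```

In this toy model the k-mers AAA and AAC fall into the same level, so a substitution between them disappears after quantization. The quantized strings could then be passed to an ordinary sequence aligner in place of the nucleotide reads.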

EquCab3, an Updated Reference Genome for the Domestic Horse

10.1101/306928 ◽

2018 ◽

Cited By ~ 9

Author(s):

Theodore S. Kalbfleisch ◽

Edward S. Rice ◽

Michael S. DePriest ◽

Brian P. Walenz ◽

Matthew S. Hestand ◽

...

Keyword(s):

Reference Genome ◽

Reference Sequence ◽

Large Animal ◽

Domestic Horse ◽

Sequencing Technology ◽

Proximity Ligation ◽

Genomics Research ◽

Long Read ◽

Solid Foundation ◽

Work Done

Abstract EquCab2, a high-quality reference genome for the domestic horse, was released in 2007. Since then, it has served as the foundation for nearly all genomic work done in equids. Recent advances in genomic sequencing technology and computational assembly methods have allowed scientists to improve reference assemblies of large animal and plant genomes in terms of contiguity and composition. In 2014, the equine genomics research community began a project to improve the reference sequence for the horse, building upon the solid foundation of EquCab2 and incorporating new short-read data, long-read data, and proximity ligation data. The result, EquCab3, is presented here. The count of non-N bases in the incorporated chromosomes is improved from 2.33 Gb in EquCab2 to 2.41 Gb in EquCab3. Contiguity has also been improved nearly 40-fold with a contig N50 of 4.5 Mb and scaffold contiguity enhanced to where all but one of the 32 chromosomes consists of a single scaffold.
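
Several abstracts in this listing report a contig or scaffold N50. As a reminder of what that statistic measures, here is a minimal sketch of the standard definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. The function name and the example lengths are illustrative only and are not taken from any of the papers.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


print(n50([2, 3, 4, 5, 6]))  # total 20; 6 + 5 = 11 covers half -> 5
```

Because N50 depends only on the length distribution, it describes contiguity and says nothing about whether the sequences are assembled correctly.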

QAlign: aligning nanopore reads accurately using current-level modeling

Bioinformatics ◽

10.1093/bioinformatics/btaa875 ◽

2020 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

Supplementary Information ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Abstract Motivation Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this article, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2, 2.5 and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability and implementation https://github.com/joshidhaivat/QAlign.git. Supplementary information Supplementary data are available at Bioinformatics online.

High quality, phased genomes of Phytophthora ramorum clonal lineages NA1 and EU1

10.1101/2021.06.23.449625 ◽

2021 ◽

Author(s):

Nicholas C Carleson ◽

Caroline M Press ◽

Niklaus J Grunwald

Keyword(s):

Reference Genome ◽

Phytophthora Ramorum ◽

Sudden Oak Death ◽

Valuable Resource ◽

High Quality ◽

Total Size ◽

Protein Coding ◽

Protein Coding Genes ◽

Live Oak ◽

Long Read

Phytophthora ramorum is the causal agent of sudden oak death in West Coast forests and currently two clonal lineages, NA1 and EU1, cause epidemics in Oregon forests. Here, we report on two high-quality genomes of individuals belonging to the NA1 and EU1 clonal lineages respectively, using PacBio long-read sequencing. The NA1 strain Pr102, originally isolated from coast live oak in California, is the current reference genome and was previously sequenced independently using either Sanger (P. ramorum v1) or PacBio (P. ramorum v2) technology. The EU1 strain PR-15-019 was obtained from tanoak in Oregon. These new genomes have a total size of 57.5 Mb, with a contig N50 length of ~3.5-3.6 Mb and encode ~15,300 predicted protein-coding genes. Genomes were assembled into 27 and 28 scaffolds with 95% BUSCO scores and are considerably improved relative to the current JGI reference genome with 2,575 or the PacBio genomes with 1,512 scaffolds. These high-quality genomes provide a valuable resource for studying the genetics, evolution, and adaptation of these two clonal lineages.

SMRT sequencing yields the chromosome-scale reference genome of tea tree, Camellia sinensis var. sinensis

10.1101/2020.01.02.892430 ◽

2020 ◽

Cited By ~ 1

Author(s):

Qun-Jie Zhang ◽

Wei Li ◽

Kui Li ◽

Hong Nan ◽

Cong Shi ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

Repetitive Sequences ◽

Gene Families ◽

Chromosome Length ◽

Smrt Sequencing ◽

Protein Coding ◽

Tea Tree ◽

Long Read

Abstract Tea is the oldest and most popular nonalcoholic beverage consumed in the world. It provides abundant secondary metabolites that account for its diverse flavors and health benefits. Here we present the first high-quality chromosome-length reference genome of C. sinensis var. sinensis using long read single-molecule real time (SMRT) sequencing and Hi-C technologies to anchor the ∼2.85-Gb genome assembly into 15 pseudo-chromosomes with a scaffold N50 length of ∼195.68 Mb. We annotated at least 2.17 Gb (∼74.13%) of repetitive sequences and high-confidence prediction of 40,812 protein-coding genes in the ∼2.92-Gb genome assembly. This accurately assembled genome allows us to comprehensively annotate functionally important gene families such as those involved in the biosynthesis of catechins, theanine and caffeine. The contiguous genome assembly provides the first view of the repetitive landscape allowing us to accurately characterize retrotransposon diversity. The large tea tree genome is dominated by a handful of Ty3-gypsy long terminal repeat (LTR) retrotransposon families that recently expanded to high copy numbers. We uncover the latest bursts of numerous non-autonomous LTR retrotransposons that may interfere with the propagation of autonomous retroelements. This reference genome sequence will greatly facilitate the improvement of agronomically important traits relevant to tea quality and production.

Faculty Opinions recommendation of MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.732346961.793543720 ◽

2018 ◽

Author(s):

Charles Baer

Keyword(s):

Caenorhabditis Elegans ◽

Reference Genome ◽

Long Read

Haplotype-resolved genome of diploid ginger (Zingiber officinale) and its unique gingerol biosynthetic pathway

Horticulture Research ◽

10.1038/s41438-021-00627-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Hong-Lei Li ◽

Lin Wu ◽

Zhaoming Dong ◽

Yusong Jiang ◽

Sanjie Jiang ◽

...

Keyword(s):

Biosynthetic Pathway ◽

Southwest China ◽

Reference Genome ◽

Zingiber Officinale ◽

Gene Families ◽

Chromosome Conformation ◽

Long Reads ◽

Transcription Factor Networks ◽

Species Specific ◽

Haplotype 1

Abstract Ginger (Zingiber officinale), the type species of Zingiberaceae, is one of the most widespread medicinal plants and spices. Here, we report a high-quality, chromosome-scale reference genome of ginger ‘Zhugen’, a traditionally cultivated ginger in Southwest China used as a fresh vegetable, assembled from PacBio long reads, Illumina short reads, and high-throughput chromosome conformation capture (Hi-C) reads. The ginger genome was phased into two haplotypes, haplotype 1 (1.53 Gb with a contig N50 of 4.68 M) and haplotype 0 (1.51 Gb with a contig N50 of 5.28 M). Homologous ginger chromosomes maintained excellent gene pair collinearity. In 17,226 pairs of allelic genes, 11.9% exhibited differential expression between alleles. Based on the results of ginger genome sequencing, transcriptome analysis, and metabolomic analysis, we proposed a backbone biosynthetic pathway of gingerol analogs, which consists of 12 enzymatic gene families, PAL, C4H, 4CL, CST, C3’H, C3OMT, CCOMT, CSE, PKS, AOR, DHN, and DHT. These analyses also identified the likely transcription factor networks that regulate the synthesis of gingerol analogs. Overall, this study serves as an excellent resource for further research on ginger biology and breeding, lays a foundation for a better understanding of ginger evolution, and presents an intact biosynthetic pathway for species-specific gingerol biosynthesis.

Amynthas corticis genome reveals molecular mechanisms behind global distribution

Communications Biology ◽

10.1038/s42003-021-01659-4 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Xing Wang ◽

Yi Zhang ◽

Yufeng Zhang ◽

Mingming Kang ◽

Yuanbo Li ◽

...

Keyword(s):

Genome Assembly ◽

Molecular Mechanisms ◽

Gene Families ◽

The Body ◽

Gene Family Evolution ◽

Complex Environments ◽

Protein Coding ◽

Itraq Analysis ◽

Rdna Sequencing ◽

Long Read

Abstract Earthworms (Annelida: Crassiclitellata) are widely distributed around the world due to their ancient origin as well as adaptation and invasion after introduction into new habitats over the past few centuries. Herein, we report a 1.2 Gb complete genome assembly of the earthworm Amynthas corticis based on a strategy combining third-generation long-read sequencing and Hi-C mapping. A total of 29,256 protein-coding genes are annotated in this genome. Analysis of resequencing data indicates that this earthworm is a triploid species. Furthermore, gene family evolution analysis shows that comprehensive expansion of gene families in the Amynthas corticis genome has produced more defensive functions compared with other species in Annelida. Quantitative proteomic iTRAQ analysis shows that expression of 147 proteins changed in the body of Amynthas corticis and 16S rDNA sequencing shows that abundance of 28 microorganisms changed in the gut of Amynthas corticis when the earthworm was incubated with pathogenic Escherichia coli O157:H7. Our genome assembly provides abundant and valuable resources for the earthworm research community, serving as a first step toward uncovering the mysteries of this species, and may provide molecular level indicators of its powerful defensive functions, adaptation to complex environments and invasion ability.

HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding

BMC Bioinformatics ◽

10.1186/s12859-020-03939-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edwin A. Solares ◽

Yuan Tao ◽

Anthony D. Long ◽

Brandon S. Gaut

Keyword(s):

Cost Function ◽

Anopheles Funestus ◽

Hill Climbing ◽

Optimization Approach ◽

Sequencing Technology ◽

Genome Data ◽

A Genome ◽

Long Read ◽

Downstream Analysis ◽

The Cost

Abstract Background Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding. Results Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the Thorny Skate (Amblyraja radiata; 2650 Mb). Conclusions HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. The improvements for the mosquito’s largest three scaffolds, representing the number of chromosomes, were from 61 to 86%, and the improvement was even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.
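
The combination of a BUSCO-based cost function and hill climbing mentioned in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not HapSolo's implementation. The cost formula is a guess that uses the three quantities the abstract names (missing, duplicated and single BUSCO genes). `toy_busco` is a hypothetical stand-in that returns made-up counts instead of running BUSCO, and a single purge threshold replaces the multiple alignment metrics the tool considers.

```python
import random


def cost(missing, duplicated, single):
    """Hypothetical cost: missing and duplicated genes relative to single-copy ones."""
    return (missing + duplicated) / max(single, 1)


def toy_busco(threshold):
    """Stand-in for scoring an assembly purged at `threshold` (made-up numbers)."""
    duplicated = round(200 * (1.0 - threshold))  # lenient purging keeps duplicates
    missing = round(300 * threshold ** 2)        # aggressive purging loses genes
    single = 1000 - duplicated - missing
    return missing, duplicated, single


def hill_climb(start, steps=1000, step_size=0.05, seed=0):
    """Accept a random nearby threshold only when it lowers the cost."""
    rng = random.Random(seed)
    best = start
    best_cost = cost(*toy_busco(best))
    for _ in range(steps):
        candidate = best + rng.uniform(-step_size, step_size)
        candidate = min(1.0, max(0.0, candidate))  # keep threshold in [0, 1]
        candidate_cost = cost(*toy_busco(candidate))
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


threshold, final_cost = hill_climb(start=0.9)
print(f"threshold={threshold:.3f} cost={final_cost:.3f}")
```

Hill climbing of this kind can stall in a local minimum, so in practice one would typically restart it from several starting points.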

Long-read assembly and comparative evidence-based reanalysis of Cryptosporidium genome sequences reveals expanded transporter repertoire and duplication of entire chromosome ends including subtelomeric regions

Genome Research ◽

10.1101/gr.275325.121 ◽

2021 ◽

pp. gr.275325.121

Author(s):

Rodrigo P. Baptista ◽

Yiran Li ◽

Adam Sateriale ◽

Karen L. Brooks ◽

Alan Tracey ◽

...

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

Diarrheal Disease ◽

Gene Copy ◽

Future Research ◽

Single Nucleotide Variants ◽

Cryptosporidium Hominis ◽

Entire Chromosome ◽

Long Read ◽

Subtelomeric Regions

Cryptosporidiosis is a leading cause of waterborne diarrheal disease globally and an important contributor to mortality in infants and the immunosuppressed. Despite its importance, the Cryptosporidium community has only had access to a good, but incomplete, Cryptosporidium parvum IOWA reference genome sequence. Incomplete reference sequences hamper annotation, experimental design and interpretation. We have generated a new C. parvum IOWA genome assembly supported by PacBio and Oxford Nanopore long-read technologies and a new comparative and consistent genome annotation for three closely related species: C. parvum, Cryptosporidium hominis and Cryptosporidium tyzzeri. We made 1,926 C. parvum annotation updates based on experimental evidence. They include new transporters, ncRNAs, introns and altered gene structures. The new assembly and annotation revealed a complete Dnmt2 methylase ortholog. Comparative annotation between C. parvum, C. hominis and C. tyzzeri revealed that most "missing" orthologs are found, suggesting that the biological differences between the species must result from gene copy number variation, differences in gene regulation and single nucleotide variants (SNVs). Using the new assembly and annotation as reference, 190 genes are identified as evolving under positive selection, including many not detected previously. The new C. parvum IOWA reference genome assembly is larger, gap free and lacks ambiguous bases. This chromosomal assembly recovers all 16 chromosome ends, 13 of which are contiguously assembled. The three remaining chromosome ends are provisionally placed. These ends represent duplication of entire chromosome ends including subtelomeric regions, revealing a new level of genome plasticity that will both inform and impact future research.
