The hidden structural variability in avian genomes

Structural variants (SVs) are DNA mutations that can have relevant effects at micro- and macro-evolutionary scales. The detection of SVs is largely limited by the type and quality of the sequencing technologies adopted; genetic variability linked to SVs may therefore remain undiscovered, especially in complex repetitive genomic regions. In this study, we used a combination of long-read and linked-read genome assemblies to investigate the occurrence of insertions and deletions across the chromosomes of 14 species of birds-of-paradise and two species of estrildid finches, including highly repetitive W chromosomes. The species sampling encompasses most genera and representatives from all major clades of birds-of-paradise, allowing comparisons between individuals of the same species, genus, and family. We found the highest densities of SVs to be located on the microchromosomes and on the female-specific W chromosome. Genome assemblies of multiple individuals from the same species allowed us to compare the levels of genetic variability linked to SVs and single nucleotide polymorphisms (SNPs) on the W and other chromosomes. Our results demonstrate that the avian W chromosome harbours more genetic variability than previously thought and that its structure is shaped by the continuous accumulation and turnover of transposable element insertions, especially endogenous retroviruses.

Read coverage as an indicator of misassembly in a short-read based genome assembly

10.1101/790337 ◽

2019 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Machine Learning ◽

Great Majority ◽

Significant Advance ◽

High Coverage ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies ◽

Simple Sequence

Abstract Availability of genome sequences has led to significant advances in biology. With few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues. In tomato, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read based assembly had significantly higher and lower coverage compared to background, respectively. We established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have lower simple sequence repeat but higher tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a machine learning model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to misassembly when using short reads.
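The idea of labelling regions as higher or lower coverage than background can be illustrated with a minimal Python sketch. It assumes per-window mean read depths are already available and uses simple multiples of the median as cut-offs; the function names, the thresholds (`high_factor`, `low_factor`) and the depth values are hypothetical and are not the statistical criteria used in the study.

```python
from statistics import median


def classify_windows(coverage, high_factor=2.0, low_factor=0.5):
    """Label each window 'high', 'low' or 'background' relative to the median depth.

    The factor thresholds are illustrative assumptions, not the study's criteria.
    """
    background = median(coverage)
    labels = []
    for depth in coverage:
        if depth > high_factor * background:
            labels.append("high")
        elif depth < low_factor * background:
            labels.append("low")
        else:
            labels.append("background")
    return labels


def fraction_by_label(labels):
    """Return the fraction of windows carrying each label."""
    total = len(labels)
    return {name: labels.count(name) / total
            for name in ("high", "low", "background")}


# Hypothetical mean read depth for ten equal-sized windows.
window_depth = [30, 31, 29, 95, 30, 8, 32, 30, 28, 31]
labels = classify_windows(window_depth)
fractions = fraction_by_label(labels)
```

With equal-sized windows, the fractions in `fractions` play the same role as the percentages of the assembly reported in the abstract, although a real analysis would work from alignment files and a proper significance test.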

Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

Abstract Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to an S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

10.1101/2020.01.13.905240 ◽

2020 ◽

Author(s):

Lauren Coombe ◽

Vladimir Nikolić ◽

Justin Chu ◽

Inanc Birol ◽

René L. Warren

Keyword(s):

Reference Sequence ◽

Biological Research ◽

Closely Related Species ◽

Draft Assembly ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome ◽

Reference Human Genome

Abstract Summary: The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly, and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 min, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation: ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]
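To make the phrase "ordered minimizer sketches" concrete, here is a minimal Python sketch of the general minimizer technique: for each window of `w` consecutive k-mers, keep the k-mer with the smallest hash, and record the selected k-mers in sequence order. This is an illustration only and not ntJoin's implementation; the hash function (BLAKE2b from `hashlib`), the parameter values and the toy sequences are assumptions chosen for readability.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (a stand-in for a fast rolling hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(sequence, k=5, w=4):
    """Return ordered (position, k-mer) minimizers of a sequence.

    In each window of w consecutive k-mers, the k-mer with the smallest hash
    is selected; ties are broken by the leftmost position.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    hashes = [kmer_hash(kmer) for kmer in kmers]
    sketch = []
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (hashes[i], i))
        # Consecutive windows often pick the same k-mer; record it only once.
        if not sketch or sketch[-1][0] != best:
            sketch.append((best, kmers[best]))
    return sketch


# Hypothetical toy sequences: the draft is a fragment of the reference.
reference = "ACGTTGCATGCCGTAAGCTTAGGCTA"
draft = reference[3:20]
ref_sketch = minimizer_sketch(reference)
draft_sketch = minimizer_sketch(draft)
shared = {kmer for _, kmer in draft_sketch} & {kmer for _, kmer in ref_sketch}
```

Because the same k-mers are selected wherever the same stretch of sequence occurs, minimizers shared between a draft and a reference can serve as anchors for ordering and orienting contigs without computing full alignments.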

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

10.21203/rs.3.rs-712747/v1 ◽

2021 ◽

Author(s):

Arang Rhie ◽

Ann Mc Cartney ◽

Kishwar Shafin ◽

Michael Alonge ◽

Andrey Bzikadze ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Abstract Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
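For readers unfamiliar with assembly QV, the short Python sketch below converts between a quality value and a per-base error rate. It assumes the QV quoted in the abstract is on the conventional Phred scale, QV = -10 log10(error rate); the function names are illustrative.

```python
import math


def qv_from_error_rate(error_rate):
    """Phred-scaled quality value for a per-base error rate."""
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10.0 ** (-qv / 10.0)


# Under the Phred assumption, QV 73.9 is about 4e-8 errors per base,
# i.e. on the order of one error every 25 million bases.
rate = error_rate_from_qv(73.9)
bases_per_error = 1.0 / rate
```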

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v2 ◽

2021 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v1 ◽

2020 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa075 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Abstract Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertions and deletions as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
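The comparison described above, read-inferred copy number versus observed copy number in the assembly, can be sketched in Python as a per-k-mer log ratio. The exact formula is not given in the abstract; the form used here, log2((c + m) / (m (n + 1))) with read count c, modal read depth m and assembly copy number n, is an assumption for illustration, as are the function names and toy sequences. The published KAD software may differ in detail.

```python
import math
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer occurring in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kad(read_count, assembly_copies, depth_mode):
    """Assumed KAD-style score: 0 when the read count matches the copy number.

    Negative values suggest a k-mer over-represented in the assembly (for
    example a base error); positive values suggest it is under-represented.
    """
    return math.log2((read_count + depth_mode)
                     / (depth_mode * (assembly_copies + 1)))


def kad_profile(read_counts, assembly_counts, depth_mode):
    """Score every k-mer seen in either the reads or the assembly."""
    kmers = set(read_counts) | set(assembly_counts)
    return {kmer: kad(read_counts.get(kmer, 0),
                      assembly_counts.get(kmer, 0),
                      depth_mode)
            for kmer in kmers}


# Hypothetical toy data: three identical reads (depth 3) and an assembly
# whose last base differs from the reads.
reads = ["ACGTAC", "ACGTAC", "ACGTAC"]
assembly = ["ACGTAG"]
profile = kad_profile(count_kmers(reads, 4), count_kmers(assembly, 4), depth_mode=3)
```

In this toy case the k-mers supported by both reads and assembly score 0, the assembly-only k-mer `GTAG` scores -1, and the read-only k-mer `GTAC` scores 1, which mirrors the idea of sorting k-mers into categories that reflect assembly quality.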

Sequence Transpositions Restore Genes on the Highly Degenerated W Chromosomes of Songbirds

Genes ◽

10.3390/genes11111267 ◽

2020 ◽

Vol 11 (11) ◽

pp. 1267

Author(s):

Luohao Xu ◽

Martin Irestedt ◽

Qi Zhou

Keyword(s):

Bird Species ◽

Purifying Selection ◽

Great Tit ◽

W Chromosome ◽

Evolutionary Forces ◽

Darwin’s Finches ◽

Birds Of Paradise ◽

Female Specific ◽

Dosage Imbalance ◽

Dosage Sensitivity

The female-specific W chromosomes of most Neognathae birds are highly degenerated and gene-poor. Previous studies have demonstrated that the gene repertoires of the Neognathae bird W chromosomes, despite being small, are conserved across bird species, likely due to purifying selection maintaining the regulatory and dosage-sensitive genes. Here we report the discovery of DNA-based sequence duplications from the Z to the W chromosome in birds-of-paradise (Paradisaeidae, Passeriformes), through sequence transposition. The original transposition involved nine genes, but only two of them (ANXA1 and ALDH1A1) survived on the W chromosomes. Both ANXA1 and ALDH1A1 are predicted to be dosage-sensitive, and the expression of ANXA1 is restricted to ovaries in all the investigated birds. These analyses suggest that genes newly transposed onto the W chromosomes can be favored for their role in restoring dosage imbalance or through female-specific selection. After examining seven additional songbird genomes, we further identified five other transposed genes on the W chromosomes of Darwin’s finches and one in the great tit, expanding the observation of the Z-to-W transpositions to a larger range of bird species, but not all transposed genes exhibit dosage-sensitivity or ovary-biased expression. We demonstrate a new mechanism by which the highly degenerated W chromosomes of songbirds can acquire genes from the homologous Z chromosomes, but further functional investigations are needed to validate the evolutionary forces underlying the transpositions.
