High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies

AbstractSatellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.

Download Full-text

High satellite repeat turnover in great apes studied with short- and long-read technologies

10.1101/470054 ◽

2018 ◽

Cited By ~ 1

Author(s):

Monika Cechova ◽

Robert S. Harris ◽

Marta Tomaszkiewicz ◽

Barbara Arbeithuber ◽

Francesca Chiaromonte ◽

...

Keyword(s):

Great Apes ◽

Great Ape ◽

Satellite Repeat ◽

Satellite Sequences ◽

Satellite Repeats ◽

Repetitive Nature ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

AbstractSatellite repeats are a structural component of centromeres and telomeres, and in some instances their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: (1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and (2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males vs. females; using Y chromosome assemblies or FIuorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.

Download Full-text

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Download Full-text

Highly contiguous assemblies of 101 drosophilid genomes

10.1101/2020.12.14.422775 ◽

2020 ◽

Author(s):

Bernard Y Kim ◽

Jeremy Wang ◽

Danny E. Miller ◽

Olga Barmina ◽

Emily K. Delaney ◽

...

Keyword(s):

Community Resource ◽

High Quality ◽

Public Resource ◽

Oxford Nanopore ◽

Starting Point ◽

Long Read ◽

Wet Lab ◽

Species Groups ◽

Genome Assemblies ◽

High Quality Genome

Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long read sequencing allow high quality genome assemblies for tens or even hundreds of species to be generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of high-quality assemblies for 101 lines of 95 drosophilid species encompassing 14 species groups and 35 sub-groups with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. These assemblies, along with detailed wet lab protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution within this key group.

Download Full-text

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

10.21203/rs.3.rs-712747/v1 ◽

2021 ◽

Author(s):

Arang Rhie ◽

Ann Mc Cartney ◽

Kishwar Shafin ◽

Michael Alonge ◽

Andrey Bzikadze ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Abstract Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies

Download Full-text

A highly contiguous genome assembly of Brassica nigra (BB) and revised nomenclature for the pseudochromosomes

10.1101/2020.06.29.175869 ◽

2020 ◽

Author(s):

Kumar Paritosh ◽

Akshay Kumar Pradhan ◽

Deepak Pental

Keyword(s):

Genome Assembly ◽

Optical Mapping ◽

Indian Subcontinent ◽

Brassica Nigra ◽

Oilseed Crop ◽

B Genome ◽

Oxford Nanopore ◽

Long Read ◽

Black Mustard ◽

Genome Assemblies

AbstractBrassica nigra (BB), also called black mustard, is grown as a condiment crop in India. B. nigra represents the B genome of U’s triangle and is one of the progenitor species of B. juncea (AABB), an important oilseed crop of the Indian subcontinent. We report here a highly contiguous genome assembly of B. nigra variety Sangam. The genome assembly has been carried out using Oxford Nanopore long-read sequencing and optical mapping. The resulting chromosome-scale assembly is a significant improvement over the previous draft assemblies of B. nigra; five out of the eight pseudochromosomes were represented by one scaffold each. The assembled genome was annotated for the transposons, centromeric repeats, and genes. The B. nigra genome was compared with the recently available contiguous genome assemblies of B. rapa (AA), B. oleracea (CC), and B. juncea (AABB). Based on the maximum homology among the three diploid genomes of U’s triangle, we propose a new nomenclature for B. nigra pseudochromosomes, taking the B. rapa pseudochromosome nomenclature as the reference.

Download Full-text

Metagenomic data for Halichondria panicea from Illumina and Nanopore sequencing and preliminary genome assemblies for the sponge and two microbial symbionts.

10.1101/2021.10.18.464794 ◽

2021 ◽

Author(s):

Brian W Strehlow ◽

Astrid Schuster ◽

Warren R Francis ◽

Donald E Canfield

Keyword(s):

Additional Data ◽

Illumina Miseq ◽

Metagenomic Data ◽

Single Individual ◽

Halichondria Panicea ◽

Oxford Nanopore ◽

Long Read ◽

Microbial Symbionts ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Objectives: These data were collected to generate a novel reference metagenome for the sponge Halichondria panicea and its microbiome for subsequent differential expression analyses. Data description: These data include raw sequences from four separate sequencing runs of the metagenome of a single individual of H. panicea - one Illumina MiSeq (2x300 bp, paired-end) run and three Oxford Nanopore Technologies (ONT) long-read sequencing runs, generating 53.8 and 7.42 Gbp respectively. Comparing assemblies of Illumina, ONT and an Illumina-ONT hybrid revealed the hybrid to be the best assembly, comprising 163 Mbp in 63,555 scaffolds (N50: 3,084). This assembly, however, was still highly fragmented and only contained 52% of core metazoan genes (with 77.9% partial genes), so it was also not complete. However, this sponge is an emerging model species for field and laboratory work, and there is considerable interest in genomic sequencing of this species. Although the resultant assemblies from the data presented here are suboptimal, this data note can inform future studies by providing an estimated genome size and coverage requirements for future sequencing, sharing additional data to potentially improve other suboptimal assemblies of this species, and outlining potential limitations and pitfalls of the combined Illumina and ONT approach to novel genome sequencing.

Download Full-text

Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing

Communications Biology ◽

10.1038/s42003-021-02559-3 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Caroline Belser ◽

Franc-Christophe Baurens ◽

Benjamin Noel ◽

Guillaume Martin ◽

Corinne Cruaud ◽

...

Keyword(s):

Musa Acuminata ◽

Genetic Maps ◽

Nanopore Sequencing ◽

Genome Coverage ◽

Long Reads ◽

Oxford Nanopore ◽

A Genome ◽

Long Read ◽

Genome Assemblies ◽

First Time

AbstractLong-read technologies hold the promise to obtain more complete genome assemblies and to make them easier. Coupled with long-range technologies, they can reveal the architecture of complex regions, like centromeres or rDNA clusters. These technologies also make it possible to know the complete organization of chromosomes, which remained complicated before even when using genetic maps. However, generating a gapless and telomere-to-telomere assembly is still not trivial, and requires a combination of several technologies and the choice of suitable software. Here, we report a chromosome-scale assembly of a banana genome (Musa acuminata) generated using Oxford Nanopore long-reads. We generated a genome coverage of 177X from a single PromethION flowcell with near 17X with reads longer than 75 kbp. From the 11 chromosomes, 5 were entirely reconstructed in a single contig from telomere to telomere, revealing for the first time the content of complex regions like centromeres or clusters of paralogous genes.

Download Full-text

LINKS: Scaffolding genome assemblies with kilobase-long nanopore reads

10.1101/016519 ◽

2015 ◽

Cited By ~ 4

Author(s):

Rene L Warren ◽

Benjamin P Vandervalk ◽

Steven JM Jones ◽

Inanc Birol

Keyword(s):

Genome Assembly ◽

Error Rates ◽

Great Promise ◽

Genomic Information ◽

E Coli ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

K 12 ◽

Genome Assemblies

Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. Established and emerging long read technologies show great promise in this regard, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they could be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a solution that makes use of the information in error-rich long reads, without the need for read alignment or base correction. We show how the contiguity of an ABySS E. coli K-12 genome assembly could be increased over five-fold by the use of beta-released Oxford Nanopore Ltd. (ONT) long reads and how LINKS leverages long-range information in S. cerevisiae W303 ONT reads to yield an assembly with less than half the errors of competing applications. Re-scaffolding the colossal white spruce assembly draft (PG29, 20 Gbp) and how LINKS scales to larger genomes is also presented. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts. Availability: http://www.bcgsc.ca/bioinfo/software/links

Download Full-text

Draft genome assemblies using sequencing reads from Oxford Nanopore Technology and Illumina platforms for four species of North American Fundulus killifish

GigaScience ◽

10.1093/gigascience/giaa067 ◽

2020 ◽

Vol 9 (6) ◽

Cited By ~ 3

Author(s):

Lisa K Johnson ◽

Ruta Sahasrabudhe ◽

James Anthony Gill ◽

Jennifer L Roach ◽

Lutz Froenicke ◽

...

Keyword(s):

North American ◽

De Novo ◽

Draft Genome ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Sequence Coverage ◽

Short Read ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Background Whole-genome sequencing data from wild-caught individuals of closely related North American killifish species (Fundulus xenicus, Fundulus catenatus, Fundulus nottii, and Fundulus olivaceus) were obtained using long-read Oxford Nanopore Technology (ONT) PromethION and short-read Illumina platforms. Findings Draft de novo reference genome assemblies were generated using a combination of long and short sequencing reads. For each species, the PromethION platform was used to generate 30–45× sequence coverage, and the Illumina platform was used to generate 50–160× sequence coverage. Illumina-only assemblies were fragmented with high numbers of contigs, while ONT-only assemblies were error prone with low BUSCO scores. The highest N50 values, ranging from 0.4 to 2.7 Mb, were from assemblies generated using a combination of short- and long-read data. BUSCO scores were consistently >90% complete using the Eukaryota database. Conclusions High-quality genomes can be obtained from a combination of using short-read Illumina data to polish assemblies generated with long-read ONT data. Draft assemblies and raw sequencing data are available for public use. We encourage use and reuse of these data for assembly benchmarking and other analyses.

Download Full-text

Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacbio Sequel II system and ultralong reads of Oxford Nanopore

10.1101/2020.02.13.948489 ◽

2020 ◽

Author(s):

Dandan Lang ◽

Shilai Zhang ◽

Pingping Ren ◽

Fan Liang ◽

Zongyi Sun ◽

...

Keyword(s):

Gene Families ◽

Single Chromosome ◽

Small Indels ◽

Base Level ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Single Rice ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

AbstractThe availability of reference genomes has revolutionized the study of biology. Multiple competing technologies have been developed to improve the quality and robustness of genome assemblies during the last decade. The two widely-used long read sequencing providers – Pacbio (PB) and Oxford Nanopore Technologies (ONT) – have recently updated their platforms: PB enable high throughput HiFi reads with base-level resolution with >99% and ONT generated reads as long as 2 Mb. We applied the two up-to-date platforms to one single rice individual, and then compared the two assemblies to investigate the advantages and limitations of each. The results showed that ONT ultralong reads delivered higher contiguity producing a total of 18 contigs of which 10 were assembled into a single chromosome compared to that of 394 contigs and three chromosome-level contigs for the PB assembly. The ONT ultralong reads also prevented assembly errors caused by long repetitive regions for which we observed a total 44 genes of false redundancies and 10 genes of false losses in the PB assembly leading to over/under-estimations of the gene families in those long repetitive regions. We also noted that the PB HiFi reads generated assemblies with considerably less errors at the level of single nucleotide and small InDels than that of the ONT assembly which generated an average 1.06 errors per Kb assembly and finally engendered 1,475 incorrect gene annotations via altered or truncated protein predictions.

Download Full-text