Using multiple reference genomes to identify and resolve annotation inconsistencies

Abstract
Background: Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within better-studied systems. While these new genome assemblies share significant portions of synteny with one another, the annotated structure of gene models within these regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses.
Results: We developed a high-throughput method based on pairwise comparisons of annotations that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model. We demonstrate the utility of our method using gene annotations of three reference genomes from maize (B73, PH207, and W22), a difficult system from an annotation perspective due to the size and complexity of the genome. On average, we find several hundred of these potential split-gene misannotations in each pairwise comparison, corresponding to 3-5% of gene models across annotations. To determine which state (i.e. one gene or multiple genes) is biologically supported, we utilize RNAseq data from 10 tissues throughout development along with a novel metric and simulation framework. The methods we have developed require minimal human interaction and can be applied to future assemblies to aid in annotation efforts.
Conclusions: Split-gene misannotations occur at appreciable frequency in maize annotations. We have developed a method to easily identify and correct these misannotations. Importantly, this method is generic in that it can utilize any type of short-read expression data. Failure to account for split-gene misannotations has serious consequences for biological inference, particularly for expression-based analyses.
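
The pairwise detection idea summarized in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input format (a list of homology pairs between two annotations), the function name `find_split_gene_candidates`, and the strict adjacency rule are all assumptions made for illustration, and the expression-based support metric mentioned in the abstract is not modeled here.

```python
from collections import defaultdict

def find_split_gene_candidates(hits, gene_order_b):
    """Return {gene_a: [genes_b, ...]} for genes in annotation A that match
    two or more directly adjacent genes in annotation B.

    hits: iterable of (gene_a, gene_b) homology pairs (hypothetical input).
    gene_order_b: dict mapping gene_b -> (chromosome, rank along chromosome).
    """
    matches = defaultdict(set)
    for gene_a, gene_b in hits:
        matches[gene_a].add(gene_b)

    candidates = {}
    for gene_a, genes_b in matches.items():
        if len(genes_b) < 2:
            continue  # one-to-one match: nothing to flag
        placed = sorted(genes_b, key=lambda g: gene_order_b[g])
        chroms = {gene_order_b[g][0] for g in placed}
        ranks = [gene_order_b[g][1] for g in placed]
        # Assumed rule: all B genes sit on one chromosome and are consecutive.
        adjacent = all(b - a == 1 for a, b in zip(ranks, ranks[1:]))
        if len(chroms) == 1 and adjacent:
            candidates[gene_a] = placed
    return candidates

# Toy data (invented): A1 matches two neighbouring genes, A3 matches two
# genes that are separated by another gene and is therefore not flagged.
hits = [("A1", "B1"), ("A1", "B2"), ("A2", "B3"), ("A3", "B4"), ("A3", "B6")]
order = {"B1": ("chr1", 0), "B2": ("chr1", 1), "B3": ("chr1", 2),
         "B4": ("chr2", 0), "B6": ("chr2", 2)}
result = find_split_gene_candidates(hits, order)
print(result)
```

A flagged candidate only says that the two annotations disagree; deciding which state is biologically supported would still require independent evidence such as the RNAseq data the abstract describes.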


Towards complete and error-free genome assemblies of all vertebrate species

Nature ◽

10.1038/s41586-021-03451-0 ◽

2021 ◽

Vol 592 (7856) ◽

pp. 737-746 ◽

Cited By ~ 1

Author(s):

Arang Rhie ◽

Shane A. McCarthy ◽

Olivier Fedrigo ◽

Joana Damas ◽

Giulio Formenti ◽

...

Keyword(s):

Cost Effective ◽

Lessons Learned ◽

Vertebrate Species ◽

High Quality ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

Assembly Error ◽

Reference Genomes

Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.


Long-read based assembly and annotation of a Drosophila simulans genome

10.1101/425710 ◽

2018 ◽

Author(s):

Pierre Nouhaud

Keyword(s):

Drosophila Simulans ◽

Read Length ◽

Sequencing Technologies ◽

Final Assembly ◽

Rnaseq Data ◽

Long Read ◽

Gene Structures ◽

Full Length Transcript ◽

Genome Assemblies ◽

Nested Gene

Abstract: Long-read sequencing technologies enable high-quality, contiguous genome assemblies. Here we used SMRT sequencing to assemble the genome of a Drosophila simulans strain originating from Madagascar, the ancestral range of the species. We generated 8 Gb of raw data (~50× coverage) with a mean read length of 6,410 bp, an NR50 of 9,125 bp and the longest subread at 49 kb. We benchmarked six different assemblers and merged the best two assemblies from Canu and Falcon. Our final assembly was 127.41 Mb with an N50 of 5.38 Mb and 305 contigs. We anchored more than 4 Mb of novel sequence to the major chromosome arms, and significantly improved the assembly of peri-centromeric and telomeric regions. Finally, we performed full-length transcript sequencing and used these data in conjunction with short-read RNAseq data to annotate 13,422 genes in the genome, improving the annotation in regions with complex, nested gene structures.
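
For readers unfamiliar with the N50 statistic reported in this abstract, the following Python sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running sum first reaches half of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one contig length")

# Invented lengths: total = 290, half = 145; 80 + 70 = 150 reaches half.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about whether the contigs are assembled correctly.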


Towards complete and error-free genome assemblies of all vertebrate species

10.1101/2020.05.22.110833 ◽

2020 ◽

Cited By ~ 16

Author(s):

Arang Rhie ◽

Shane A. McCarthy ◽

Olivier Fedrigo ◽

Joana Damas ◽

Giulio Formenti ◽

...

Keyword(s):

Cost Effective ◽

Lessons Learned ◽

Vertebrate Species ◽

High Quality ◽

New Era ◽

Sequencing Technologies ◽

Long Read ◽

Cartilaginous Fishes ◽

Genome Assemblies ◽

Reference Genomes

Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are only available for a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling the most accurate and complete reference genomes to date. Here we summarize these developments, introduce a set of quality standards, and present lessons learned from sequencing and assembling 16 species representing major vertebrate lineages (mammals, birds, reptiles, amphibians, teleost fishes and cartilaginous fishes). We confirm that long-read sequencing technologies are essential for maximizing genome quality and that unresolved complex repeats and haplotype heterozygosity are major sources of error in assemblies. Our new assemblies identify and correct substantial errors in some of the best historical reference genomes. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.


DNA Methylation in Solid Tumors: Functions and Methods of Detection

International Journal of Molecular Sciences ◽

10.3390/ijms22084247 ◽

2021 ◽

Vol 22 (8) ◽

pp. 4247

Author(s):

Andrea Martisova ◽

Jitka Holcakova ◽

Nasim Izadi ◽

Ravery Sebuyoya ◽

Roman Hrstka ◽

...

Keyword(s):

Dna Methylation ◽

Solid Tumors ◽

Epigenetic Modification ◽

Single Gene ◽

Restriction Enzymes ◽

Methylation Analysis ◽

Cpg Dinucleotides ◽

Sequencing Technologies ◽

Genome Level ◽

Dna Methylation Analysis

DNA methylation, i.e., the addition of a methyl group to the 5-carbon of cytosine residues in CpG dinucleotides, is an important epigenetic modification that regulates gene expression and is thus implicated in many cellular processes. Deregulation of DNA methylation is strongly associated with the onset of various diseases, including cancer. Here, we review how DNA methylation affects the process of carcinogenesis and give examples of solid tumors where aberrant DNA methylation is often present. We explain the principles of methods developed for DNA methylation analysis at both the single-gene and whole-genome level, based on (i) sodium bisulfite conversion, (ii) methylation-sensitive restriction enzymes, and (iii) interactions of 5-methylcytosine (5mC) with methyl-binding proteins or antibodies against 5mC. In addition to standard methods, we describe recent advances in next-generation sequencing technologies applied to DNA methylation analysis, as well as in the development of biosensors that represent cheaper and faster alternatives. Most importantly, we highlight not only the advantages, but also the disadvantages and challenges of each method.


Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract: Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes, and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.


DNA Methylation and Intra-Clonal Heterogeneity: The Chronic Myeloid Leukemia Model

Cancers ◽

10.3390/cancers13143587 ◽

2021 ◽

Vol 13 (14) ◽

pp. 3587

Author(s):

Benjamin Lebecque ◽

Céline Bourgne ◽

Véronique Vidal ◽

Marc G. Berger

Keyword(s):

Dna Methylation ◽

Chronic Myeloid Leukemia ◽

Fusion Protein ◽

Myeloid Leukemia ◽

Kinase Inhibitors ◽

Single Gene ◽

Clonal Heterogeneity ◽

Sequencing Technologies ◽

Genome Wide ◽

The Impact

Chronic Myeloid Leukemia (CML) is a model to investigate the impact of tumor intra-clonal heterogeneity in personalized medicine. Indeed, tyrosine kinase inhibitors (TKIs) target the BCR-ABL fusion protein, which is considered the major CML driver. TKI use has highlighted the existence of intra-clonal heterogeneity, as indicated by the persistence of a minority subclone for several years despite the presence of the target fusion protein in all cells. Epigenetic modifications could partly explain this heterogeneity. This review summarizes the results of DNA methylation studies in CML. Next-generation sequencing technologies allowed for moving from single-gene to genome-wide analyses, showing that methylation abnormalities are much more widespread in CML cells. These data showed that global hypomethylation is associated with hypermethylation of specific sites already at diagnosis in the early phase of CML. The BCR-ABL-independence of some methylation profile alterations and the recent demonstration of the initial intra-clonal DNA methylation heterogeneity suggest that some DNA methylation alterations may be biomarkers of TKI sensitivity/resistance and of disease progression risk. These results also open perspectives for understanding the epigenetic/genetic background of CML predisposition and for developing new therapeutic strategies.


Bridging the Gap between Vertebrate Cytogenetics and Genomics with Single-Chromosome Sequencing (ChromSeq)

Genes ◽

10.3390/genes12010124 ◽

2021 ◽

Vol 12 (1) ◽

pp. 124

Author(s):

Alessio Iannucci ◽

Alexey I. Makunin ◽

Artem P. Lisachov ◽

Claudio Ciofi ◽

Roscoe Stanyon ◽

...

Keyword(s):

Genome Evolution ◽

Karyotype Evolution ◽

Genomic Data ◽

Anolis Carolinensis ◽

Vertebrate Genome ◽

Single Chromosome ◽

Sequencing Technologies ◽

Novel Approaches ◽

Genome Assemblies ◽

Generation Sequencing

The study of vertebrate genome evolution is currently facing a revolution, brought about by next-generation sequencing technologies that allow researchers to produce nearly complete and error-free genome assemblies. Novel approaches, however, do not always provide a direct link with information on vertebrate genome evolution gained from cytogenetic approaches. It is useful to preserve and link cytogenetic data with novel genomic discoveries. Sequencing of DNA from single isolated chromosomes (ChromSeq) is an elegant approach to determine the chromosome content and assign genome assemblies to chromosomes, thus bridging the gap between cytogenetics and genomics. The aim of this paper is to describe how ChromSeq can support the study of vertebrate genome evolution and how it can help link cytogenetic and genomic data. We show key examples of ChromSeq application in the refinement of vertebrate genome assemblies and in the study of vertebrate chromosome and karyotype evolution. We also provide a general overview of the approach and a concrete example of genome refinement using this method in the species Anolis carolinensis.


Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird-of-paradise

10.1101/2019.12.19.882399 ◽

2019 ◽

Cited By ~ 5

Author(s):

Valentina Peona ◽

Mozes P.K. Blom ◽

Luohao Xu ◽

Reto Burri ◽

Shawn Sullivan ◽

...

Keyword(s):

Dark Matter ◽

Genome Assembly ◽

Sex Chromosome ◽

De Novo ◽

Model Organism ◽

Technology Choice ◽

High Quality ◽

Sequencing Technologies ◽

Downstream Analysis ◽

Genome Assemblies

Abstract: Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies have opened up a whole new world of genomic biodiversity. Although these technologies generate high-quality genome assemblies, there are still genomic regions difficult to assemble, like repetitive elements and GC-rich regions (genomic “dark matter”). In this study, we compare the efficiency of currently used sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter starting from the same sample. By adopting different de-novo assembly strategies, we were able to compare each individual draft assembly to a curated multiplatform one and identify the nature of the previously missing dark matter with a particular focus on transposable elements, multi-copy MHC genes, and GC-rich regions. Thanks to this multiplatform approach, we demonstrate the feasibility of producing a high-quality chromosome-level assembly for a non-model organism (paradise crow) for which only suboptimal samples are available. Our approach was able to reconstruct complex chromosomes like the repeat-rich W sex chromosome and several GC-rich microchromosomes. Telomere-to-telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects around the completeness of both the coding and non-coding parts of the genomes.


A Deep Dive into Genome Assemblies of Non-vertebrate Animals

10.20944/preprints202111.0170.v1 ◽

2021 ◽

Author(s):

Nadège Guiglielmoni ◽

Ramón Rivera-Vicéns ◽

Romain Koszul ◽

Jean-François Flot

Keyword(s):

Genome Assembly ◽

Current Knowledge ◽

Genome Structure ◽

Deep Dive ◽

Sequencing Technologies ◽

Current State ◽

Animal Diversity ◽

And Function ◽

Genome Assemblies ◽

Genome Projects

Non-vertebrate species represent about 95% of known metazoan (animal) diversity. They remain to this day relatively unexplored genetically, but understanding their genome structure and function is pivotal for expanding our current knowledge of evolution, ecology and biodiversity. Following the continuous improvements and decreasing costs of sequencing technologies, many genome assembly tools have been released, leading to a significant number of genome projects being completed in recent years. In this review, we examine the current state of genome projects of non-vertebrate animal species. We present an overview of available sequencing technologies, assembly approaches, pre- and post-processing steps, and genome assembly evaluation methods, together with their application to non-vertebrate animal genomes.


SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

10.1101/2020.10.27.356907 ◽

2020 ◽

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Genetic Information ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Diploid Genome ◽

Insertions And Deletions ◽

Structural Variant ◽

Sequencing Technologies ◽

Variant Detection ◽

Genome Assemblies

Abstract
Motivation: With the availability of new sequencing technologies, the generation of haplotype-resolved genome assemblies up to chromosome scale has become feasible. These assemblies capture the complete genetic information of both parental haplotypes, increase structural variant (SV) calling sensitivity and enable direct genotyping and phasing of SVs. Yet, existing SV callers are designed for haploid genome assemblies only, do not support genotyping or detect only a limited set of SV classes.
Results: We introduce our method SVIM-asm for the detection and genotyping of six common classes of SVs from haploid and diploid genome assemblies. Compared against the only other existing SV caller for diploid assemblies, DipCall, SVIM-asm detects more SV classes and reaches higher F1 scores for the detection of insertions and deletions on two recently published assemblies of the HG002 individual.
Availability and Implementation: SVIM-asm has been implemented in Python and can be easily installed via bioconda. Its source code is available at github.com/eldariont/[email protected]
Supplementary information: Supplementary data are available online.
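
The F1 score used in this abstract to compare SV callers is the harmonic mean of precision and recall. The Python sketch below shows the standard calculation from true-positive, false-positive and false-negative counts; the function name `f1_score` and the example counts are invented for illustration and are not results from the paper.

```python
def f1_score(tp, fp, fn):
    """Return (precision, recall, F1) from call counts.

    tp: calls that match a truth-set variant
    fp: calls with no matching truth-set variant
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: 90 correct calls, 10 spurious calls, 30 missed variants.
precision, recall, f1 = f1_score(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In practice, what counts as a match between a call and a truth-set variant (position tolerance, size similarity, genotype agreement) is defined by the benchmarking procedure, which this sketch does not model.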
