Lost in Translation: The Pitfalls of Ensembl Gene Annotations Between Human Genome Assemblies and Their Impact on Diagnostics

Abstract

Background: The GRCh37 human genome assembly is still widely used in genomics even though an updated assembly (GRCh38) has been available for many years. A particular issue with ramifications for clinical genetics is that the GRCh37 Ensembl gene annotations have been archived, and thus not updated, since 2013. These annotations are as ubiquitous as the GRCh37 assembly itself and are the default gene models preferred by the majority of genomic projects internationally. In this study, we highlight genes with discrepant annotations: genes that are recognized as protein coding in the new assembly but not in the old one. These genes are ignored by all genomic resources that still rely on the archived, outdated gene annotations. Moreover, most if not all of these discrepant genes (DGs) are automatically discarded by variant prioritization tools that rely on the GRCh37 Ensembl gene annotations.

Methods: We performed a bioinformatics analysis to identify Ensembl genes with discrepant annotations between the two most recent human genome assemblies, GRCh37 and GRCh38. Clinical and phenotype gene curations were obtained and compared for this gene set. Matching RefSeq transcripts were also collated and analyzed.

Results: We found hundreds of genes (N=267) that were reclassified as “protein-coding” in the new GRCh38 assembly. Notably, 169 of these genes also had a discrepant HGNC gene symbol between the two assemblies. Most genes had RefSeq matches (N=199/267), including all the genes with defined phenotypes in the Ensembl GRCh38 annotation (N=10). However, many protein-coding genes remain missing from the current RefSeq gene models (N=68).

Conclusion: We found many clinically relevant genes in this group of neglected genes, and we anticipate that many more will be found relevant in the future. For these genes, the inaccurate label of “non-protein-coding” hinders the identification of any causal sequence variants that overlap them. In addition, important annotations such as evolutionary constraint metrics are not calculated for these genes for the same reason, further relegating them to oblivion.
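The comparison described in the Methods can be illustrated with a minimal Python sketch. It assumes each annotation release has already been reduced to a mapping from gene ID to biotype; the function name, the toy gene IDs, and the rule that a gene must be present in both releases are assumptions made for illustration, not the authors' actual pipeline.

```python
def find_discrepant_genes(old_biotypes, new_biotypes):
    """Return gene IDs that are protein coding in the new annotation only.

    Both arguments map a gene ID to its biotype string. Genes absent from
    the old annotation are skipped (an assumption of this sketch).
    """
    return sorted(
        gene_id
        for gene_id, new_type in new_biotypes.items()
        if new_type == "protein_coding"
        and gene_id in old_biotypes
        and old_biotypes[gene_id] != "protein_coding"
    )


# Toy, made-up records (hypothetical IDs, not real Ensembl genes).
grch37 = {"GENE_A": "lincRNA", "GENE_B": "protein_coding", "GENE_C": "pseudogene"}
grch38 = {
    "GENE_A": "protein_coding",  # reclassified: discrepant
    "GENE_B": "protein_coding",  # unchanged
    "GENE_C": "pseudogene",      # unchanged
    "GENE_D": "protein_coding",  # new gene, absent from the old release
}

discrepant = find_discrepant_genes(grch37, grch38)
print(discrepant)
```

In practice the two mappings would be parsed from the annotation files of each release; that parsing step is left out here.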

Lost in translation: the pitfalls of Ensembl Gene annotations between human genome assemblies and their impact on diagnostics

10.1101/2020.11.12.380295, 2020
Author(s): Mohammed O.E Abdallah, Mahmoud Koko, Raj Ramesar
Keyword(s): Human Genome, Genome Assembly, Evolutionary Constraint, Clinical Genetics, Ensembl Gene, Protein Coding, Gene Annotations, Human Genome Assembly, Gene Models, Genome Assemblies

ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers

10.1101/306902, 2018
Author(s): Lauren Coombe, Jessica Zhang, Benjamin P Vandervalk, Justin Chu, Shaun D Jackman, ...
Keyword(s): Human Genome, Genome Assembly, Draft Genome, Chromosome 1, Read Alignment, Mapping Strategy, Assembly Sequences, Human Genome Assembly, Run Time, Genome Assemblies

Abstract

Background: The long-range sequencing information captured by linked reads, such as those available from 10x Genomics (10xG), helps resolve genome sequence repeats and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it removes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements and drastically reduces the run time.

Results: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50=4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly, which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders. Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n=13).

Conclusions: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run-time performance on experimental human genome datasets. Furthermore, ARKS utilizes barcoding information from linked reads to estimate gap size. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies promptly. We expect ARKS to have broad utility in helping refine draft genomes.
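The Jaccard index mentioned in the Conclusions can be illustrated with a short Python sketch. It only shows the index itself on two sets of linked-read barcodes; the function name and the toy barcode sets are hypothetical, and how ARKS derives the sets from kmers or fits its distance model is not shown here.

```python
def jaccard_index(barcodes_a, barcodes_b):
    """Jaccard index |A & B| / |A | B| of two barcode collections."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


# Hypothetical barcodes seen at the end of one contig and the start of another.
end_of_contig_1 = {"BC01", "BC02", "BC03", "BC04"}
start_of_contig_2 = {"BC03", "BC04", "BC05"}

shared = jaccard_index(end_of_contig_1, start_of_contig_2)
print(shared)
```

Regions that lie close together on the same molecule tend to share more barcodes, so a higher index suggests a shorter distance between the two regions.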

KmerKeys: a web resource for searching indexed genome assemblies and variants

10.1101/2021.05.17.444256, 2021
Author(s): Dmitri S Pavlichin, HoJoon Lee, Stephanie U Greer, Susan M Grimes, Tsachy Weissman, ...
Keyword(s): Data Structure, Human Genome, DNA Sequences, Genome Assembly, Web Application, Genomic Sequence, Sequencing Analysis, Web Resource, Human Genome Assembly, Genome Assemblies

K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. Despite these current applications, the wider bioinformatic use of k-mers faces challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of these short sequences. The amount of computation needed to use k-mer information effectively is enormous, particularly when multiple genome assemblies are involved. To address these issues, we developed a new k-mer indexing data structure based on a hash table tuned for the lookup of k-mer keys. The resulting web application, KmerKeys (https://kmerkeys.dgi-stanford.org/), provides rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact k-mer-based searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massively parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata, such as the gnomAD population variant catalog. This feature enables the incorporation of future genomic information into sequencing analysis.
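The idea of a hash table keyed by k-mers can be sketched in a few lines of Python. This is a toy illustration using a plain dictionary for exact lookups only; the function names and sequences are hypothetical, and it does not reproduce the cache-friendly, memory-mapped structure or the fuzzy search that the abstract describes.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence name, 0-based position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def lookup(index, kmer):
    """Exact-match query; returns an empty list for an unseen k-mer."""
    return index.get(kmer, [])


# Two made-up "assemblies" indexed jointly.
toy_assemblies = {"asm1": "ACGTACGT", "asm2": "TTACGA"}
index = build_kmer_index(toy_assemblies, k=4)

hits = lookup(index, "TACG")
print(hits)
```

Indexing several assemblies in one table, as above, is what lets a single query report hits across all of them.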

Reference Genome Assembly for Australian Ascochyta rabiei Isolate ArME14

G3 Genes|Genome|Genetics, 10.1534/g3.120.401265, 2020, Vol 10 (7), pp. 2131-2140
Author(s): Ramisah Mohd Shah, Angela H. Williams, James K. Hane, Julie A. Lawrence, Lina M. Farfan-Caceres, ...
Keyword(s): Secondary Metabolite, DNA Sequences, Genome Assembly, Ascochyta Blight, Ascochyta Rabiei, Causal Organism, Protein Coding, Putative Protein, Genomic Regions, Genome Assemblies

Ascochyta rabiei is the causal organism of ascochyta blight of chickpea and is present in chickpea crops worldwide. Here we report the release of a high-quality PacBio genome assembly for the Australian A. rabiei isolate ArME14. We compare the ArME14 genome assembly with an Illumina assembly for the Indian A. rabiei isolate ArD2. The ArME14 assembly has gapless sequences for nine chromosomes with telomere sequences at both ends and 13 large contig sequences that extend to one telomere. The total length of the ArME14 assembly was 40,927,385 bp, which was 6.26 Mb longer than the ArD2 assembly. Division of the genome by OcculterCut into GC-balanced and AT-dominant segments reveals that 21% of the genome contains gene-sparse, AT-rich isochores. Transposable elements and repetitive DNA sequences in the ArME14 assembly made up 15% of the genome. A total of 11,257 protein-coding genes were predicted, compared with 10,596 for ArD2. Many of the predicted genes missing from the ArD2 assembly were in genomic regions adjacent to AT-rich sequence. We compared the complement of predicted transcription factors and secreted proteins for the two A. rabiei genome assemblies and found that the isolates contain almost the same set of proteins. The small number of differences could represent real differences in the gene complement between isolates or could result from the different sequencing methods used. Prediction pipelines were applied for carbohydrate-active enzymes, secondary metabolite clusters and putative protein effectors. We predict that ArME14 contains between 450 and 650 CAZymes, 39 putative protein effectors and 26 secondary metabolite clusters.

Liftoff: accurate mapping of gene annotations

Bioinformatics, 10.1093/bioinformatics/btaa1016, 2020
Author(s): Alaina Shumate, Steven L Salzberg
Keyword(s): Reference Genome, Supplementary Information, Closely Related Species, Protein Coding, Human Reference Genome, Sequence Identity, Gene Annotations, Genome Assemblies, Average Sequence Identity, High Quality Genome

Abstract

Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated.

Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity.

Availability and Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff

Supplementary information: Supplementary data are available at Bioinformatics online.
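Sequence identity, the quantity that the mapping maximizes, can be illustrated on a single pairwise alignment. The Python sketch below uses one common definition (matching columns divided by alignment columns, with gaps counted as mismatches); the function name and the toy alignment are hypothetical, and the exact definition Liftoff uses may differ.

```python
def sequence_identity(aligned_ref, aligned_target):
    """Fraction of alignment columns in which both sequences carry the same base.

    Inputs are two equal-length aligned strings with '-' marking gaps.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (r, t)
        for r, t in zip(aligned_ref, aligned_target)
        if not (r == "-" and t == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for r, t in columns if r == t)
    return matches / len(columns)


# Toy alignment: one gap and one substitution across ten columns.
identity = sequence_identity("ACGT-ACGTA", "ACGTTACCTA")
print(identity)
```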

Annotation of suprachromosomal families reveals uncommon types of alpha satellite organization in pericentromeric regions of hg38 human genome assembly

Genomics Data, 10.1016/j.gdata.2015.05.035, 2015, Vol 5, pp. 139-146, Cited By ~ 21
Author(s): V.A. Shepelev, L.I. Uralsky, A.A. Alexandrov, Y.B. Yurov, E.I. Rogaev, ...
Keyword(s): Human Genome, Genome Assembly, Alpha Satellite, Human Genome Assembly

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

10.21203/rs.3.rs-712747/v1, 2021
Author(s): Arang Rhie, Ann Mc Cartney, Kishwar Shafin, Michael Alonge, Andrey Bzikadze, ...
Keyword(s): Genome Assembly, Tandem Repeats, Hydatidiform Mole, Segmental Duplications, Sequencing Technologies, Oxford Nanopore, Human Genome Assembly, Long Read, Genome Assemblies, Oxford Nanopore Technologies

Abstract

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although the initial T2T draft assembly was derived from highly accurate sequencing, evaluation revealed evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
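To put the reported QV in context, the Python sketch below converts between a Phred-scaled quality value and a per-base error rate using QV = -10 * log10(p). The function names are hypothetical, and the sketch only restates the standard Phred scale; it does not show how the authors estimated QV.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


# QV 73.9 corresponds to roughly four errors per hundred million bases.
reported_error_rate = qv_to_error_rate(73.9)
print(f"{reported_error_rate:.2e}")
```

Each increase of 10 in QV corresponds to a tenfold drop in the error rate, so QV 30 means one error per thousand bases and QV 40 one per ten thousand.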

Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

10.1101/715722, 2019, Cited By ~ 21
Author(s): Kishwar Shafin, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E. Olsen, ...
Keyword(s): Human Genome, De Novo, Proximity Ligation, Current State, Human Genomes, Sequencing Method, Human Genome Assembly, Long Read, Genome Assemblies, Assembly Performance

Abstract

Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
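The read N50 quoted above is a length-weighted summary statistic: the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch follows; the function name and the toy read lengths are made up for illustration.

```python
def n50(lengths):
    """Length at which the cumulative sum of lengths, longest first, reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 is undefined for an empty collection")


# Toy read lengths summing to 300; the two longest reads already cover 150.
read_n50 = n50([10, 20, 30, 40, 50, 70, 80])
print(read_n50)
```

The same function applies to contig or scaffold lengths when summarizing an assembly rather than a read set.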

Complete genomic and epigenetic maps of human centromeres

10.1101/2021.07.12.452052, 2021
Author(s): Nicolas Altemose, Glennis Logsdon, Andrey V Bzikadze, Pragya Sidhwani, Sasha A Langley, ...
Keyword(s): Human Genome, Evolutionary Dynamics, Repetitive Sequences, Extensive Study, Sequence Evolution, Centromeric Repeat, Human Genome Assembly, Single Base Resolution, Genome Assemblies, Complete Genomic

Existing human genome assemblies have almost entirely excluded highly repetitive sequences within and near centromeres, limiting our understanding of their sequence, evolution, and essential role in chromosome segregation. Here, we present an extensive study of newly assembled peri/centromeric sequences representing 6.2% (189.9 Mb) of the first complete, telomere-to-telomere human genome assembly (T2T-CHM13). We discovered novel patterns of peri/centromeric repeat organization, variation, and evolution at both large and small length scales. We also found that inner kinetochore proteins tend to overlap the most recently duplicated subregions within centromeres. Finally, we compared chromosome X centromeres across a diverse panel of individuals and uncovered structural, epigenetic, and sequence variation at single-base resolution across these regions. In total, this work provides an unprecedented atlas of human centromeres to guide future studies of their complex and critical functions as well as their unique evolutionary dynamics.

SAGE2: parallel human genome assembly

Bioinformatics, 10.1093/bioinformatics/btx648, 2017, Vol 34 (4), pp. 678-680
Author(s): Michael Molnar, Ehsan Haghshenas, Lucian Ilie
Keyword(s): Human Genome, Genome Assembly, Human Genome Assembly
