Liftoff: accurate mapping of gene annotations

Bioinformatics ◽

10.1093/bioinformatics/btaa1016 ◽

2020 ◽

Author(s):

Alaina Shumate ◽

Steven L Salzberg

Keyword(s):

Reference Genome ◽

Supplementary Information ◽

Closely Related Species ◽

Protein Coding ◽

Human Reference Genome ◽

Sequence Identity ◽

Gene Annotations ◽

Genome Assemblies ◽

Average Sequence Identity ◽

High Quality Genome

Abstract. Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. Supplementary information: Supplementary data are available at Bioinformatics online.


Liftoff: an accurate gene annotation mapping tool

10.1101/2020.06.24.169680 ◽

2020 ◽

Cited By ~ 10

Author(s):

Alaina Shumate ◽

Steven L. Salzberg

Keyword(s):

Reference Genome ◽

Gene Annotation ◽

Closely Related Species ◽

Protein Coding ◽

Human Reference Genome ◽

Sequence Identity ◽

Mapping Tool ◽

Genome Assemblies ◽

Average Sequence Identity ◽

High Quality Genome

Abstract: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.4% of human protein-coding genes to a chimpanzee genome assembly with 98.7% sequence identity. Availability: The source code for Liftoff is available at https://github.com/agshumate/Liftoff


ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

Bioinformatics ◽

10.1093/bioinformatics/btaa253 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3885-3887 ◽

Cited By ~ 1

Author(s):

Lauren Coombe ◽

Vladimir Nikolić ◽

Justin Chu ◽

Inanc Birol ◽

René L Warren

Keyword(s):

Reference Sequence ◽

Supplementary Information ◽

Biological Research ◽

Closely Related Species ◽

Draft Assembly ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome

Abstract. Summary: The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 min, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation: ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin. Supplementary information: Supplementary data are available at Bioinformatics online.
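
For readers unfamiliar with the term, the following is a minimal sketch of an "ordered minimizer sketch" as mentioned in the abstract above. It is an illustration only, not ntJoin's implementation: the function name `minimizer_sketch`, the parameters `k` and `w`, and the use of lexicographic order to rank k-mers are all assumptions made here (real tools typically rank k-mers by a hash value).

```python
def minimizer_sketch(seq, k, w):
    """Return the ordered list of (position, k-mer) minimizers of seq.

    Hypothetical illustration: each window of w consecutive k-mers
    contributes its lexicographically smallest k-mer (leftmost on ties);
    a minimizer shared by adjacent windows is recorded only once.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(kmers) - w + 1):
        # Index of the smallest k-mer in this window; min() keeps the
        # first (leftmost) index when several k-mers tie.
        pos = min(range(start, start + w), key=lambda i: kmers[i])
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch


if __name__ == "__main__":
    for pos, kmer in minimizer_sketch("ACGTACGT", k=3, w=2):
        print(pos, kmer)
```

Because the sketch preserves the order of minimizers along the sequence, two assemblies that share a run of minimizers in the same order can be compared without a base-level alignment, which is the general idea behind the lightweight mapping the abstract describes.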


Recovery of non-reference sequences missing from the human reference genome

BMC Genomics ◽

10.1186/s12864-019-6107-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract. Background: The non-reference sequences (NRS) represent structure variations in human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structure variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their location. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or none) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS including many alternate alleles which are previously uncharacterized. We suggested that the origin of alternate alleles was associated with tandem repeats. Our results enriched the spectrum of genetic variations in human genome.


ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

10.1101/2020.01.13.905240 ◽

2020 ◽

Author(s):

Lauren Coombe ◽

Vladimir Nikolić ◽

Justin Chu ◽

Inanc Birol ◽

René L. Warren

Keyword(s):

Reference Sequence ◽

Biological Research ◽

Closely Related Species ◽

Draft Assembly ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome ◽

Reference Human Genome

Abstract. Summary: The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 min, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation: ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]


Genome sequence of the model rice variety KitaakeX

10.1101/653089 ◽

2019 ◽

Author(s):

Rashmi Jain ◽

Jerry Jenkins ◽

Shengqiang Shu ◽

Mawsheng Chern ◽

Joel A. Martin ◽

...

Keyword(s):

Oryza Sativa ◽

Rice Plant ◽

De Novo ◽

Rice Variety ◽

High Quality ◽

Protein Coding ◽

Genomic Variations ◽

Protein Coding Genes ◽

Gene Annotations ◽

High Quality Genome

Abstract: Here, we report the de novo genome sequencing and analysis of Oryza sativa ssp. japonica variety KitaakeX, a Kitaake plant carrying the rice XA21 immune receptor. Our KitaakeX sequence assembly contains 377.6 Mb, consisting of 33 scaffolds (476 contigs) with a contig N50 of 1.4 Mb. Complementing the assembly are detailed gene annotations of 35,594 protein coding genes. We identified 331,335 genomic variations between KitaakeX and Nipponbare (ssp. japonica), and 2,785,991 variations between KitaakeX and Zhenshan97 (ssp. indica). We also compared Kitaake resequencing reads to the KitaakeX assembly and identified 219 small variations. The high-quality genome of the model rice plant KitaakeX will accelerate rice functional genomics.
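
The "contig N50" quoted in the abstract above is a standard contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy contig lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length. Returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy lengths chosen for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

The related NGA50 statistic reported for ntJoin is computed in the same spirit, but over aligned blocks and relative to the reference genome length rather than the assembly length.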


Three chromosome-level duck genome assemblies provide insights into genomic variation during domestication

Nature Communications ◽

10.1038/s41467-021-26272-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Feng Zhu ◽

Zhong-Tao Yin ◽

Zheng Wang ◽

Jacqueline Smith ◽

Fan Zhang ◽

...

Keyword(s):

Mutant Cell ◽

Genomic Variation ◽

Pekin Duck ◽

Genomic Databases ◽

Protein Coding ◽

Avian Evolution ◽

Almost All ◽

Genome Assemblies ◽

High Quality Genome ◽

Chromosome Level

Abstract: Domestic ducks are raised for meat, eggs and feather down, and almost all varieties are descended from the Mallard (Anas platyrhynchos). Here, we report chromosome-level high-quality genome assemblies for meat and laying duck breeds, and the Mallard. Our new genomic databases contain annotations for thousands of new protein-coding genes and recover a major percentage of the presumed “missing genes” in birds. We obtain the entire genomic sequences for the C-type lectin (CTL) family members that regulate eggshell biomineralization. Our population and comparative genomics analyses provide more than 36 million sequence variants between duck populations. Furthermore, a mutant cell line allows confirmation of the predicted anti-adipogenic function of NR2F2 in the duck, and uncovered mutations specific to Pekin duck that potentially affect adipose deposition. Our study provides insights into avian evolution and the genetics of oviparity, and will be a rich resource for the future genetic improvement of commercial traits in the duck.


Near Chromosome-Level Genome Assembly and Annotation of Rhodotorula babjevae Strains Reveals High Intraspecific Divergence

10.20944/preprints202111.0517.v1 ◽

2021 ◽

Author(s):

Giselle C. Martin-Hernandez ◽

Bettina Müller ◽

Christian Brandt ◽

Martin Hölzer ◽

Adrian Viehweger ◽

...

Keyword(s):

Gc Content ◽

Pairwise Identity ◽

Protein Coding ◽

Intraspecific Divergence ◽

Protein Coding Genes ◽

Fungal Evolution ◽

Lignocellulose Hydrolysate ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome

The genus Rhodotorula includes basidiomycetous oleaginous yeast species. R. babjevae can produce compounds of biotechnological interest such as lipids, carotenoids and biosurfactants from low value substrates such as lignocellulose hydrolysate. High-quality genome assemblies are needed to develop genetic tools and to understand fungal evolution and genetics. Here, we combined short- and long-read sequencing to resolve the genomes of two R. babjevae strains, CBS 7808 (type strain) and DBVPG 8058 at chromosomal level. Both genomes have a size of 21 Mbp and a GC content of 68.2%. Allele frequency analysis indicated tetraploidy in both strains. They harbor 21 putative chromosomes with sizes ranging from 0.4 to 2.4 Mb. In both assemblies, the mitochondrial genome was recovered in a single contig, which shared 97% pairwise identity. The pairwise identity between the majority of chromosomes ranges from 82% to 87%. We found indications for strain-specific extrachromosomal endogenous DNA. 7,591 protein-coding genes and 7,607 associated transcripts were annotated in CBS 7808 and 7,481 protein-coding genes and 7,516 associated transcripts in DBVPG 8058. CBS 7808 has accumulated a higher number of tandem duplications than DBVPG 8058. We identified large translocation events between putative chromosomes and a high genetic divergence between the two strains.


Assembly and Annotation of an Ashkenazi Human Reference Genome

10.1101/2020.03.18.997395 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alaina Shumate ◽

Aleksey V. Zimin ◽

Rachel M. Sherman ◽

Daniela Puiu ◽

Justin M. Wagner ◽

...

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Gene Families ◽

Gene Content ◽

Specific Reference ◽

Protein Coding ◽

Human Reference Genome ◽

Protein Coding Genes ◽

Reference Genomes ◽

Similar Gene

Abstract: Here we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. 40 of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. 11 genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.


Lost in Translation: The Pitfalls of Ensembl Gene Annotations Between Human Genome Assemblies and Their Impact on Diagnostics

10.21203/rs.3.rs-131927/v1 ◽

2020 ◽

Author(s):

Mohammed O.E Abdallah ◽

Mahmoud Koko ◽

Raj Ramesar

Keyword(s):

Human Genome ◽

Genome Assembly ◽

Evolutionary Constraint ◽

Clinical Genetics ◽

Ensembl Gene ◽

Protein Coding ◽

Gene Annotations ◽

Human Genome Assembly ◽

Gene Models ◽

Genome Assemblies

Abstract. Background: The GRCh37 human genome assembly is still widely used in genomics despite the fact that an updated human genome assembly (GRCh38) has been available for many years. A particular issue with relevant ramifications for clinical genetics is the case of the GRCh37 Ensembl gene annotations, which have been archived, and thus not updated, since 2013. These Ensembl GRCh37 gene annotations are just as ubiquitous as the former assembly and are the default gene models used and preferred by the majority of genomic projects internationally. In this study, we highlight the issue of genes with discrepant annotations that have been recognized as protein coding in the new but not the old assembly. These genes are ignored by all genomic resources that still rely on the archived and outdated gene annotations. Moreover, the majority if not all of these discrepant genes (DGs) are automatically discarded and ignored by all variant prioritization tools that rely on the GRCh37 Ensembl gene annotations. Methods: We performed bioinformatics analysis identifying Ensembl genes with discrepant annotations between the two most recent human genome assemblies, hg37 and hg38. Clinical and phenotype gene curations have been obtained and compared for this gene set. Furthermore, matching RefSeq transcripts have also been collated and analyzed. Results: We found hundreds of genes (N=267) that were reclassified as “protein-coding” in the new hg38 assembly. Notably, 169 of these genes also had a discrepant HGNC gene symbol between the two assemblies. Most genes had RefSeq matches (N=199/267), including all the genes with defined phenotypes in Ensembl genes GRCh38 assembly (N=10). However, many protein-coding genes remain missing from the current known RefSeq gene models (N=68). Conclusion: We found many clinically relevant genes in this group of neglected genes and we anticipate that many more will be found relevant in the future. For these genes, the inaccurate label of “non-protein-coding” hinders the possibility of identifying any causal sequence variants that overlap them. In addition, important additional annotations such as evolutionary constraint metrics are also not calculated for these genes for the same reason, further relegating them into oblivion.


Recovery of non-reference sequences missing from the human reference genome

10.21203/rs.2.11742/v1 ◽

2019 ◽

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Protein Coding Genes ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract: The non-reference sequences (NRS) represent structure variations in human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structure variations with NRS exist. Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their location. We resolved the precise location of 6,113 NRS adding up to 12.8 Mb. Besides 1,571 insertions, we detected 3,041 alternate alleles, which were defined as having less than 90% (or none) identity with the reference alleles. These alternate alleles overlapped with 1,143 protein-coding genes including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had high content of tandem repeats, indicating that their origin was associated with tandem repeats. Our study enriched the spectrum of human genetic variations.
