An Algorithm to Build a Multi-genome Reference

1AbstractMotivationNew DNA sequencing technologies have enabled the rapid analysis of many thousands of genomes from a single species. At the same time, the conventional approach of mapping sequencing reads against a single reference genome sequence is no longer adequate. However, even where multiple high-quality reference genomes are available, the problem remains how one would integrate results from pairwise analyses.ResultTo overcome the limits imposed by mapping sequence reads against a single reference genome, or serially mapping them against multiple reference genomes, we have developed the MGR method that allows simultaneous comparison against multiple high-quality reference genomes, in order to remove the bias that comes from using only a single-genome reference and to simplify downstream analyses. To this end, we present the MGR algorithm that creates a graph (MGR graph) as a multi-genome reference. To reduce the size and complexity of the multi-genome reference, highly similar orthologous1 and paralogous2 regions are collapsed while more substantial differences are retained. To evaluate the performance of our model, we have developed a genome compression tool, which can be used to estimate the amount of shared information between genomes.Availabilityhttps://github.com/LeilyR/[email protected]

Download Full-text

Chromosome-level assembly of the mustache toad genome using third-generation DNA sequencing and Hi-C analysis

GigaScience ◽

10.1093/gigascience/giz114 ◽

2019 ◽

Vol 8 (9) ◽

Cited By ~ 7

Author(s):

Yongxin Li ◽

Yandong Ren ◽

Dongru Zhang ◽

Hui Jiang ◽

Zhongkai Wang ◽

...

Keyword(s):

Breeding Season ◽

Reference Genome ◽

Gene Families ◽

Sequencing Data ◽

High Quality ◽

Chromosome Conformation ◽

Functional Studies ◽

Sequencing Technologies ◽

A Genome ◽

Chromosome Level

Abstract Background The mustache toad, Vibrissaphora ailaonica, is endemic to China and belongs to the Megophryidae family. Like other mustache toad species, V. ailaonica males temporarily develop keratinized nuptial spines on their upper jaw during each breeding season, which fall off at the end of the breeding season. This feature is likely result of the reversal of sexual dimorphism in body size, with males being larger than females. A high-quality reference genome for the mustache toad would be invaluable to investigate the genetic mechanism underlying these repeatedly developing keratinized spines. Findings To construct the mustache toad genome, we generated 225 Gb of short reads and 277 Gb of long reads using Illumina and Pacific Biosciences (PacBio) sequencing technologies, respectively. Sequencing data were assembled into a 3.53-Gb genome assembly, with a contig N50 length of 821 kb. We also used high-throughput chromosome conformation capture (Hi-C) technology to identify contacts between contigs, then assembled contigs into scaffolds and assembled a genome with 13 chromosomes and a scaffold N50 length of 412.42 Mb. Based on the 26,227 protein-coding genes annotated in the genome, we analyzed phylogenetic relationships between the mustache toad and other chordate species. The mustache toad has a relatively higher evolutionary rate and separated from a common ancestor of the marine toad, bullfrog, and Tibetan frog 206.1 million years ago. Furthermore, we identified 201 expanded gene families in the mustache toad, which were mainly enriched in immune pathway, keratin filament, and metabolic processes. Conclusions Using Illumina, PacBio, and Hi-C technologies, we constructed the first high-quality chromosome-level mustache toad genome. This work not only offers a valuable reference genome for functional studies of mustache toad traits but also provides important chromosomal information for wider genome comparisons.

Download Full-text

Genome assembly of the JD17 soybean provides a new reference genome for Comparative genomics

10.1101/2021.11.23.469778 ◽

2021 ◽

Author(s):

Xinxin Yi ◽

Jing Liu ◽

Shengcai Chen ◽

Hao Wu ◽

Min Liu ◽

...

Keyword(s):

Nitrogen Fixation ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Genomic Analysis ◽

Comparative Genomic ◽

High Quality ◽

Genome Wide ◽

A Genome ◽

Cultivated Soybean

Cultivated soybean (Glycine max) is an important source for protein and oil. Many elite cultivars with different traits have been developed for different conditions. Each soybean strain has its own genetic diversity, and the availability of more high-quality soybean genomes can enhance comparative genomic analysis for identifying genetic underpinnings for its unique traits. In this study, we constructed a high-quality de novo assembly of an elite soybean cultivar Jidou 17 (JD17) with chromsome contiguity and high accuracy. We annotated 52,840 gene models and reconstructed 74,054 high-quality full-length transcripts. We performed a genome-wide comparative analysis based on the reference genome of JD17 with three published soybeans (WM82, ZH13 and W05) , which identified five large inversions and two large translocations specific to JD17, 20,984 - 46,912 PAVs spanning 13.1 - 46.9 Mb in size, and 5 - 53 large PAV clusters larger than 500kb. 1,695,741 - 3,664,629 SNPs and 446,689 - 800,489 Indels were identified and annotated between JD17 and them. Symbiotic nitrogen fixation (SNF) genes were identified and the effects from these variants were further evaluated. It was found that the coding sequences of 9 nitrogen fixation-related genes were greatly affected. The high-quality genome assembly of JD17 can serve as a valuable reference for soybean functional genomics research.

Download Full-text

Kmer2SNP: reference-free SNP calling from raw reads based on matching

10.1101/2020.05.17.100305 ◽

2020 ◽

Author(s):

Yanbo Li ◽

Yu Lin

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Fundamental Problem ◽

Disease Diagnosis ◽

Hybrid Assembly ◽

Snp Calling ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Maximum Weight Matching ◽

Reference Genomes

AbstractThe development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations especially when high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using the reference genome, they have not been widely applied on large and complex genomes because existing approaches suffer from low recall/precision or high runtime.We develop a reference-free algorithm Kmer2SNP to call SNP directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers which only appear in one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and connecting edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching and finally outputs SNPs as edges between k-mer pairs in the matching.We benchmark Kmer2SNP against reference-free methods including hybrid (assembly-based) and assembly-free methods on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs only using k-mers from raw reads without assembly. The source code is freely available at https://github.com/yanboANU/Kmer2SNP.

Download Full-text

Evidence of multiple genome duplication events in Mytilus evolution

10.1101/2021.08.17.456601 ◽

2021 ◽

Author(s):

Ana Corrochano-Fraile ◽

Andrew Davie ◽

Stefano Carboni ◽

Michaël Bekaert

Keyword(s):

Whole Genome Duplication ◽

Evolutionary History ◽

Reference Genome ◽

Genome Duplication ◽

Correct Identification ◽

High Quality ◽

Sequencing Technologies ◽

Abiotic Stressors ◽

Duplication Events ◽

First Time

Molluscs remain one significantly under-represented taxa amongst available genomic resources, despite being the second-largest animal phylum and the recent advances in genomes sequencing technologies and genome assembly techniques. With the present work, we want to contribute to the growing efforts by filling this gap, presenting a new high-quality reference genome for Mytilus edulis and investigating the evolutionary history within the Mytilidae family, in relation to other species in the class Bivalvia. Here we present, for the first time, the discovery of multiple whole genome duplication events in the Mytilidae family and, more generally, in the class Bivalvia. In addition, the calculation of evolution rates for three species of the Mytilinae subfamily sheds new light onto the taxa evolution and highlights key orthologs of interest for the study of Mytilus species divergences. The reference genome presented here will enable the correct identification of molecular markers for evolutionary, population genetics, and conservation studies. Mytilidae have the capability to become a model shellfish for climate change adaptation using genome-enabled systems biology and multi-disciplinary studies of interactions between abiotic stressors, pathogen attacks, and aquaculture practises.

Download Full-text

Construction of a chromosome-scale long-read reference genome assembly for potato

GigaScience ◽

10.1093/gigascience/giaa100 ◽

2020 ◽

Vol 9 (9) ◽

Cited By ~ 3

Author(s):

Gina M Pham ◽

John P Hamilton ◽

Joshua C Wood ◽

Joseph T Burke ◽

Hainan Zhao ◽

...

Keyword(s):

Genome Sequence ◽

Reference Genome ◽

Agronomic Traits ◽

Solanum Tuberosum L ◽

Fold Increase ◽

High Quality ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies

Abstract Background Worldwide, the cultivated potato, Solanum tuberosum L., is the No. 1 vegetable crop and a critical food security crop. The genome sequence of DM1–3 516 R44, a doubled monoploid clone of S. tuberosum Group Phureja, was published in 2011 using a whole-genome shotgun sequencing approach with short-read sequence data. Current advanced sequencing technologies now permit generation of near-complete, high-quality chromosome-scale genome assemblies at minimal cost. Findings Here, we present an updated version of the DM1–3 516 R44 genome sequence (v6.1) using Oxford Nanopore Technologies long reads coupled with proximity-by-ligation scaffolding (Hi-C), yielding a chromosome-scale assembly. The new (v6.1) assembly represents 741.6 Mb of sequence (87.8%) of the estimated 844 Mb genome, of which 741.5 Mb is non-gapped with 731.2 Mb anchored to the 12 chromosomes. Use of Oxford Nanopore Technologies full-length complementary DNA sequencing enabled annotation of 32,917 high-confidence protein-coding genes encoding 44,851 gene models that had a significantly improved representation of conserved orthologs compared with the previous annotation. The new assembly has improved contiguity with a 595-fold increase in N50 contig size, 99% reduction in the number of contigs, a 44-fold increase in N50 scaffold size, and an LTR Assembly Index score of 13.56, placing it in the category of reference genome quality. The improved assembly also permitted annotation of the centromeres via alignment to sequencing reads derived from CENH3 nucleosomes. Conclusions Access to advanced sequencing technologies and improved software permitted generation of a high-quality, long-read, chromosome-scale assembly and improved annotation dataset for the reference genotype of potato that will facilitate research aimed at improving agronomic traits and understanding genome evolution.

Download Full-text

Towards complete and error-free genome assemblies of all vertebrate species

Nature ◽

10.1038/s41586-021-03451-0 ◽

2021 ◽

Vol 592 (7856) ◽

pp. 737-746 ◽

Cited By ~ 1

Author(s):

Arang Rhie ◽

Shane A. McCarthy ◽

Olivier Fedrigo ◽

Joana Damas ◽

Giulio Formenti ◽

...

Keyword(s):

Cost Effective ◽

Lessons Learned ◽

Vertebrate Species ◽

High Quality ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

Assembly Error ◽

Reference Genomes

AbstractHigh-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species1–4. To address this issue, the international Genome 10K (G10K) consortium5,6 has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

Download Full-text

A high-quality reference genome for a parasitic bivalve with doubly uniparental inheritance (Bivalvia: Unionida)

Genome Biology and Evolution ◽

10.1093/gbe/evab029 ◽

2021 ◽

Author(s):

Chase H Smith

Keyword(s):

Single Molecule ◽

Reference Genome ◽

Economic Value ◽

Freshwater Mussel ◽

Uniparental Inheritance ◽

Doubly Uniparental Inheritance ◽

High Quality ◽

Long Reads ◽

A Genome ◽

Genomic Studies

Abstract From a genomics perspective, bivalves (Mollusca: Bivalvia) have been poorly explored with the exception for those of high economic value. The bivalve order Unionida, or freshwater mussels, has been of interest in recent genomic studies due to their unique mitochondrial biology and peculiar life cycle. However, genomic studies have been hindered by the lack of a high-quality reference genome. Here, I present a genome assembly of Potamilus streckersoni using Pacific Bioscience single-molecule real-time long reads and 10X Genomics linked read sequencing. Further, I use RNA sequencing from multiple tissue types and life stages to annotate the reference genome. The final assembly was far superior to any previously published freshwater mussel genome and was represented by 2,368 scaffolds (2,472 contigs) and 1,776,755,624 bp, with a scaffold N50 of 2,051,244 bp. A high proportion of the assembly was comprised of repetitive elements (51.03%), aligning with genomic characteristics of other bivalves. The functional annotation returned 52,407 gene models (41,065 protein, 11,342 tRNAs), which was concordant with the estimated number of genes in other freshwater mussel species. This genetic resource, along with future studies developing high-quality genome assemblies and annotations, will be integral toward unraveling the genomic bases of ecologically and evolutionarily important traits in this hyper-diverse group.

Download Full-text

Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome

BMC Genomics ◽

10.1186/s12864-021-07493-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Richard J. Edwards ◽

Matt A. Field ◽

James M. Ferguson ◽

Olga Dudchenko ◽

Jens Keilwagen ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Genome Structure ◽

Canis Lupus Familiaris ◽

Structural Variations ◽

German Shepherd ◽

High Quality ◽

Entire Family ◽

The Impact ◽

Reference Genomes

Abstract Background Basenjis are considered an ancient dog breed of central African origins that still live and hunt with tribesmen in the African Congo. Nicknamed the barkless dog, Basenjis possess unique phylogeny, geographical origins and traits, making their genome structure of great interest. The increasing number of available canid reference genomes allows us to examine the impact the choice of reference genome makes with regard to reference genome quality and breed relatedness. Results Here, we report two high quality de novo Basenji genome assemblies: a female, China (CanFam_Bas), and a male, Wags. We conduct pairwise comparisons and report structural variations between assembled genomes of three dog breeds: Basenji (CanFam_Bas), Boxer (CanFam3.1) and German Shepherd Dog (GSD) (CanFam_GSD). CanFam_Bas is superior to CanFam3.1 in terms of genome contiguity and comparable overall to the high quality CanFam_GSD assembly. By aligning short read data from 58 representative dog breeds to three reference genomes, we demonstrate how the choice of reference genome significantly impacts both read mapping and variant detection. Conclusions The growing number of high-quality canid reference genomes means the choice of reference genome is an increasingly critical decision in subsequent canid variant analyses. The basal position of the Basenji makes it suitable for variant analysis for targeted applications of specific dog breeds. However, we believe more comprehensive analyses across the entire family of canids is more suited to a pangenome approach. Collectively this work highlights the importance the choice of reference genome makes in all variation studies.

Download Full-text

Chromosomal-level assembly of Juglans sigillata genome using Nanopore, BioNano, and Hi-C analysis

GigaScience ◽

10.1093/gigascience/giaa006 ◽

2020 ◽

Vol 9 (2) ◽

Cited By ~ 3

Author(s):

De-Lu Ning ◽

Tao Wu ◽

Liang-Jun Xiao ◽

Ting Ma ◽

Wen-Liang Fang ◽

...

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

Juglans Regia ◽

Future Research ◽

High Quality ◽

Illumina Hiseq ◽

A Genome ◽

Contact Frequency ◽

Juglans Sigillata ◽

Chromosome Level

Abstract Background Juglans sigillata, or iron walnut, belonging to the order Juglandales, is an economically important tree species in Asia, especially in the Yunnan province of China. However, little research has been conducted on J. sigillata at the molecular level, which hinders understanding of its evolution, speciation, and synthesis of secondary metabolites, as well as its wide adaptability to its plateau environment. To address these issues, a high-quality reference genome of J. sigillata would be useful. Findings To construct a high-quality reference genome for J. sigillata, we first generated 38.0 Gb short reads and 66.31 Gb long reads using Illumina and Nanopore sequencing platforms, respectively. The sequencing data were assembled into a 536.50-Mb genome assembly with a contig N50 length of 4.31 Mb. Additionally, we applied BioNano technology to identify contacts among contigs, which were then used to assemble contigs into scaffolds, resulting in a genome assembly with scaffold N50 length of 16.43 Mb and contig N50 length of 4.34 Mb. To obtain a chromosome-level genome assembly, we constructed 1 Hi-C library and sequenced 79.97 Gb raw reads using the Illumina HiSeq platform. We anchored ∼93% of the scaffold sequences into 16 chromosomes and evaluated the quality of our assembly using the high contact frequency heat map. Repetitive elements account for 50.06% of the genome, and 30,387 protein-coding genes were predicted from the genome, of which 99.8% have been functionally annotated. The genome-wide phylogenetic tree indicated an estimated divergence time between J. sigillata and Juglans regia of 49 million years ago on the basis of single-copy orthologous genes. Conclusions We provide the first chromosome-level genome for J. sigillata. It will lay a valuable foundation for future research on the genetic improvement of J. sigillata.

Download Full-text

Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1611012113 ◽

2016 ◽

Vol 113 (35) ◽

pp. E5163-E5171 ◽

Cited By ~ 99

Author(s):

Jianwei Zhang ◽

Ling-Ling Chen ◽

Feng Xing ◽

David A. Kudrna ◽

Wen Yao ◽

...

Keyword(s):

Reference Genome ◽

Indica Rice ◽

Sequence Divergence ◽

Segmental Duplications ◽

Cultivated Rice ◽

High Quality ◽

Rice Varieties ◽

The Public ◽

Structural Differences ◽

Reference Genomes

Asian cultivated rice consists of two subspecies: Oryza sativa subsp. indica and O. sativa subsp. japonica. Despite the fact that indica rice accounts for over 70% of total rice production worldwide and is genetically much more diverse, a high-quality reference genome for indica rice has yet to be published. We conducted map-based sequencing of two indica rice lines, Zhenshan 97 (ZS97) and Minghui 63 (MH63), which represent the two major varietal groups of the indica subspecies and are the parents of an elite Chinese hybrid. The genome sequences were assembled into 237 (ZS97) and 181 (MH63) contigs, with an accuracy >99.99%, and covered 90.6% and 93.2% of their estimated genome sizes. Comparative analyses of these two indica genomes uncovered surprising structural differences, especially with respect to inversions, translocations, presence/absence variations, and segmental duplications. Approximately 42% of nontransposable element related genes were identical between the two genomes. Transcriptome analysis of three tissues showed that 1,059–2,217 more genes were expressed in the hybrid than in the parents and that the expressed genes in the hybrid were much more diverse due to their divergence between the parental genomes. The public availability of two high-quality reference genomes for the indica subspecies of rice will have large-ranging implications for plant biology and crop genetic improvement.

Download Full-text