iLoci: Robust evaluation of genome content and organization for provisional and mature genome assemblies

2021 ◽  
Author(s):  
Daniel S Standage ◽  
Tim Lai ◽  
Volker P Brendel

Background: The rate at which new draft genome assemblies and corresponding annotation versions are being produced has long outpaced the scientific community's capacity to refine these drafts into finished, reference-quality data resources to a standard typically expected from dedicated efforts of model organism research communities. Nonetheless, scientists must be able to evaluate newly sequenced genomes in the context of previously published data, requiring summaries of genome content and organization that can be quickly computed, updated, and meaningfully compared. As annotation quality will necessarily vary within and across data sets, the ability to select subsets of only those data that are well supported is critical for distinguishing technical artifacts from biological effects in genome-wide analyses. Results: We introduce a new framework for genome analyses based on parsing an annotated genome assembly into distinct interval loci (iLoci), available as open source software as part of the AEGeAn Toolkit (https://github.com/BrendelGroup/AEGeAn). We demonstrate that iLoci provide an alternative coordinate system that is robust to changes in assembly and annotation versions and facilitates granular quality control of genome data. We discuss how statistics computed on iLoci reflect various characteristics of genome content and organization and illustrate how these statistics can be used to establish a baseline for assessment of the completeness and accuracy of the data. We also introduce a well-defined measure of relative genome compactness and compute other iLocus statistics that reveal genome-wide characteristics of gene arrangements in the whole genome context. Conclusions: We present a coherent computational framework that calculates informative statistics from genome assembly/annotation data input. 
Given the fast pace of assembly/annotation updates, our AEGeAn Toolkit fills a niche in computational genomics based on deriving persistent and species-specific genome statistics. Gene structure model centric iLoci provide a precisely defined coordinate system that can be used to store assembly/annotation updates that reflect either stable or changed assessments. Large-scale application of the approach revealed species and clade specific genome organization in precisely defined computational terms, promising intriguing forays into the forces of shaping genome structure as more and more genome assemblies are being deposited.

Author(s):  
Nadège Guiglielmoni ◽  
Ramón Rivera-Vicéns ◽  
Romain Koszul ◽  
Jean-François Flot

Non-vertebrate species represent about 95% of known metazoan (animal) diversity. They remain to this day relatively unexplored genetically, but understanding their genome structure and function is pivotal for expanding our current knowledge of evolution, ecology and biodiversity. Following the continuous improvements and decreasing costs of sequencing technologies, many genome assembly tools have been released, leading to a significant number of genome projects being completed in recent years. In this review, we examine the current state of genome projects of non-vertebrate animal species. We present an overview of available sequencing technologies, assembly approaches, as well as pre- and post-processing steps, genome assembly evaluation methods, and their application to non-vertebrate animal genomes.


Genes ◽  
2021 ◽  
Vol 12 (9) ◽  
pp. 1336
Author(s):  
Azamat Totikov ◽  
Andrey Tomarovsky ◽  
Dmitry Prokopov ◽  
Aliya Yakupova ◽  
Tatiana Bulyonkova ◽  
...  

Genome assemblies are in the process of becoming an increasingly important tool for understanding genetic diversity in threatened species. Unfortunately, due to limited budgets typical for the area of conservation biology, genome assemblies of threatened species, when available, tend to be highly fragmented, represented by tens of thousands of scaffolds not assigned to chromosomal locations. The recent advent of high-throughput chromosome conformation capture (Hi-C) enables more contiguous assemblies containing scaffolds spanning the length of entire chromosomes for little additional cost. These inexpensive contiguous assemblies can be generated using Hi-C scaffolding of existing short-read draft assemblies, where N50 of the draft contigs is larger than 0.1% of the estimated genome size and can greatly improve analyses and facilitate visualization of genome-wide features including distribution of genetic diversity in markers along chromosomes or chromosome-length scaffolds. We compared distribution of genetic diversity along chromosomes of eight mammalian species, including six listed as threatened by IUCN, where both draft genome assemblies and newer chromosome-level assemblies were available. The chromosome-level assemblies showed marked improvement in localization and visualization of genetic diversity, especially where the distribution of low heterozygosity across the genomes of threatened species was not uniform.
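The contiguity criterion in the abstract above (draft contig N50 larger than 0.1% of the estimated genome size) can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation: it assumes the common definition of N50 as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length, and the function names `n50` and `suitable_for_hic` are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the
    total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def suitable_for_hic(contig_lengths, estimated_genome_size, threshold=0.001):
    """Hypothetical helper: test whether contig N50 exceeds 0.1% of genome size.

    The 0.1% threshold comes from the abstract; everything else here is an
    illustrative assumption.
    """
    return n50(contig_lengths) > threshold * estimated_genome_size
```

For instance, a 2.5 Gb genome would need a draft contig N50 above 2.5 Mb to satisfy this check.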


Gigabyte ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Julia Voelker ◽  
Mervyn Shepherd ◽  
Ramil Mauleon

The economically important Melaleuca alternifolia (tea tree) is the source of a terpene-rich essential oil with therapeutic and cosmetic uses around the world. Tea tree has been cultivated and bred in Australia since the 1990s. It has been extensively studied for the genetics and biochemistry of terpene biosynthesis. Here, we report a high quality de novo genome assembly using Pacific Biosciences and Illumina sequencing. The genome was assembled into 3128 scaffolds with a total length of 362 Mb (N50  = 1.9 Mb), with significantly higher contiguity than a previous assembly (N50  = 8.7 Kb). Using a homology-based, RNA-seq evidence-based and ab initio prediction approach, 37,226 protein-coding genes were predicted. Genome assembly and annotation exhibited high completeness scores of 98.1% and 89.4%, respectively. Sequence contiguity was sufficient to reveal extensive gene order conservation and chromosomal rearrangements in alignments with Eucalyptus grandis and Corymbia citriodora genomes. This new genome advances currently available resources to investigate the genome structure and gene family evolution of M. alternifolia. It will enable further comparative genomic studies in Myrtaceae to elucidate the genetic foundations of economically valuable traits in this crop.


Marine Drugs ◽  
2019 ◽  
Vol 17 (7) ◽  
pp. 386 ◽  
Author(s):  
Chao Bian ◽  
Jia Li ◽  
Xueqiang Lin ◽  
Xiyang Chen ◽  
Yunhai Yi ◽  
...  

Blue tilapia (Oreochromis aureus) has been an economically important fish in Asian countries. It can grow and reproduce in both freshwater and brackish water conditions, and it is also considered a significant invasive species around the world. This species has been widely used as the hybridization parent(s) for tilapia breeding with a major aim to produce novel strains. However, available genomic resources are still limited for this important tilapia species. Here, we sequenced and assembled, for the first time, a draft genome for a seawater cultured blue tilapia (0.92 Gb), with 97.8% completeness and a scaffold N50 of 1.1 Mb, which suggests a relatively high quality of this genome assembly. We also predicted 23,117 protein-coding genes in the blue tilapia genome. Comparisons of predicted antimicrobial peptides between the blue tilapia and its close relative Nile tilapia proved that these immunological genes are highly similar with a genome-wide scattering distribution. As a valuable genetic resource, our blue tilapia genome assembly will benefit biomedical research and practical molecular breeding for high resistance to various diseases, which have been a critical problem in the aquaculture of tilapias.


Proceedings ◽  
2020 ◽  
Vol 76 (1) ◽  
pp. 10
Author(s):  
Azamat Totikov ◽  
Andrey Tomarovsky ◽  
Lorena Derezanin ◽  
Olga Dudchenko ◽  
Erez Lieberman-Aiden ◽  
...  

Genome assemblies are becoming increasingly important for understanding genetic diversity in threatened species. However, due to limited budgets in the area of conservation biology, genome assemblies, when available, tend to be highly fragmented with tens of thousands of scaffolds. The recent advent of high-throughput chromosome conformation capture (Hi-C) makes it possible to generate more contiguous assemblies containing scaffolds that are the length of entire chromosomes. Such assemblies greatly facilitate analyses and visualization of genome-wide features. We compared genetic diversity in seven threatened species that had both draft genome assemblies and newer chromosome-level assemblies available. Chromosome-level assemblies allowed better estimation of genetic diversity, localization, and, especially, visualization of low heterozygosity regions in the genomes.


2018 ◽  
Author(s):  
Jeramiah J. Smith ◽  
Nataliya Timoshevskaya ◽  
Vladimir A. Timoshevskiy ◽  
Melissa C. Keinath ◽  
Drew Hardy ◽  
...  

Abstract The axolotl (Ambystoma mexicanum) provides critical models for studying regeneration, evolution and development. However, its large genome (~32 gigabases) presents a formidable barrier to genetic analyses. Recent efforts have yielded genome assemblies consisting of thousands of unordered scaffolds that resolve gene structures, but do not yet permit large scale analyses of genome structure and function. We adapted an established mapping approach to leverage dense SNP typing information and for the first time assemble the axolotl genome into 14 chromosomes. Moreover, we used fluorescence in situ hybridization to verify the structure of these 14 scaffolds and assign each to its corresponding physical chromosome. This new assembly covers 27.3 gigabases and encompasses 94% of annotated gene models on chromosomal scaffolds. We show the assembly’s utility by resolving genome-wide orthologies between the axolotl and other vertebrates, identifying the footprints of historical introgression events that occurred during the development of axolotl genetic stocks, and precisely mapping several phenotypes including a large deletion underlying the cardiac mutant. This chromosome-scale assembly will greatly facilitate studies of the axolotl in biological research.


2021 ◽  
Author(s):  
Lauren Coombe ◽  
Janet X Li ◽  
Theodora Lo ◽  
Johnathan Wong ◽  
Vladimir Nikolic ◽  
...  

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.


Author(s):  
Lin Kang ◽  
Pawel Michalak ◽  
Eric Hallerman ◽  
Nancy D Moncrief

Abstract The eastern fox squirrel, Sciurus niger, exhibits marked geographic variation in size and coat color, is a model organism for studies of behavior and ecology, and a potential model for investigating physiological solutions to human porphyrias. We assembled a genome using Illumina HiSeq, PacBio SMRT, and Oxford Nanopore MinION sequencing platforms. Together, the sequencing data resulted in a draft genome of 2.99 Gb, containing 32,830 scaffolds with an average size of 90.9 Kb and N50 of 183.8 Kb. Genome completeness was estimated to be 93.78%. A total of 24,443 protein-encoding genes were predicted from the assembly and 23,079 (94.42%) were annotated. Repeat elements comprised an estimated 38.49% of the genome, with the majority being LINEs (13.92%), SINEs (6.04%), and LTR elements. The topology of the species tree reconstructed using maximum-likelihood phylogenetic analysis was congruent with those of previous studies. This genome assembly can prove useful for comparative studies of genome structure and function in this rapidly diversifying lineage of mammals, for studies of population genomics and adaptation, and for biomedical research. Predicted amino acid sequence alignments for genes affecting heme biosynthesis, color vision, and hibernation showed point mutations and indels that may affect protein function and ecological adaptation.


Author(s):  
Stephen R. Doyle ◽  
Alan Tracey ◽  
Roz Laing ◽  
Nancy Holroyd ◽  
David Bartley ◽  
...  

Abstract Background: Haemonchus contortus is a globally distributed and economically important gastrointestinal pathogen of small ruminants, and has become the key nematode model for studying anthelmintic resistance and other parasite-specific traits among a wider group of parasites including major human pathogens. Two draft genome assemblies for H. contortus were reported in 2013, however, both were highly fragmented, incomplete, and differed from one another in important respects. While the introduction of long-read sequencing has significantly increased the rate of production and contiguity of de novo genome assemblies broadly, achieving high quality genome assemblies for small, genetically diverse, outcrossing eukaryotic organisms such as H. contortus remains a significant challenge. Results: Here, we report using PacBio long read and OpGen and 10X Genomics long-molecule methods to generate a highly contiguous 283.4 Mbp chromosome-scale genome assembly including a resolved sex chromosome. We show a remarkable pattern of almost complete conservation of chromosome content (synteny) with Caenorhabditis elegans, but almost no conservation of gene order. Long-read transcriptome sequence data has allowed us to define coordinated transcriptional regulation throughout the life cycle of the parasite, and refine our understanding of cis- and trans-splicing relative to that observed in C. elegans. Finally, we use this assembly to give a comprehensive picture of chromosome-wide genetic diversity both within a single isolate and globally. Conclusions: The H. contortus MHco3(ISE).N1 genome assembly presented here represents the most contiguous and resolved nematode assembly outside of the Caenorhabditis genus to date, together with one of the highest-quality set of predicted gene features. 
These data provide a high-quality comparison for understanding the evolution and genomics of Caenorhabditis and other nematodes, and extend the experimental tractability of this model parasitic nematode in understanding pathogen biology, drug discovery and vaccine development, and important adaptive traits such as drug resistance.


2015 ◽  
Author(s):  
John Davey ◽  
Mathieu Chouteau ◽  
Sarah L. Barker ◽  
Luana Maroja ◽  
Simon W. Baxter ◽  
...  

The Heliconius butterflies are a widely studied adaptive radiation of 46 species spread across Central and South America, several of which are known to hybridise in the wild. Here, we present a substantially improved assembly of the Heliconius melpomene genome, developed using novel methods that should be applicable to improving other genome assemblies produced using short read sequencing. Firstly, we whole genome sequenced a pedigree to produce a linkage map incorporating 99% of the genome. Secondly, we incorporated haplotype scaffolds extensively to produce a more complete haploid version of the draft genome. Thirdly, we incorporated ~20x coverage of Pacific Biosciences sequencing and scaffolded the haploid genome using an assembly of this long read sequence. These improvements result in a genome of 795 scaffolds, 275 Mb in length, with an L50 of 2.1 Mb, an N50 of 34 and with 99% of the genome placed and 84% anchored on chromosomes. We use the new genome assembly to confirm that the Heliconius genome underwent 10 chromosome fusions since the split with its sister genus Eueides, over a period of about 6 million years.

