Improved contiguity of the threespine stickleback genome using long-read sequencing

Author(s):  
Shivangi Nath ◽  
Daniel E. Shaw ◽  
Michael A. White

Abstract While the cost and time for assembling a genome have drastically reduced, it remains a challenge to assemble a highly contiguous genome. These challenges are rapidly being overcome by the integration of long-read sequencing technologies. Here, we use long sequencing reads to improve the contiguity of the threespine stickleback fish (Gasterosteus aculeatus) genome, a prominent genetic model species. Using Pacific Biosciences sequencing, we were able to fill over 76% of the gaps in the genome, improving contiguity over five-fold. Our approach was highly accurate, validated by 10X Genomics long-distance linked-reads. In addition to closing a majority of gaps, we were able to assemble segments of telomeres and centromeres throughout the genome. This highlights the power of using long sequencing reads to assemble highly repetitive and difficult-to-assemble regions of genomes. This latest genome build has been released through a newly designed community genome browser that aims to consolidate the growing number of genomics datasets available for the threespine stickleback fish.

2021 ◽  
Vol 11 (2) ◽  
Author(s):  
Shivangi Nath ◽  
Daniel E Shaw ◽  
Michael A White

Abstract While the cost and time for assembling a genome have drastically decreased, it remains a challenge to assemble a highly contiguous genome. These challenges are rapidly being overcome by the integration of long-read sequencing technologies. Here, we use long-read sequencing to improve the contiguity of the threespine stickleback fish (Gasterosteus aculeatus) genome, a prominent genetic model species. Using Pacific Biosciences sequencing, we assembled a highly contiguous genome of a freshwater fish from Paxton Lake. Using contigs from this genome, we were able to fill over 76.7% of the gaps in the existing reference genome assembly, improving contiguity over fivefold. Our gap-filling approach was highly accurate, validated by 10X Genomics long-distance linked-reads. In addition to closing a majority of gaps, we were able to assemble segments of telomeres and centromeres throughout the genome. This highlights the power of using long sequencing reads to assemble highly repetitive and difficult-to-assemble regions of genomes. This latest genome build has been released through a newly designed community genome browser that aims to consolidate the growing number of genomics datasets available for the threespine stickleback fish.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Edwin A. Solares ◽  
Yuan Tao ◽  
Anthony D. Long ◽  
Brandon S. Gaut

Abstract Background Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding. Results Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the thorny skate (Amblyraja radiata; 2650 Mb). Conclusions HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. For the mosquito, the share of the genome captured in the largest three scaffolds, corresponding to the number of chromosomes, rose from 61% to 86%, and the improvement was even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.
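
Contig N50, the metric quoted in the abstract above, is conventionally the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal Python sketch of that definition, assuming the contig lengths are already available as a list of integers. The function name `contig_n50` and the toy lengths are illustrative assumptions and are not taken from HapSolo.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (illustrative helper)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    # Walk from the longest contig downwards, accumulating length.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half the
        # assembly length defines N50.
        if running >= half_total:
            return length


# Toy example (made-up lengths): dropping three short contigs, as a
# stand-in for secondary contigs, shrinks the total and raises N50.
unreduced = [50, 40, 30, 30, 20, 20, 10]
reduced = [50, 40, 30, 20]
print(contig_n50(unreduced), contig_n50(reduced))
```

With these toy numbers the N50 rises from 30 to 40 once the short contigs are removed, which is the general direction of the effect the abstract reports, though the real gains depend on which contigs are identified as secondary.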


2019 ◽  
Vol 6 (2) ◽  
pp. 180608 ◽  
Author(s):  
Marvin Choquet ◽  
Irina Smolina ◽  
Anusha K. S. Dhanasiri ◽  
Leocadio Blanco-Bercial ◽  
Martina Kopp ◽  
...  

Advances in next-generation sequencing technologies and the development of genome-reduced representation protocols have opened the way to genome-wide population studies in non-model species. However, species with large genomes remain challenging, hampering the development of genomic resources for a number of taxa including marine arthropods. Here, we developed a genome-reduced representation method for the ecologically important marine copepod Calanus finmarchicus (haploid genome size of 6.34 Gbp). We optimized a capture enrichment-based protocol based on 2656 single-copy genes, yielding a total of 154 087 high-quality SNPs in C. finmarchicus including 62 372 in common among the three locations tested. The set of capture probes was also successfully applied to the congeneric C. glacialis. Preliminary analyses of these markers revealed similar levels of genetic diversity between the two Calanus species, while populations of C. glacialis showed stronger genetic structure compared to C. finmarchicus. Using this powerful set of markers, we did not detect any evidence of hybridization between C. finmarchicus and C. glacialis. Finally, we propose a shortened version of our protocol, offering a promising solution for population genomics studies in non-model species with large genomes.


Author(s):  
Clément Schneider ◽  
Christian Woehle ◽  
Carola Greve ◽  
Cyrille A. D’Haese ◽  
Magnus Wolf ◽  
...  

ABSTRACT Genome sequencing of all known eukaryotes on Earth promises unprecedented advances in evolutionary sciences, ecology, systematics and in biodiversity-related applied fields such as environmental management and natural product research. Advances in DNA sequencing technologies make genome sequencing feasible for many non-genetic model species. However, genome sequencing today relies on large quantities of high-quality, high molecular weight (HMW) DNA, which is mostly obtained from fresh tissues. This is problematic for biodiversity genomics of Metazoa, as most species are small and yield minute amounts of DNA. Furthermore, bringing living specimens to the lab bench is not realistic for the majority of species. Here we overcome those difficulties by sequencing two species of springtails (Collembola) from single specimens preserved in ethanol. We used a newly developed, genome-wide amplification-based protocol to generate PacBio libraries for HiFi long-read sequencing. The assembled genomes were highly continuous. They can be considered complete as we recovered over 95% of BUSCOs. Genome-wide amplification does not seem to bias genome recovery. The presence of almost complete copies of the mitochondrial genome in the nuclear genome was a pitfall for automatic assemblers. The genomes fit well into an existing phylogeny of springtails. A neotype is designated for one of the species, blending genome sequencing and creation of taxonomic references. Our study shows that it is possible to obtain high-quality genomes from small, field-preserved sub-millimeter metazoans, thus making their vast diversity accessible to the fields of genomics.


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Chih-Ming Hung ◽  
Ai-Yun Yu ◽  
Yu-Ting Lai ◽  
Pei-Jen L. Shaner

Abstract Microsatellites have a wide range of applications, from behavioral biology and evolution to agriculture-based breeding programs. The recent progress in next-generation sequencing technologies and the rapidly increasing number of published genomes may greatly enhance the current applications of microsatellites by turning them from anonymous to informative markers. Here we developed an approach to anchor microsatellite markers of any target species in a genome of a related model species, through which the genomic locations of the markers, along with any functional genes potentially linked to them, can be revealed. We mapped the shotgun sequence reads of a non-model rodent species, Apodemus semotus, against the genome of a model species, Mus musculus, and presented 24 polymorphic microsatellite markers with detailed background information for A. semotus in this study. The developed markers can be used in other rodent species, especially those that are closely related to A. semotus or M. musculus. Compared to the traditional approaches based on DNA cloning, our approach is likely to yield more loci for the same cost. This study is a timely demonstration of how a research team can efficiently generate informative (neutral or function-associated) microsatellite markers for their study species and unique biological questions.


2017 ◽  
Author(s):  
David Porubsky ◽  
Shilpa Garg ◽  
Ashley D. Sanders ◽  
Jan O. Korbel ◽  
Victor Guryev ◽  
...  

ABSTRACT The diploid nature of the genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. Many important biological phenomena such as compound heterozygosity and epistatic effects between enhancers and target genes, however, can only be studied when haplotype-resolved genomes are available. This lack of haplotype-level analyses can be explained by a dearth of methods to produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. Our experiments provide comprehensive guidance on favorable combinations of Strand-seq libraries and sequencing coverages to obtain complete and genome-wide haplotypes of a single individual genome (NA12878) at manageable costs. We were able to reliably assign >95% of alleles to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different sequencing technologies represents an attractive solution to chart the unique genetic variation of diploid genomes.


2021 ◽  
Author(s):  
Louise Aigrain

Since the publication of the first draft of the human genome 20 years ago, several novel sequencing technologies have emerged. Whilst some drive the cost of DNA sequencing down, others address the difficult parts of the genome that have so far remained inaccessible. But the next-generation sequencing (NGS) landscape is a fast-changing environment, and one can easily get lost between second- and third-generation sequencers, or the pros and cons of short- versus long-read technologies. In this beginner’s guide to NGS, we will review the main NGS technologies available in 2021. We will compare sample preparation protocols and sequencing methods, highlighting the requirements and advantages of each technology.


GigaScience ◽  
2020 ◽  
Vol 9 (12) ◽  
Author(s):  
Valentine Murigneux ◽  
Subash Kumar Rai ◽  
Agnelo Furtado ◽  
Timothy J C Bruxner ◽  
Wei Tian ◽  
...  

Abstract Background Sequencing technologies have advanced to the point where it is possible to generate high-accuracy, haplotype-resolved, chromosome-scale assemblies. Several long-read sequencing technologies are available, and a growing number of algorithms have been developed to assemble the reads generated by those technologies. When starting a new genome project, it is therefore challenging to select the most cost-effective sequencing technology, as well as the most appropriate software for assembly and polishing. It is thus important to benchmark different approaches applied to the same sample. Results Here, we report a comparison of 3 long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. We have generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies for the same sample. Several assemblers were benchmarked in the assembly of Pacific Biosciences and Nanopore reads. Results obtained from combining long-read technologies or short-read and long-read technologies are also presented. The assemblies were compared for contiguity, base accuracy, and completeness, as well as sequencing costs and DNA material requirements. Conclusions The 3 long-read technologies produced highly contiguous and complete genome assemblies of M. jansenii. At the time of sequencing, the cost associated with each method was significantly different, but continuous improvements in technologies have resulted in greater accuracy, increased throughput, and reduced costs. We propose updating this comparison regularly with reports on significant iterations of the sequencing technologies.


2012 ◽  
Vol 90 (12) ◽  
pp. 1386-1393 ◽  
Author(s):  
Kyrre Grøtan ◽  
Kjartan Østbye ◽  
Annette Taugbøl ◽  
L. Asbjørn Vøllestad

Marine threespine stickleback (Gasterosteus aculeatus L., 1758) have repeatedly colonized Holarctic freshwater environments after the retreat of the Pleistocene glaciers, and based on their ability to move rapidly between salinities have apparently retained a robust osmoregulatory apparatus that can cope with both short- and long-term exposure to non-native salinity environments. Standard metabolic rate (SMR), measured as oxygen consumption at rest, can be used as an indicator of the cost of osmoregulation when fish are exposed to new environmental conditions. Following freshwater colonization, reduction in the number of lateral plates, an antipredator defence structure, is common. Completely plated fish dominate in the sea, low-plated fish dominate in fresh water, and partially plated fish often dominate in brackish water environments. In a laboratory experiment, we estimated SMR in locally adapted populations from salt, brackish, and freshwater environments at three different salinities (0, 15, and 30 practical salinity units (PSU)). In addition, we tested for correlations between SMR and lateral plate number and lateral plate genotype at the Ectodysplasin locus for stickleback originating from the brackish water population. Contrary to our expectations, no differences were found in SMR between any of the experimental groups in our experiment. Apparently, the threespine stickleback is able to move among salinity environments without large short-term metabolic costs, irrespective of their environment of origin.


GigaScience ◽  
2021 ◽  
Vol 10 (5) ◽  
Author(s):  
Clément Schneider ◽  
Christian Woehle ◽  
Carola Greve ◽  
Cyrille A D'Haese ◽  
Magnus Wolf ◽  
...  

Abstract Background Genome sequencing of all known eukaryotes on Earth promises unprecedented advances in biological sciences and in biodiversity-related applied fields such as environmental management and natural product research. Advances in long-read DNA sequencing make it feasible to generate high-quality genomes for many non–genetic model species. However, long-read sequencing today relies on sizable quantities of high-quality, high molecular weight DNA, which is mostly obtained from fresh tissues. This is a challenge for biodiversity genomics of most metazoan species, which are tiny and need to be preserved immediately after collection. Here we present de novo genomes of 2 species of submillimeter Collembola. For each, we prepared the sequencing library from high molecular weight DNA extracted from a single specimen and using a novel ultra-low input protocol from Pacific Biosciences. This protocol requires a DNA input of only 5 ng, permitted by a whole-genome amplification step. Results The 2 assembled genomes have N50 values >5.5 and 8.5 Mb, respectively, and both contain ∼96% of BUSCO genes. Thus, they are highly contiguous and complete. The genomes are supported by an integrative taxonomy approach including placement in a genome-based phylogeny of Collembola and designation of a neotype for 1 of the species. Higher heterozygosity values are recorded in the more mobile species. Both species are devoid of the biosynthetic pathway for β-lactam antibiotics known in several Collembola, confirming the tight correlation of antibiotic synthesis with the species' way of life. Conclusions It is now possible to generate high-quality genomes from single specimens of minute, field-preserved metazoans, exceeding the minimum contig N50 (1 Mb) required by the Earth BioGenome Project.

