Sequencing smart: De novo sequencing and assembly approaches for a non-model mammal

Abstract Background Whilst much sequencing effort has focused on key mammalian model organisms such as mouse and human, little is known about the relationship between genome sequencing techniques for non-model mammals and genome assembly quality. This is especially relevant to non-model mammals, where the samples to be sequenced are often degraded and of low quality. A key aspect when planning a genome project is the choice of sequencing data to generate. This decision is driven by several factors, including the biological questions being asked, the quality of DNA available, and the availability of funds. Cutting-edge sequencing technologies now make it possible to achieve highly contiguous, chromosome-level genome assemblies, but rely on high-quality, high-molecular-weight DNA. However, funding is often insufficient for many independent research groups to use these techniques. Here we use data from a range of genomic technologies, generated from a roadkill European polecat (Mustela putorius), to assess various assembly techniques on this low-quality sample. We evaluated different approaches for de novo assemblies and discuss their value in relation to biological analyses. Results Generally, assemblies containing more data types achieved better scores in our ranking system. However, when accounting for misassemblies, this was not always the case for Bionano and low-coverage 10x Genomics (for scaffolding only). We also find that the extra cost associated with combining multiple data types is not necessarily associated with better genome assemblies. Conclusions The high degree of variability between the de novo assembly methods (assessed from the 7 key metrics) highlights the importance of carefully devising the sequencing strategy to be able to carry out the desired analysis. Adding more data to genome assemblies does not always result in better assemblies, so it is important to understand the nuances of genomic data integration explained here in order to obtain value for money when sequencing genomes.

Sequencing smart: De novo sequencing and assembly approaches for non-model mammals

10.1101/723890 ◽

2019 ◽

Cited By ~ 1

Author(s):

Graham J Etherington ◽

Darren Heavens ◽

David Baker ◽

Ashleigh Lister ◽

Rose McNelly ◽

...

Keyword(s):

De Novo ◽

Genome Project ◽

Model Organisms ◽

Value For Money ◽

Sequencing Data ◽

Data Types ◽

High Molecular Weight DNA ◽

Assembly Method ◽

A Genome ◽

Genome Assemblies

Abstract Background Whilst much sequencing effort has focused on key mammalian model organisms such as mouse and human, little is known about the correlation between genome sequencing techniques for non-model mammals and genome assembly quality. This is especially relevant to non-model mammals, where the samples to be sequenced are often degraded and of low quality. A key aspect when planning a genome project is the choice of sequencing data to generate. This decision is driven by several factors, including the biological questions being asked, the quality of DNA available, and the availability of funds. Cutting-edge sequencing technologies now make it possible to achieve highly contiguous, chromosome-level genome assemblies, but rely on good-quality, high-molecular-weight DNA. The funds to generate and combine these data are often only available within large consortiums and sequencing initiatives, and the data are often not affordable for many independent research groups. For many researchers, value-for-money is a key factor when considering the generation of genomic sequencing data. Here we use data from a range of genomic technologies, generated from a roadkill European Polecat (Mustela putorius), to assess various assembly techniques on this low-quality sample. We evaluated different approaches for de novo assemblies and discuss their value in relation to biological analyses. Results Generally, assemblies containing more data types achieved better scores in our ranking system. However, when accounting for misassemblies, this was not always the case for Bionano and low-coverage 10x Genomics (for scaffolding only). We also find that the extra cost associated with combining multiple data types is not necessarily associated with better genome assemblies. Conclusions The high degree of variability between the de novo assembly methods (assessed from the seven key metrics) highlights the importance of carefully devising the sequencing strategy to be able to carry out the desired analysis. Adding more data to genome assemblies does not always result in better assemblies, so it is important to understand the nuances of genomic data integration explained here in order to obtain value-for-money when sequencing genomes.

Comparison of long read methods for sequencing and assembly of a plant genome

10.1101/2020.03.16.992933 ◽

2020 ◽

Cited By ~ 1

Author(s):

Valentine Murigneux ◽

Subash Kumar Rai ◽

Agnelo Furtado ◽

Timothy J.C. Bruxner ◽

Wei Tian ◽

...

Keyword(s):

De Novo ◽

Cost Effective ◽

Genome Project ◽

Plant Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

The Cost ◽

Genome Assemblies

Abstract Sequencing technologies have advanced to the point where it is possible to generate high-accuracy, haplotype-resolved, chromosome-scale assemblies. Several long-read sequencing technologies are available on the market, and a growing number of algorithms have been developed in recent years to assemble the reads generated by those technologies. When starting a new genome project, it is therefore challenging to select the most cost-effective sequencing technology as well as the most appropriate software for assembly and polishing. For this reason, it is important to benchmark different approaches applied to the same sample. Here, we report a comparison of three long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. We have generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION) and BGI (single-tube Long Fragment Read) technologies for the same sample. Several assemblers were benchmarked in the assembly of PacBio and Nanopore reads. Results obtained from combining long-read technologies or short-read and long-read technologies are also presented. The assemblies were compared for contiguity, accuracy and completeness as well as sequencing costs and DNA material requirements. Overall, the three long-read technologies produced highly contiguous and complete genome assemblies of Macadamia jansenii. At the time of sequencing, the cost associated with each method was significantly different, but continuous improvements in technologies have resulted in greater accuracy, increased throughput and reduced costs. We propose updating this comparison regularly with reports on significant iterations of the sequencing technologies.

Improvement of the threespine stickleback (Gasterosteus aculeatus) genome using a Hi-C-based Proximity-Guided Assembly method

10.1101/068528 ◽

2016 ◽

Cited By ~ 2

Author(s):

Catherine L. Peichel ◽

Shawn T. Sullivan ◽

Ivan Liachko ◽

Michael A. White

Keyword(s):

Genome Assembly ◽

Gasterosteus Aculeatus ◽

De Novo ◽

Evolutionary Genetics ◽

Threespine Stickleback ◽

Linkage Groups ◽

High Molecular Weight DNA ◽

Assembly Method ◽

Guided Assembly ◽

Genome Assemblies

Abstract Scaffolding genomes into complete chromosome assemblies remains challenging even with the rapidly increasing sequence coverage generated by current next-generation sequence technologies. Even with scaffolding information, many genome assemblies remain incomplete. The genome of the threespine stickleback (Gasterosteus aculeatus), a fish model system in evolutionary genetics and genomics, is not completely assembled despite scaffolding with high-density linkage maps. Here, we first test the ability of a Hi-C-based proximity-guided assembly to perform a de novo genome assembly from relatively short contigs. Using Hi-C-based proximity-guided assembly, we generated complete chromosome assemblies from 50 kb contigs. We found that 98.99% of contigs were correctly assigned to linkage groups, with ordering nearly identical to the previous genome assembly. Using available BAC end sequences, we provide evidence that some of the few discrepancies between the Hi-C assembly and the existing assembly are due to structural variation between the populations used for the two assemblies or to errors in the existing assembly. This Hi-C assembly also allowed us to improve the existing assembly, assigning over 60% (13.35 Mb) of the previously unassigned (∼21.7 Mb) contigs to linkage groups. Together, our results highlight the potential of the Hi-C-based proximity-guided assembly method to be used in combination with short-read data to perform relatively inexpensive de novo genome assemblies. This approach will be particularly useful in organisms in which it is difficult to perform linkage mapping or to obtain the high-molecular-weight DNA required for other scaffolding methods.

MobiSeq: De Novo SNP discovery in model and non-model species through sequencing the flanking region of transposable elements

10.1101/349290 ◽

2018 ◽

Author(s):

Alba Rey-Iglesia ◽

Shyam Gopalakrishan ◽

Christian Carøe ◽

David E. Alquezar-Planas ◽

Anne Ahlmann Nielsen ◽

...

Keyword(s):

Transposable Elements ◽

DNA Sequences ◽

Population Genomics ◽

De Novo ◽

Model Organisms ◽

SNP Discovery ◽

High Molecular Weight DNA ◽

A Genome ◽

Wide Range ◽

Flanking Region

Abstract In recent years, the availability of reduced representation library (RRL) methods has catalysed an expansion of genome-scale studies to characterize both model and non-model organisms. Most of these methods rely on the use of restriction enzymes to obtain DNA sequences at a genome-wide level. These approaches have been widely used to sequence thousands of markers across individuals for many organisms at a reasonable cost, revolutionizing the field of population genomics. However, there are still some limitations associated with these methods, in particular the high-molecular-weight DNA required as starting material, the reduced number of common loci among investigated samples, and the short length of the sequenced site-associated DNA. Here, we present MobiSeq, an RRL protocol exploiting simple laboratory techniques that generates genomic data based on PCR targeted enrichment of transposable elements and the sequencing of the associated flanking region. We validate its performance across 103 DNA extracts derived from three mammalian species: grey wolf (Canis lupus), red deer complex (Cervus sp.), and brown rat (Rattus norvegicus). MobiSeq enables the sequencing of hundreds of thousands of loci across the genome, and performs SNP discovery with relatively low rates of clonality. Given the ease and flexibility of the MobiSeq protocol, the method has the potential to be implemented for marker discovery and population genomics across a wide range of organisms, enabling the exploration of diverse evolutionary and conservation questions.

Twelve quick steps for genome assembly and annotation in the classroom

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008325 ◽

2020 ◽

Vol 16 (11) ◽

pp. e1008325

Author(s):

Hyungtaek Jung ◽

Tomer Ventura ◽

J. Sook Chung ◽

Woo-Jin Kim ◽

Bo-Hye Nam ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Repetitive Sequences ◽

Genome Project ◽

Model Organisms ◽

High Quality ◽

Sequencing Technologies ◽

A Genome ◽

Sequencing Platforms ◽

High Quality Genome

Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.

Comparison of long-read methods for sequencing and assembly of a plant genome

GigaScience ◽

10.1093/gigascience/giaa146 ◽

2020 ◽

Vol 9 (12) ◽

Author(s):

Valentine Murigneux ◽

Subash Kumar Rai ◽

Agnelo Furtado ◽

Timothy J C Bruxner ◽

Wei Tian ◽

...

Keyword(s):

De Novo ◽

Cost Effective ◽

Genome Project ◽

Plant Genome ◽

Sequencing Data ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

The Cost

Abstract Background Sequencing technologies have advanced to the point where it is possible to generate high-accuracy, haplotype-resolved, chromosome-scale assemblies. Several long-read sequencing technologies are available, and a growing number of algorithms have been developed to assemble the reads generated by those technologies. When starting a new genome project, it is therefore challenging to select the most cost-effective sequencing technology, as well as the most appropriate software for assembly and polishing. It is thus important to benchmark different approaches applied to the same sample. Results Here, we report a comparison of 3 long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. We have generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies for the same sample. Several assemblers were benchmarked in the assembly of Pacific Biosciences and Nanopore reads. Results obtained from combining long-read technologies or short-read and long-read technologies are also presented. The assemblies were compared for contiguity, base accuracy, and completeness, as well as sequencing costs and DNA material requirements. Conclusions The 3 long-read technologies produced highly contiguous and complete genome assemblies of M. jansenii. At the time of sequencing, the cost associated with each method was significantly different, but continuous improvements in technologies have resulted in greater accuracy, increased throughput, and reduced costs. We propose updating this comparison regularly with reports on significant iterations of the sequencing technologies.
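
The abstract above compares assemblies for contiguity without naming a statistic. As a minimal sketch, assuming contiguity is summarised by N50 (a common choice, but an assumption here and not necessarily the measure the authors used), the calculation could look like this. The function name `n50` is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    Returns 0 for an empty assembly.
    """
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to half the total.
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy assembly (illustrative lengths only, not data from the paper).
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 for the same total length indicates fewer, longer contigs; it says nothing about base accuracy or completeness, which the abstract lists as separate criteria.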

An International Campaign for Agricultural and Livestock Genomics (CALG)

Asia-Pacific Biotech News ◽

10.1142/s0219030302001970 ◽

2002 ◽

Vol 06 (24) ◽

pp. 958-965

Author(s):

Jun Yu ◽

Jian Wang ◽

Huanming Yang

Keyword(s):

Large Scale ◽

Cost Effective ◽

Model Organisms ◽

Environmental Biology ◽

cDNA Sequences ◽

Governmental Agencies ◽

Technology Innovations ◽

A Genome ◽

Starting Point ◽

The Cost

The time has come for a coordinated international effort to sequence agricultural and livestock genomes. While the human genome and the genomes of many model organisms (related to human health and basic biological interests) have been sequenced or have entered sequencing pipelines, agronomically important crop and livestock genomes have not been given high enough priority. Although we face many challenges in policy-making, grant funding, regional task emphasis, research community consensus and technology innovations, many initiatives are being announced and formulated based on the cost-effective, large-scale sequencing procedure known as whole genome shotgun (WGS) sequencing, which produces draft sequences covering 95 percent to 99 percent of a genome. Genes identified from such draft sequences, coupled with other resources such as molecular markers, large-insert clones and cDNA sequences, provide ample information and tools to further our knowledge of agricultural and environmental biology in the genome era, which is just entering its period of acceleration. If the campaign succeeds, molecular biologists, geneticists and field biologists from all countries, rich or poor, would be brought to the same starting point and could expect another astronomical increase in basic genomic information, ready to be converted effectively into knowledge that will ultimately change our lives and environment for the better. We call upon national and international governmental agencies and organizations, as well as research foundations, to support this unprecedented movement.

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events than the competing method, with comparable or higher accuracy. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.

A practical guide to build de-novo assemblies for single tissues of non-model organisms: the example of a Neotropical frog

PeerJ ◽

10.7717/peerj.3702 ◽

2017 ◽

Vol 5 ◽

pp. e3702 ◽

Cited By ~ 5

Author(s):

Santiago Montero-Mendieta ◽

Manfred Grabherr ◽

Henrik Lantz ◽

Ignacio De la Riva ◽

Jennifer A. Leonard ◽

...

Keyword(s):

Defense Mechanisms ◽

De Novo ◽

Transcriptome Assembly ◽

Cost Effective ◽

Model Organisms ◽

RNA Seq ◽

Assembly Pipeline ◽

Wide Variability ◽

History Of ◽

Inexperienced User

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, such as most amphibians, WGS is still excessively challenging, and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome, and the transcriptome must be assembled de-novo. We used RNA-seq to obtain the transcriptomic profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.

HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C

Bioinformatics ◽

10.1093/bioinformatics/btaa288 ◽

2020 ◽

Vol 36 (13) ◽

pp. 3966-3974

Author(s):

Ryo Nakabayashi ◽

Shinichi Morishita

Keyword(s):

Viterbi Algorithm ◽

De Novo ◽

Gene Prediction ◽

Effective Means ◽

Cost Effective ◽

Synteny Block ◽

Chromosome Length ◽

Model Organisms ◽

Contact Frequency ◽

Reference Quality

Abstract Motivation De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, it is extremely time-consuming to build genome markers for ordering assembled contigs along chromosomes; thus, they are only available for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C could be a powerful and cost-effective means to output chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between a pair of loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics. Results To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs. Availability and implementation HiC-Hiker is freely available at: https://github.com/ryought/hic_hiker.
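
To illustrate the general idea of inferring contig orientations with the Viterbi algorithm, here is a toy sketch. It is not the HiC-Hiker implementation. It rests on simplifying assumptions that do not come from the abstract: contig order is already known, only contacts between adjacent contigs are scored (HiC-Hiker is described as also using distal contigs), and the probability of a contact at genomic distance d is taken to be proportional to 1/(d + 1). The function names `pair_scores` and `viterbi_orientations` are hypothetical.

```python
import math


def pair_scores(len_left, len_right, contacts):
    """Log-likelihood table s[a][b] for one pair of adjacent contigs.

    a and b are the orientations of the left and right contig
    (0 = forward, 1 = reversed). Each contact is (x, y): a position
    on the left contig and one on the right, in contig coordinates.
    Assumed decay model: P(contact at distance d) ~ 1 / (d + 1).
    """
    s = [[0.0, 0.0], [0.0, 0.0]]
    for a in (0, 1):
        for b in (0, 1):
            for x, y in contacts:
                # Distance from the contact to the junction on each side.
                d_left = len_left - x if a == 0 else x
                d_right = y if b == 0 else len_right - y
                s[a][b] -= math.log(d_left + d_right + 1)
    return s


def viterbi_orientations(lengths, contacts_between):
    """Most probable orientation (0/1) per contig under the toy model.

    lengths[i] is the length of contig i; contacts_between[i] lists
    the contacts between contig i and contig i + 1.
    """
    best = [0.0, 0.0]  # best log-score of a path ending in each state
    back = []          # back-pointers, one [for b=0, for b=1] per step
    for i in range(len(lengths) - 1):
        s = pair_scores(lengths[i], lengths[i + 1], contacts_between[i])
        new_best, pointers = [0.0, 0.0], [0, 0]
        for b in (0, 1):
            cands = [best[a] + s[a][b] for a in (0, 1)]
            pointers[b] = 0 if cands[0] >= cands[1] else 1
            new_best[b] = cands[pointers[b]]
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy input: three 100 bp contigs where the middle one is reversed.
    lengths = [100, 100, 100]
    contacts = [[(95, 95), (90, 98)], [(5, 3), (8, 10)]]
    print(viterbi_orientations(lengths, contacts))
```

Because each step keeps only the best score per orientation, the search is linear in the number of contigs rather than exponential in the number of possible orientation combinations.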
