Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment

Abstract: Although genome sequencing is becoming cheaper and faster, reducing the quantity of data by sequencing only part of the genome lowers both sequencing costs and computational burdens. One popular genome-reduction approach is restriction site associated DNA sequencing, or RADseq. RADseq was initially designed for studying genetic variation across genomes, usually at the population level, and it has also proved suitable for interspecific phylogeny reconstruction. RADseq data pose challenges for standard phylogenomic methods, however, due to incomplete coverage of the genome and large amounts of missing data. Alignment-free methods are both efficient and accurate for phylogenetic reconstruction with whole genomes and are especially practical for non-model organisms; nonetheless, alignment-free methods have only been applied to whole genome sequences. Here, we test AAF, an assembly- and alignment-free method developed for full genomes, in application to RADseq data and propose two procedures for reads selection to remove missing data. We validate these methods using both simulations and a real dataset. Reads selection improved the accuracy of phylogenetic reconstruction in every simulated scenario and in the real dataset, making AAF comparable to or better than alignment-based methods with a much lower computational burden. We also investigated the sources of missing data in RADseq and their effects on phylogeny reconstruction using AAF. The AAF pipeline modified for RADseq data, phyloRAD, is available on GitHub (https://github.com/fanhuan/phyloRAD).
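The abstract does not spell out how an alignment-free distance is computed, so the following is only a minimal sketch of the general idea: two samples are compared by the fraction of k-mers their raw reads share, with no assembly or alignment step. The function names and the exact distance formula (a log-transformed shared fraction scaled by k) are assumptions for illustration and should not be read as the phyloRAD or AAF implementation, and the reads selection procedures are not shown.

```python
import math


def kmers(reads, k):
    """Collect the set of distinct k-mers found in a list of reads."""
    found = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            found.add(read[i:i + k])
    return found


def kmer_distance(reads_a, reads_b, k):
    """Distance from the fraction of shared k-mers (assumed formula).

    The shared count is divided by the size of the smaller k-mer set,
    log-transformed, and scaled by k. Returns infinity when the two
    samples share no k-mers.
    """
    set_a, set_b = kmers(reads_a, k), kmers(reads_b, k)
    shared = len(set_a & set_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0 or shared == 0:
        return math.inf
    return math.log(smaller / shared) / k


if __name__ == "__main__":
    sample_a = ["ACGTACGT"]
    sample_b = ["ACGTTCGT"]
    print(kmer_distance(sample_a, sample_b, k=4))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method; that step is omitted here.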

Positive selection signatures in Anqing six‐end‐white pig population based on reduced‐representation genome sequencing data

Animal Genetics ◽

10.1111/age.13034 ◽

2021 ◽

Author(s):

L. Guo ◽

H. Sun ◽

Q. Zhao ◽

Z. Xu ◽

Z. Zhang ◽

...

Keyword(s):

Positive Selection ◽

Genome Sequencing ◽

Population Based ◽

Sequencing Data ◽

Selection Signatures ◽

Reduced Representation

Comparing divergence landscapes from reduced-representation and whole-genome re-sequencing in the yellow-rumped warbler (Setophaga coronata) species complex

10.1101/2021.03.23.436663 ◽

2021 ◽

Author(s):

Stephanie Szarmach ◽

Alan Brelsford ◽

Christopher C Witt ◽

David Toews

Keyword(s):

Species Complex ◽

Mitochondrial Gene ◽

Model Organisms ◽

Whole Genome ◽

Sequencing Data ◽

Reduced Representation ◽

Mitochondrial Haplotypes ◽

Trade Offs ◽

Sequencing Method ◽

The Cost

Researchers seeking to generate genomic data for non-model organisms are faced with a number of trade-offs when deciding which method to use. The selection of reduced representation approaches versus whole genome re-sequencing will ultimately affect the marker density, sequencing depth, and the number of individuals that can be multiplexed. These factors can affect researchers' ability to accurately characterize certain genomic features, such as landscapes of divergence, that is, how FST varies across the genome. To provide insight into the effect of sequencing method on the estimation of divergence landscapes, we applied an identical bioinformatic pipeline to three generations of sequencing data (GBS, ddRAD, and WGS) produced for the same system, the yellow-rumped warbler species complex. We compare divergence landscapes generated using each method for the myrtle warbler (Setophaga coronata coronata) and the Audubon's warbler (S. c. auduboni), and for Audubon's warblers with deeply divergent mtDNA resulting from mitochondrial introgression. We found that most high-FST peaks were not detected in the ddRAD dataset, and that while both GBS and WGS were able to identify the presence of large peaks, WGS was superior at a finer scale. Comparing Audubon's warblers with divergent mitochondrial haplotypes, only WGS allowed us to identify small (10-20 kb) regions of elevated differentiation, one of which contained the nuclear-encoded mitochondrial gene NDUFAF3. We calculated the cost per base pair for each method and found it was comparable between GBS and WGS, but significantly higher for ddRAD. These comparisons highlight the advantages of WGS over reduced representation methods when characterizing landscapes of divergence.
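A divergence landscape of the kind described above can be sketched as FST summarized in fixed windows along a chromosome. The abstract does not state which estimator or window size the authors used, so the example below assumes Hudson's estimator in its ratio-of-averages form and a hypothetical input of per-SNP allele frequencies; the function name and input layout are illustrative only.

```python
from collections import defaultdict


def hudson_fst_windows(snps, window_size):
    """Windowed FST using Hudson's estimator (ratio of averages).

    snps: iterable of (position, p1, n1, p2, n2), where p1 and p2 are
    allele frequencies in the two populations and n1 and n2 are the
    numbers of sampled chromosomes (each must be at least 2).
    Returns a dict mapping window start position to FST.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for pos, p1, n1, p2, n2 in snps:
        window = pos // window_size
        numerator[window] += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        denominator[window] += p1 * (1 - p2) + p2 * (1 - p1)
    # Skip windows with no between-population variation (zero denominator).
    return {
        window * window_size: numerator[window] / denominator[window]
        for window in sorted(numerator)
        if denominator[window] > 0
    }


if __name__ == "__main__":
    example = [
        (100, 1.0, 20, 0.0, 20),   # fixed difference between populations
        (1500, 0.5, 20, 0.5, 20),  # identical allele frequencies
    ]
    for start, fst in hudson_fst_windows(example, window_size=1000).items():
        print(start, round(fst, 4))
```

Sparse marker sets leave many windows with few or no SNPs, which is one way to see why a low-density dataset can miss narrow peaks that denser data resolve.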

An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data

BMC Genomics ◽

10.1186/s12864-015-1647-5 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 68

Author(s):

Huan Fan ◽

Anthony R. Ives ◽

Yann Surget-Groba ◽

Charles H. Cannon

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Phylogeny Reconstruction ◽

Next Generation ◽

Sequencing Data ◽

Alignment Free ◽

Generation Sequencing

Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment

Molecular Ecology Resources ◽

10.1111/1755-0998.12921 ◽

2018 ◽

Vol 18 (6) ◽

pp. 1482-1491 ◽

Cited By ~ 2

Author(s):

Huan Fan ◽

Anthony R. Ives ◽

Yann Surget-Groba

Keyword(s):

Genome Sequencing ◽

Sequencing Data ◽

Reduced Representation

ViR: a tool to solve intrasample variability in the prediction of viral integration sites using whole genome sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-03980-5 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 1

Author(s):

Elisa Pischedda ◽

Cristina Crava ◽

Martina Carlassara ◽

Susanna Zucca ◽

Leila Gasmi ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Model Organisms ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Regulation Of Expression ◽

Transfer Event ◽

A Genome ◽

Integration Sites

Abstract

Background: Several bioinformatics pipelines have been developed to detect sequences from viruses that integrate into the human genome because of the health relevance of these integrations, such as in the persistence of viral infection and/or in generating genotoxic effects, often progressing into cancer. Recent genomics and metagenomics analyses have shown that viruses also integrate into the genome of non-model organisms (i.e., arthropods, fish, plants, vertebrates). However, studies of endogenous viral elements (EVEs) in non-model organisms have rarely gone beyond their characterization from reference genome assemblies. In non-model organisms, we lack a thorough understanding of the widespread occurrence of EVEs and their biological relevance, apart from sporadic cases which nevertheless point to significant roles of EVEs in immunity and regulation of expression. The co-occurrence of repetitive DNA, duplications and/or assembly fragmentation in a genome sequence, together with intrasample variability in whole-genome sequencing (WGS) data, can cause misalignments when mapping data to a genome assembly. This phenomenon hinders our ability to properly identify integration sites.

Results: To fill this gap, we developed ViR, a pipeline which resolves the dispersion of reads due to intrasample variability in sequencing data from both single and pooled DNA samples, thus improving the detection of integration sites. We tested ViR with both in silico and real sequencing data from a non-model organism, the arboviral vector Aedes albopictus. Potential viral integrations predicted by ViR were molecularly validated, supporting the accuracy of ViR results.

Conclusion: ViR will open new avenues to explore the biology of EVEs, especially in non-model organisms. Importantly, while we developed ViR with the identification of EVEs in mind, its application can be extended to detect any lateral transfer event, provided an ad hoc sequence to interrogate.

Alternative applications of whole genome de novo assembly in animal genomics

10.32469/10355/62342 ◽

2017 ◽

Author(s):

Lynsey Whitacre

Keyword(s):

Genome Sequencing ◽

De Novo Assembly ◽

Deoxyribonucleic Acid ◽

De Novo ◽

Population Level ◽

Sequencing Data ◽

Complete Set ◽

Animal Genomics ◽

Assembly Algorithms ◽

Insight Into

Genome sequencing is the process by which the sequence of deoxyribonucleic acid (DNA) residues that comprise the genome, or complete set of genetic material of an organism or individual, is determined. Downstream analysis of genome sequencing data requires that short reads be compiled into contiguous sequences. The methods for doing so, called de novo assembly, are based on statistical methods and graph theory. In addition to genome assembly, the research presented in this dissertation demonstrates alternative uses of these methods. Using these novel approaches, de novo assembly algorithms can be utilized to gain insight into commensal and parasitic organisms of livestock, genes containing candidate mutations for genetic defects, and population-level and species-level variation in poorly studied organisms.
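The abstract says only that de novo assembly rests on graph theory and does not name a specific graph model. As one common illustration, and purely as an assumption here, short reads can be broken into k-mers that form the edges of a de Bruijn graph, whose paths spell out candidate contiguous sequences. The function name below is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.

    Each k-mer contributes one edge from its (k-1)-base prefix to its
    (k-1)-base suffix. Returns a dict mapping each prefix node to the
    set of suffix nodes it connects to.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    # Two overlapping reads from the hypothetical sequence "ACGTA".
    graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

Real assemblers add error correction, coverage statistics, and graph simplification on top of this structure; none of that is shown here.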

From whole genome sequencing data toward a simple genotyping tool: application to the animal pathogen Mycobacterium bovis

10.26226/morressier.56d5ba2ad462b80296c965c0 ◽

2016 ◽

Author(s):

Lorraine Michelet

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Plasmids or no plasmids? A comparison between the Agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

High-precision and cost-efficient sequencing for real-time COVID-19 surveillance

Scientific Reports ◽

10.1038/s41598-021-93145-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sung Yong Park ◽

Gina Faraci ◽

Pamela M. Ward ◽

Jane F. Emerson ◽

Ha Youn Lee

Keyword(s):

Los Angeles ◽

Whole Genome Sequencing ◽

Real Time ◽

Genome Sequencing ◽

High Precision ◽

High Throughput Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Public Health Response ◽

Cost Efficient

Abstract: COVID-19 global cases have climbed to more than 33 million, with over a million total deaths, as of September 2020. Real-time massive SARS-CoV-2 whole genome sequencing is key to tracking chains of transmission and estimating the origin of disease outbreaks. Yet no methods have simultaneously achieved high precision, simple workflow, and low cost. We developed a high-precision, cost-efficient SARS-CoV-2 whole genome sequencing platform for COVID-19 genomic surveillance, CorvGenSurv (Coronavirus Genomic Surveillance). CorvGenSurv directly amplified viral RNA from COVID-19 patients’ nasopharyngeal/oropharyngeal (NP/OP) swab specimens and sequenced the SARS-CoV-2 whole genome in three segments by long-read, high-throughput sequencing. Sequencing of the whole genome in three segments significantly reduced sequencing data waste, thereby preventing dropouts in genome coverage. We validated the precision of our pipeline by both control genomic RNA sequencing and Sanger sequencing. We produced near full-length whole genome sequences from individuals who tested positive for COVID-19 from April to June 2020 in Los Angeles County, California, USA. These sequences were highly diverse in the G clade, with nine novel amino acid mutations including NSP12-M755I and ORF8-V117F. With its readily adaptable design, CorvGenSurv grants wide access to genomic surveillance, permitting immediate public health response to sudden threats.

An Alignment-free Heuristic for Fast Sequence Comparisons with Applications to Phylogeny Reconstruction

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '18 ◽

10.1145/3233547.3233648 ◽

2018 ◽

Author(s):

Jodh Pannu ◽

Sriram P. Chockalingam ◽

Sharma V. Thankachan ◽

Srinivas Aluru

Keyword(s):

Phylogeny Reconstruction ◽

Sequence Comparisons ◽

Alignment Free
