One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.


One is not enough: on the effects of reference genome for the mapping and subsequent analyses of short-reads

10.1101/2020.04.14.041004 ◽

2020 ◽

Author(s):

Carlos Valiente-Mullor ◽

Beatriz Beamud ◽

Iván Ansari ◽

Carlos Francés-Cuesta ◽

Neris García-González ◽

...

Keyword(s):

High Throughput ◽

Legionella Pneumophila ◽

Phylogenetic Trees ◽

High Throughput Sequencing ◽

Reference Genome ◽

Bacterial Species ◽

Genomic Diversity ◽

Reference Sequence ◽

The Impact ◽

Reference Genomes

Author summary: Mapping consists of aligning reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species (a high-quality assembly). However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria.
Biases and errors due to reference choice for mapping in bacteria have been identified. They originate mainly from alignment errors caused by genetic differences between the reference genome and the read sequences. Eventually, they can lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). However, a systematic study of the effects of reference choice in different bacterial species is still missing, particularly regarding its impact on phylogenies. This work was intended to fill that gap. The impact of reference choice proved to be pervasive in the five bacterial species studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, mapping against several reference genomes is advisable to assess the potential biases of mapping.
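One way to explore the reference effect described above is to compare pairwise SNP distances between the same isolates after mapping to two different references. The following is only a minimal sketch under stated assumptions: the isolate names and toy sequences are hypothetical, `N` marks positions that could not be called against a given reference, and the function names are illustrative rather than taken from the study.

```python
from itertools import combinations

VALID_BASES = set("ACGT")

def snp_distance(seq_a, seq_b):
    """Count differing positions, skipping gaps and uncalled bases (e.g. N)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )

def distance_matrix(alignment):
    """Return {(isolate_x, isolate_y): SNP distance} for every isolate pair."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }

def distance_shift(matrix_a, matrix_b):
    """Difference in pairwise distance between two reference mappings."""
    return {pair: matrix_a[pair] - matrix_b[pair] for pair in matrix_a}

# Hypothetical toy alignments of the same isolates against two references.
# With reference 2, positions 6-7 are uncalled (N), hiding one real SNP.
aln_ref1 = {"iso1": "ACGTACGTAC", "iso2": "ACGTTCGAAC", "iso3": "ACGAACGTAA"}
aln_ref2 = {"iso1": "ACGTACNNAC", "iso2": "ACGTTCNNAC", "iso3": "ACGAACNNAA"}

if __name__ == "__main__":
    d1 = distance_matrix(aln_ref1)
    d2 = distance_matrix(aln_ref2)
    for pair, shift in distance_shift(d1, d2).items():
        print(pair, d1[pair], d2[pair], shift)
```

Pairs whose distance changes between references are the ones for which an epidemiological cut-off on genetic distance could give a different answer.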


Functional alterations caused by mutations reflect evolutionary trends of SARS-CoV-2

Briefings in Bioinformatics ◽

10.1093/bib/bbab042 ◽

2021 ◽

Author(s):

Liang Cheng ◽

Xudong Han ◽

Zijun Zhu ◽

Changlu Qi ◽

Ping Wang ◽

...

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Purifying Selection ◽

Virus Genome ◽

Receptor Binding Domain ◽

Evolutionary Trends ◽

Synonymous Mutations ◽

Almost All ◽

Virus Strains ◽

New Mutations

Abstract Since the first report of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in December 2019, the COVID-19 pandemic has spread rapidly worldwide. Because few virus strains were available, early studies observed few of the key mutations that matter for the evolutionary trends of the virus genome. Here, we downloaded sequence data for 1809 SARS-CoV-2 strains from GISAID before April 2020 to identify mutations and the functional alterations they cause. In total, we identified 1017 nonsynonymous and 512 synonymous mutations by alignment to the reference genome NC_045512, none of which were observed in the receptor-binding domain (RBD) of the spike protein. On average, each strain acquired about 1.75 new mutations per month. The current mutations may have little impact on antibodies. Although the whole genome shows purifying selection, ORF3a, ORF8 and ORF10 were under positive selection. Only the 36 mutations that occurred in 1% or more of the virus strains were further analyzed to reveal linkage disequilibrium (LD) variants and dominant mutations. As a result, we observed five dominant mutations: three nonsynonymous mutations (C28144T, C14408T and A23403G) and two synonymous mutations (T8782C and C3037T). These five mutations occurred in almost all strains in April 2020. We also observed two potentially dominant nonsynonymous mutations, C1059T and G25563T, which occurred in most of the strains in April 2020. Further functional analysis shows that these mutations largely decreased protein stability, which could lead to a significant reduction of virus virulence. In addition, the A23403G mutation increases the spike-ACE2 interaction and ultimately enhances infectivity. Together, these results indicate that SARS-CoV-2 is evolving toward enhanced infectivity and reduced virulence.
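The mutation labels above follow the pattern reference base, genome position, alternative base (for example, A23403G), and the abstract keeps only mutations seen in at least 1% of strains. A minimal sketch of that bookkeeping might look as follows; the strain identifiers and their mutation lists are hypothetical toy data, and the helper names are illustrative, not the authors' pipeline.

```python
import re
from collections import Counter

# Label format assumed here: <ref base><1-based position><alt base>, e.g. A23403G.
MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")

def parse_mutation(label):
    """Split a label such as 'A23403G' into (ref, position, alt)."""
    match = MUTATION_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt

def prevalent_mutations(strain_mutations, min_fraction=0.01):
    """Return {mutation: fraction of strains} for mutations at or above the cut-off."""
    n_strains = len(strain_mutations)
    counts = Counter(
        mutation
        for mutations in strain_mutations.values()
        for mutation in set(mutations)  # count each mutation once per strain
    )
    return {
        mutation: count / n_strains
        for mutation, count in counts.items()
        if count / n_strains >= min_fraction
    }

# Hypothetical toy input: strain identifier -> mutations relative to the reference.
toy_strains = {
    "strain_1": ["A23403G", "C3037T"],
    "strain_2": ["A23403G"],
    "strain_3": ["C3037T", "A23403G"],
    "strain_4": ["T8782C"],
}

if __name__ == "__main__":
    print(parse_mutation("A23403G"))
    print(prevalent_mutations(toy_strains, min_fraction=0.5))
```

With real data the default `min_fraction=0.01` corresponds to the 1% threshold mentioned in the abstract; the toy call uses 0.5 only because four strains are too few for a 1% cut-off to be informative.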


Phylogenomics of orchids and their mycorrhizal fungi : trees, diversity, and the pursuit of symbiosis

10.32469/10355/72205 ◽

2019 ◽

Author(s):

Sarah Unruh

Keyword(s):

Mycorrhizal Fungi ◽

Phylogenetic Trees ◽

High Throughput Sequencing ◽

Genomic Sequence ◽

Sequence Data ◽

Mycorrhizal Symbiosis ◽

Sequencing Data ◽

Phylogenetic Structure ◽

University Of Missouri ◽

Fungal Symbiosis

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] Phylogenetic trees show us how organisms are related and provide frameworks for studying and testing evolutionary hypotheses. To better understand the evolution of orchids and their mycorrhizal fungi, I used high-throughput sequencing data and bioinformatic analyses to build phylogenetic hypotheses. In Chapter 2, I used transcriptome sequences both to build a phylogeny of the slipper orchid genera and to confirm the placement of a polyploidy event at the base of the orchid family. Polyploidy is hypothesized to be a strong driver of evolution and a source of unique traits, so confirming this event brings us closer to explaining extant orchid diversity. The list of orthologous genes generated from this study will provide a less expensive and more powerful method for researchers examining the evolutionary relationships in Orchidaceae. In Chapter 3, I generated genomic sequence data for 32 fungal isolates that were collected from orchids across North America. I inferred the first multi-locus nuclear phylogenetic tree for these fungal clades. The phylogenetic structure of these fungi will improve the taxonomy of these clades by providing evidence for new species and for revising problematic species designations. A robust taxonomy is necessary for studying the role of fungi in the orchid mycorrhizal symbiosis. In Chapter 4, I summarize my work and outline the future directions of my lab at Illinois College, including addressing the remaining aims of my Community Sequencing Proposal with the Joint Genome Institute by analyzing the 15 fungal reference genomes I generated during my PhD. Together, these chapters are the start of a life-long research project into the evolution and function of the orchid/fungal symbiosis.


The Theory and Applications of Measuring Broad-Range and Chromosome-Wide Recombination Rate from Allele Frequency Decay around a Selected Locus

Molecular Biology and Evolution ◽

10.1093/molbev/msaa171 ◽

2020 ◽

Vol 37 (12) ◽

pp. 3654-3671

Author(s):

Kevin H -C Wei ◽

Aditya Mantha ◽

Doris Bachtrog

Keyword(s):

Genetic Distance ◽

Allele Frequency ◽

Recombination Rate ◽

High Throughput Sequencing ◽

Genetic Material ◽

Cost Effective ◽

Genetic Distances ◽

Recombination Rates ◽

Marker Selection ◽

Cost Effective Approach

Abstract Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. High-throughput sequencing approaches detect crossovers genome wide to produce recombination rate maps but are difficult to scale as they require large numbers of recombinants individually sequenced. We present a simple and scalable pooled-sequencing approach to experimentally infer near chromosome-wide recombination rates by taking advantage of non-Mendelian allele frequency generated from a fitness differential at a locus under selection. As more crossovers decouple the selected locus from distal loci, the distorted allele frequency attenuates distally toward Mendelian and can be used to estimate the genetic distance. Here, we use marker selection to generate distorted allele frequency and theoretically derive the mathematical relationships between allele frequency attenuation, genetic distance, and recombination rate in marker-selected pools. We implemented nonlinear curve-fitting methods that robustly estimate the allele frequency decay from batch sequencing of pooled individuals and derive chromosome-wide genetic distance and recombination rates. Empirically, we show that marker-selected pools closely recapitulate genetic distances inferred from scoring recombinants. Using this method, we generated novel recombination rate maps of three wild-derived strains of Drosophila melanogaster, which strongly correlate with previous measurements. Moreover, we show that this approach can be extended to estimate chromosome-wide crossover interference with reciprocal marker selection and discuss how it can be applied in the absence of visible markers. Altogether, we find that our method is a simple and cost-effective approach to generate chromosome-wide recombination rate maps requiring only one or two libraries.
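The attenuation of a distorted allele frequency toward the Mendelian value can be illustrated with a standard textbook relation. The sketch below is not the authors' derivation: it assumes Haldane's map function (no crossover interference) and an idealised case in which every selected chromosome carries the selected marker allele, so that the fraction of selected chromosomes still carrying the allele originally linked to the marker is 1 - r(d). All function names are illustrative.

```python
import math

def haldane_recombination_fraction(d_morgans):
    """Haldane's map function: r = (1 - exp(-2d)) / 2, with d in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def linked_allele_fraction(d_morgans):
    """Fraction of selected chromosomes keeping the originally linked allele.

    Equals 1 at the selected locus and decays toward 0.5 (Mendelian) with distance.
    """
    return 1.0 - haldane_recombination_fraction(d_morgans)

def distance_from_fraction(fraction):
    """Invert the decay curve: genetic distance (Morgans) from an observed fraction."""
    if not 0.5 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0.5, 1.0] under this model")
    return -0.5 * math.log(2.0 * fraction - 1.0)

if __name__ == "__main__":
    for d in (0.0, 0.1, 0.5, 1.0, 2.0):
        print(f"d = {d:.1f} M  fraction = {linked_allele_fraction(d):.4f}")
```

In practice the observed frequencies are noisy, which is why the abstract describes nonlinear curve fitting across many loci rather than inverting the relation point by point.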


Genotyping-by-sequencing enables linkage mapping in three octoploid cultivated strawberry families

10.7287/peerj.preprints.2975v1 ◽

2017 ◽

Author(s):

Kelly J Vining ◽

Natalia Salinas ◽

Jacob A Tennessen ◽

Jason D Zurn ◽

Daniel James Sargent ◽

...

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Genotyping By Sequencing ◽

Nucleotide Polymorphisms ◽

Linkage Groups ◽

Single Nucleotide ◽

Ancestral Species ◽

Polymorphic Snps ◽

Genome Wide ◽

Diploid Ancestor

With the goal of evaluating genotyping-by-sequencing (GBS) in a species with a complex octoploid genome, GBS was used to survey genome-wide single-nucleotide polymorphisms (SNPs) in three biparental strawberry (Fragaria ×ananassa) populations. GBS sequence data were aligned to the F. vesca ‘Fvb’ reference genome in order to call SNPs. Numbers of polymorphic SNPs per population ranged from 1,163 to 3,190. Linkage maps consisting of 30-65 linkage groups were produced from the SNP sets derived from each parent. The linkage groups covered 99% of the Fvb reference genome, with three to seven linkage groups from a given parent aligned to any particular chromosome. A phylogenetic analysis performed using the POLiMAPS pipeline revealed linkage groups that were most similar to ancestral species F. vesca for each chromosome. Linkage groups that were most similar to a second ancestral species, F. iinumae, were only resolved for Fvb 4. The quantity of missing data and heterogeneity in genome coverage inherent in GBS complicated the analysis, but POLiMAPS resolved F. ×ananassa chromosomal regions derived from diploid ancestor F. vesca.


Genotyping-by-sequencing enables linkage mapping in three octoploid cultivated strawberry families

PeerJ ◽

10.7717/peerj.3731 ◽

2017 ◽

Vol 5 ◽

pp. e3731 ◽

Cited By ~ 11

Author(s):

Kelly J. Vining ◽

Natalia Salinas ◽

Jacob A. Tennessen ◽

Jason D. Zurn ◽

Daniel James Sargent ◽

...

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Genotyping By Sequencing ◽

Nucleotide Polymorphisms ◽

Linkage Groups ◽

Single Nucleotide ◽

Ancestral Species ◽

Polymorphic Snps ◽

Genome Wide ◽

Diploid Ancestor

Genotyping-by-sequencing (GBS) was used to survey genome-wide single-nucleotide polymorphisms (SNPs) in three biparental strawberry (Fragaria × ananassa) populations with the goal of evaluating this technique in a species with a complex octoploid genome. GBS sequence data were aligned to the F. vesca ‘Fvb’ reference genome in order to call SNPs. Numbers of polymorphic SNPs per population ranged from 1,163 to 3,190. Linkage maps consisting of 30–65 linkage groups were produced from the SNP sets derived from each parent. The linkage groups covered 99% of the Fvb reference genome, with three to seven linkage groups from a given parent aligned to any particular chromosome. A phylogenetic analysis performed using the POLiMAPS pipeline revealed linkage groups that were most similar to ancestral species F. vesca for each chromosome. Linkage groups that were most similar to a second ancestral species, F. iinumae, were only resolved for Fvb4. The quantity of missing data and heterogeneity in genome coverage inherent in GBS complicated the analysis, but POLiMAPS resolved F. × ananassa chromosomal regions derived from diploid ancestor F. vesca.


Single molecule, near full-length genome sequencing of dengue virus

Scientific Reports ◽

10.1038/s41598-020-75374-1 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Thiruni N. Adikari ◽

Nasir Riaz ◽

Chathurani Sigera ◽

Preston Leung ◽

Braulio M. Valencia ◽

...

Keyword(s):

Dengue Virus ◽

Single Molecule ◽

Phylogenetic Trees ◽

Sequence Data ◽

Sequence Similarity ◽

Genetic Distances ◽

Full Length ◽

Denv Serotypes ◽

Consensus Sequences ◽

Pairwise Sequence Similarity

Abstract Current methods for dengue virus (DENV) genome amplification amplify the genome in at least 5 overlapping segments and then combine the output to characterize a full genome. This process is laborious and costly, and it requires at least 10 primers per serotype, thus increasing the likelihood of PCR bias. We introduce an assay to amplify near full-length dengue virus genomes as intact molecules, sequence these amplicons with third-generation “nanopore” technology without fragmenting, and use the sequence data to differentiate within-host viral variants with a bioinformatics tool (Nano-Q). The new assay successfully generated near full-length amplicons from DENV serotype 1, 2 and 3 samples, which were sequenced with nanopore technology. Consensus DENV sequences generated by nanopore sequencing had over 99.5% pairwise sequence similarity to their Illumina-generated counterparts, provided the coverage was > 100 with both platforms. Maximum likelihood phylogenetic trees generated from nanopore consensus sequences reproduced the exact trees made from Illumina sequencing with a conservative 99% bootstrapping threshold (after 1000 replicates and 10% burn-in). Pairwise genetic distances between within-host variants identified with the Nano-Q tool were smaller than those between variants from different hosts, thus enabling the phylogenetic segregation of variants from the same host.


High-throughput sequencing of SARS-CoV-2 in wastewater provides insights into circulating variants

10.1101/2021.01.22.21250320 ◽

2021 ◽

Author(s):

Rafaela S. Fontenele ◽

Simona Kraberger ◽

James Hadfield ◽

Erin M. Driver ◽

Devin Bowes ◽

...

Keyword(s):

Population Structure ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequence Data ◽

Genomic Diversity ◽

Contact Tracing ◽

Public Health Response ◽

Genetic Population ◽

Genomic Epidemiology ◽

Derived Data

Abstract Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged from a zoonotic spill-over event and has led to a global pandemic. The public health response has been predominantly informed by surveillance of symptomatic individuals and contact tracing, with quarantine and other preventive measures then applied to mitigate further spread. Non-traditional methods of surveillance such as genomic epidemiology and wastewater-based epidemiology (WBE) have also been leveraged during this pandemic. Genomic epidemiology uses high-throughput sequencing of SARS-CoV-2 genomes to inform local and international transmission events, as well as the diversity of circulating variants. WBE uses wastewater to analyse community spread, as it is known that SARS-CoV-2 is shed through bodily excretions. Since both symptomatic and asymptomatic individuals contribute to wastewater inputs, we hypothesized that the resultant pooled sample of population-wide excreta can provide a more comprehensive picture of SARS-CoV-2 genomic diversity circulating in a community than clinical testing and sequencing alone. In this study, we analysed 91 wastewater samples from 11 states in the USA, where the majority of samples represent Maricopa County, Arizona (USA). With the objective of assessing the viral diversity at a population scale, we undertook a single-nucleotide variant (SNV) analysis on data from 52 samples with >90% SARS-CoV-2 genome coverage of sequence reads, and compared these SNVs with those detected in genomes sequenced from clinical patients. We identified 7973 SNVs, of which 5680 were “novel” SNVs that had not yet been identified in the global clinical-derived data as of 17th June 2020 (the day after our last wastewater sampling date). However, between 17th June 2020 and 20th November 2020, almost half of these SNVs were subsequently detected in clinical-derived data.
Using the combination of SNVs present in each sample, we identified the more probable lineages present in that sample and compared them to lineages observed in North America prior to our sampling dates. The wastewater-derived SARS-CoV-2 sequence data indicate that more lineages were circulating across the sampled communities than were represented in the clinical-derived data. Principal coordinate analyses identified patterns in population structure based on genetic variation within the sequenced samples, with clear trends associated with increased diversity, likely due to a higher number of infected individuals relative to the sampling dates. We demonstrate that genetic correlation analysis combined with SNV analysis using wastewater sampling can provide a comprehensive snapshot of the SARS-CoV-2 genetic population structure circulating within a community, which might not be observed if relying solely on clinical cases.
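The comparison between wastewater-derived and clinical-derived SNVs is, at its core, a set comparison. The sketch below assumes each SNV is keyed by (position, reference base, alternative base); the toy SNVs are hypothetical and the function name is illustrative. Only the counts 5680 and 7973 come from the abstract.

```python
def classify_snvs(wastewater, clinical):
    """Split wastewater SNVs into those absent from and those shared with clinical data."""
    novel = wastewater - clinical
    shared = wastewater & clinical
    return novel, shared

# Hypothetical toy SNVs keyed as (position, ref, alt); not real observations.
wastewater_snvs = {(100, "A", "G"), (250, "C", "T"), (900, "G", "T"), (1500, "T", "C")}
clinical_snvs = {(100, "A", "G"), (900, "G", "T"), (2000, "C", "A")}

# Fraction of wastewater SNVs reported as novel in the abstract (5680 of 7973).
reported_novel_fraction = 5680 / 7973

if __name__ == "__main__":
    novel, shared = classify_snvs(wastewater_snvs, clinical_snvs)
    print("novel:", sorted(novel))
    print("shared:", sorted(shared))
    print(f"reported novel fraction: {reported_novel_fraction:.3f}")
```

Keying on the full (position, ref, alt) tuple rather than position alone avoids counting two different substitutions at the same site as the same variant.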


Pitfalls in supermatrix phylogenomics

European Journal of Taxonomy ◽

10.5852/ejt.2017.283 ◽

2017 ◽

Cited By ~ 13

Author(s):

Hervé Philippe ◽

Damien M. de Vienne ◽

Vincent Ranwez ◽

Béatrice Roure ◽

Denis Baurain ◽

...

Keyword(s):

Systematic Error ◽

Phylogenetic Trees ◽

Molecular Phylogenetics ◽

High Throughput Sequencing ◽

Sequence Data ◽

Single Gene ◽

Sequence Evolution ◽

Adequate Model ◽

Stochastic Error ◽

Genomic Scale

In the mid-2000s, molecular phylogenetics turned into phylogenomics, a development that improved the resolution of phylogenetic trees through a dramatic reduction in stochastic error. While some then predicted “the end of incongruence”, it soon appeared that analysing large amounts of sequence data without an adequate model of sequence evolution amplifies systematic error and leads to phylogenetic artefacts. With the increasing flood of (sometimes low-quality) genomic data resulting from the rise of high-throughput sequencing, a new type of error has emerged. Termed here “data errors”, it lumps together several kinds of issues affecting the construction of phylogenomic supermatrices (e.g., sequencing and annotation errors, contaminant sequences). While easy to deal with at a single-gene scale, such errors become very difficult to avoid at the genomic scale, both because hand curating thousands of sequences is prohibitively time-consuming and because the suitable automated bioinformatics tools are still in their infancy. In this paper, we first review the pitfalls affecting the construction of supermatrices and the strategies to limit their adverse effects on phylogenomic inference. Then, after discussing the relative non-issue of missing data in supermatrices, we briefly present the approaches commonly used to reduce systematic error.
