One is not enough: on the effects of reference genome for the mapping and subsequent analyses of short-reads

Abstract: Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of reference may be a source of errors that affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of reference genome proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications, such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.

Author summary: Mapping consists of aligning reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species (a high-quality assembly). However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. Biases and errors due to reference choice for mapping in bacteria have been identified. They originate mainly from alignment errors caused by genetic differences between the reference genome and the read sequences. Eventually, they could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). However, a systematic study of the effects of reference choice in different bacterial species is still missing, particularly regarding its impact on phylogenies. This work intended to fill that gap. The impact of reference choice proved to be pervasive in the five bacterial species that we studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be advisable to assess the potential biases of mapping.
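As a rough illustration of why variant calls depend on the reference, the sketch below counts SNP positions for one sample against two different references. It is a toy under stated assumptions, not the authors' pipeline: the sequences are made up, they are assumed to be already aligned to the same length, and the helper name `snp_positions` is hypothetical. It does not model read mapping or alignment errors, which is where the biases described above actually arise.

```python
def snp_positions(reference: str, sample: str) -> list[int]:
    """Return 0-based alignment columns where the sample differs from the reference.

    Columns with a gap ('-') or an ambiguous base ('N') in either sequence are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return [
        i
        for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()))
        if r not in skip and s not in skip and r != s
    ]


# Made-up, pre-aligned toy sequences (assumption: not real genomes).
references = {
    "ref_A": "ACGTACGTAC",
    "ref_B": "ACGTTCGTAA",
}
sample = "ACGTTCGNAC"

# The same sample yields a different SNP set for each reference.
snps_by_reference = {
    name: snp_positions(seq, sample) for name, seq in references.items()
}
for name, positions in snps_by_reference.items():
    print(name, len(positions), positions)
```

Comparing such per-reference SNP sets across several candidate references is one simple way to see how much a downstream result depends on the reference that was chosen.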


One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008678 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008678

Author(s):

Carlos Valiente-Mullor ◽

Beatriz Beamud ◽

Iván Ansari ◽

Carlos Francés-Cuesta ◽

Neris García-González ◽

...

Keyword(s):

Legionella Pneumophila ◽

Phylogenetic Trees ◽

High Throughput Sequencing ◽

Reference Genome ◽

Sequence Data ◽

Genetic Distances ◽

Genomic Diversity ◽

Nucleotide Polymorphisms ◽

Recombination Rates ◽

Almost All


High-Throughput Sequencing is a Crucial Tool to Investigate the Contribution of Human Endogenous Retroviruses (HERVs) to Human Biology and Development

Viruses ◽

10.3390/v12060633 ◽

2020 ◽

Vol 12 (6) ◽

pp. 633 ◽

Cited By ~ 1

Author(s):

Maria Paola Pisano ◽

Nicole Grandi ◽

Enzo Tramontano

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Developmental Stages ◽

Large Fraction ◽

Expression Patterns ◽

Cell Types ◽

Endogenous Retroviruses ◽

Human Endogenous Retroviruses ◽

Retroviral Infections ◽

The Impact

Human endogenous retroviruses (HERVs) are remnants of ancient retroviral infections that represent a large fraction of our genome. Their transcriptional activity is finely regulated in early developmental stages and their expression is modulated in different cell types and tissues. Such activity has an impact on human physiology and pathology that is only partially understood to date. Novel high-throughput sequencing tools have recently allowed great advances in elucidating the various HERV expression patterns in different tissues, as well as the mechanisms controlling their transcription, and overall have helped in gaining a more comprehensive understanding of the impact of HERVs on the biology of the host.


Directed Culturing of Microorganisms Using Metatranscriptomics

mBio ◽

10.1128/mbio.00012-11 ◽

2011 ◽

Vol 2 (2) ◽

Cited By ~ 92

Author(s):

Lindsey Bomar ◽

Michele Maltz ◽

Sophie Colston ◽

Joerg Graf

Keyword(s):

High Throughput ◽

Culture Medium ◽

High Throughput Sequencing ◽

Bacterial Species ◽

Hydrolytic Enzymes ◽

Medicinal Leech ◽

Expression Data ◽

Rna Seq ◽

Rna Transcripts ◽

Uncultured Microorganisms

ABSTRACT The vast majority of bacterial species remain uncultured, and this severely limits the investigation of their physiology, metabolic capabilities, and role in the environment. High-throughput sequencing of RNA transcripts (RNA-seq) allows the investigation of the diverse physiologies from uncultured microorganisms in their natural habitat. Here, we report the use of RNA-seq for characterizing the metatranscriptome of the simple gut microbiome from the medicinal leech Hirudo verbana and for utilizing this information to design a medium for cultivating members of the microbiome. Expression data suggested that a Rikenella-like bacterium, the most abundant but uncultured symbiont, forages on sulfated- and sialated-mucin glycans that are fermented, leading to the secretion of acetate. Histological stains were consistent with the presence of sulfated and sialated mucins along the crop epithelium. The second dominant symbiont, Aeromonas veronii, grows in two different microenvironments and is predicted to utilize either acetate or carbohydrates. Based on the metatranscriptome, a medium containing mucin was designed, which enabled the cultivation of the Rikenella-like bacterium. Metatranscriptomes shed light on microbial metabolism in situ and provide critical clues for directing the culturing of uncultured microorganisms. By choosing a condition under which the desired organism is rapidly proliferating and focusing on highly expressed genes encoding hydrolytic enzymes, binding proteins, and transporters, one can identify an organism’s nutritional preferences and design a culture medium.

IMPORTANCE The number of prokaryotes on the planet has been estimated to exceed 10^30 cells, and the overwhelming majority of them have evaded cultivation, making it difficult to investigate their ecological, medical, and industrial relevance. The application of transcriptomics based on high-throughput sequencing of RNA transcripts (RNA-seq) to microorganisms in their natural environment can provide investigators with insight into their physiologies under optimal growth conditions. We utilized RNA-seq to learn more about the uncultured and cultured symbionts that comprise the relatively simple digestive-tract microbiome of the medicinal leech. The expression data revealed highly expressed hydrolytic enzymes and transporters that provided critical clues for the design of a culture medium enabling the isolation of the previously uncultured Rikenella-like symbiont. This directed culturing method will greatly aid efforts aimed at understanding uncultured microorganisms, including beneficial symbionts, pathogens, and ecologically relevant microorganisms, by facilitating genome sequencing, physiological characterization, and genetic manipulation of the previously uncultured microbes.


Deciphering transcriptional control mechanisms in hematopoiesis—The impact of high-throughput sequencing technologies

Experimental Hematology ◽

10.1016/j.exphem.2011.07.005 ◽

2011 ◽

Vol 39 (10) ◽

pp. 961-968 ◽

Cited By ~ 5

Author(s):

Nicola K. Wilson ◽

Marloes R. Tijssen ◽

Berthold Göttgens

Keyword(s):

High Throughput ◽

Transcriptional Control ◽

High Throughput Sequencing ◽

Control Mechanisms ◽

Sequencing Technologies ◽

The Impact


Multi-Body-Site Microbiome and Culture Profiling of Military Trainees Suffering from Skin and Soft Tissue Infections at Fort Benning, Georgia

mSphere ◽

10.1128/msphere.00232-16 ◽

2016 ◽

Vol 1 (5) ◽

Cited By ~ 10

Author(s):

Jatinder Singh ◽

Ryan C. Johnson ◽

Carey D. Schlett ◽

Emad M. Elassal ◽

Katrina B. Crawford ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Bacterial Community Composition ◽

Soft Tissue Infections ◽

Microbial Composition ◽

Content Type ◽

Microbial Dysbiosis ◽

Lack Of Information ◽

The Impact ◽

Nasal Microbiota

ABSTRACT Skin and soft tissue infections (SSTIs) are common in the general population, with increased prevalence among military trainees. Previous research has revealed numerous nasal microbial signatures that correlate with SSTI development and Staphylococcus aureus colonization. Thus, we hypothesized that the ecology of the inguinal, oropharynx, and perianal regions may also be altered in response to SSTI and/or S. aureus colonization. We collected body site samples from 46 military trainees with purulent abscess (SSTI group) as well as from 66 asymptomatic controls (non-SSTI group). We also collected abscess cavity samples to assess the microbial composition of these infections. Samples were analyzed by culture, and the microbial communities were characterized by high-throughput sequencing. We found that the nasal, inguinal, and perianal regions were similar in microbial composition and significantly differed from the oropharynx. We also observed differences in Anaerococcus and Streptococcus abundance between the SSTI and non-SSTI groups for the nasal and oropharyngeal regions, respectively. Furthermore, we detected community membership differences between the SSTI and non-SSTI groups for the nasal and inguinal sites. Compared to that of the other regions, the microbial compositions of the nares of S. aureus carriers and noncarriers were dramatically different; we noted an inverse correlation between the presence of Corynebacterium and the presence of Staphylococcus in the nares. This correlation was also observed for the inguinal region. Culture analysis revealed elevated methicillin-resistant S. aureus (MRSA) colonization levels for the SSTI group in the nasal and inguinal body sites. Together, these data suggest significant microbial variability in patients with SSTI as well as between S. aureus carriers and noncarriers.

IMPORTANCE While it is evident that nasal colonization with S. aureus increases the likelihood of SSTI, there is a significant lack of information regarding the contribution of extranasal colonization to the overall risk of a subsequent SSTI. Furthermore, the impact of S. aureus colonization on bacterial community composition outside the nasal microbiota is unclear. Thus, this report represents the first investigation that utilized both culture and high-throughput sequencing techniques to analyze microbial dysbiosis at multiple body sites of healthy and diseased/colonized individuals. The results described here may be useful in the design of future methodologies to treat and prevent SSTIs.


A Tree of Human Gut Bacterial Species and its Applications to Metagenomics and Metaproteomics Data Analysis

10.1101/2020.09.24.311720 ◽

2020 ◽

Author(s):

Moses Stamboulian ◽

Thomas G. Doak ◽

Yuzhen Ye

Keyword(s):

Gut Microbiome ◽

Phylogenetic Trees ◽

Bacterial Species ◽

Taxonomic Composition ◽

Marker Genes ◽

Missing Information ◽

Human Gut ◽

Taxonomic Profiling ◽

Tree Building ◽

The Impact

Abstract Background: Recent advances in genome and metagenome sequencing have dramatically enriched the collection of genomes of bacterial species related to human health and diseases. In metagenomic studies phylogenetic trees are commonly used to depict, describe, and compare the bacterial members of the community under study. The most accurate tree-building algorithms now use large sets of marker genes taken from across genomes. However, many of the current bacterial genomes were assembled from metagenomic datasets (i.e., metagenome assembled genomes, MAGs), and often contain missing information. It is therefore important to study how well the phylogeny approach performs on such genomes. Further, phylogeny methods are not perfect and it is important to know how reliable an inferred tree is. Results: Here we examined the impact of incompleteness of the genomes on the tree reconstruction, and we showed that phylogeny approaches including RAxML (which handles missing data explicitly) and FastTree generally performed well on a simulated collection of 400 genomes with missing information. As RAxML is computationally prohibitive for the much larger collections of gut genomes, we chose FastTree to build a unified tree of human-gut associated bacterial species (referred to as gut tree), including more than 3000 genomes, most of which are incomplete. We developed two downstream applications of the gut tree: peptide-centric analysis of metaproteomics datasets; and taxonomic characterization of metagenomic sequences. In both applications, the gut tree provided the basis for quantification of species composition at various taxonomic resolutions. Conclusions: The gut tree presented in this study provides a useful framework for taxonomic profiling of the human gut microbiome. Including MAGs in the tree provides a more comprehensive representation of the microbial species diversity associated with the human gut, important for studying the taxonomic composition of the gut microbiome. Availability and Implementation: The tree construction pipeline and downstream applications of the gut tree are freely available at https://github.com/mgtools/guttree.


jackalope: a swift, versatile phylogenomic and high-throughput sequencing simulator

10.1101/650747 ◽

2019 ◽

Author(s):

Lucas A. Nell

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Population Genomics ◽

R Package ◽

Gene Trees ◽

Sequencing Platform ◽

Genomic Variants ◽

Pacific Biosciences ◽

Wide Range ◽

Reference Genomes

Abstract: High-throughput sequencing (HTS) is central to the study of population genomics and has an increasingly important role in constructing phylogenies. Choices in research design for sequencing projects can include a wide range of factors, such as sequencing platform, depth of coverage, and bioinformatic tools. Simulating HTS data better informs these decisions. However, current standalone HTS simulators cannot generate genomic variants under even somewhat complex evolutionary scenarios, which greatly reduces their usefulness for fields such as population genomics and phylogenomics. Here I present the R package jackalope that simply and efficiently simulates (i) variants from reference genomes and (ii) reads from both Illumina and Pacific Biosciences (PacBio) platforms. Genomic variants can be simulated using phylogenies, gene trees, coalescent-simulation output, population-genomic summary statistics, and Variant Call Format (VCF) files. jackalope can simulate single, paired-end, or mate-pair Illumina reads, as well as reads from Pacific Biosciences. These simulations include sequencing errors, mapping qualities, multiplexing, and optical/PCR duplicates. It can read reference genomes from FASTA files and can simulate new ones, and all outputs can be written to standard file formats. jackalope is available for Mac, Windows, and Linux systems.
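To make the idea of read simulation concrete, here is a minimal Python sketch of sampling single-end reads from a reference with uniform substitution errors. It is not jackalope's API (jackalope is an R package) and not its error model; the function name `simulate_reads`, its parameters, and the toy reference are all assumptions made for illustration.

```python
import random


def simulate_reads(reference, n_reads, read_length, error_rate=0.01, seed=0):
    """Sample single-end reads uniformly from `reference` with substitution errors.

    Returns a list of (start, read) tuples, where `start` is the 0-based
    position the read was drawn from. Indels and quality scores are not modelled.
    """
    if read_length > len(reference):
        raise ValueError("read_length must not exceed the reference length")
    rng = random.Random(seed)  # local generator, so results are reproducible
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        fragment = reference[start:start + read_length]
        read = "".join(
            # With probability error_rate, substitute a different base.
            rng.choice([b for b in bases if b != base])
            if rng.random() < error_rate
            else base
            for base in fragment
        )
        reads.append((start, read))
    return reads


# Toy reference (assumption: made up, not a real genome).
reference = "ACGT" * 25
reads = simulate_reads(reference, n_reads=5, read_length=20, error_rate=0.05, seed=1)
for start, read in reads:
    print(start, read)
```

Setting `error_rate=0.0` returns exact substrings of the reference, which is a convenient sanity check before adding a more realistic error model.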


SMRT Genome Assembly Corrects Reference Errors, Resolving the Genetic Basis of Virulence in Mycobacterium tuberculosis

10.1101/064840 ◽

2016 ◽

Author(s):

Afif Elghraoui ◽

Samuel J Modlin ◽

Faramarz Valafar

Keyword(s):

Mycobacterium Tuberculosis ◽

Single Molecule ◽

Genetic Basis ◽

Reference Genome ◽

Reference Sequence ◽

Smrt Sequencing ◽

Virulence Attenuation ◽

Sequencing Platforms ◽

Genome Comparisons ◽

Reference Genomes

Abstract: The genetic basis of virulence in Mycobacterium tuberculosis has been investigated through genome comparisons of its virulent (H37Rv) and attenuated (H37Ra) sister strains. Such analysis, however, relies heavily on the accuracy of the sequences. While the H37Rv reference genome has had several corrections to date, that of H37Ra is unmodified since its original publication. Here, we report the assembly and finishing of the H37Ra genome from single-molecule, real-time (SMRT) sequencing. Our assembly reveals that the number of H37Ra-specific variants is less than half of what the Sanger-based H37Ra reference sequence indicates, undermining and, in some cases, invalidating the conclusions of several studies. PE_PPE family genes, which are intractable to commonly-used sequencing platforms because of their repetitive and GC-rich nature, are overrepresented in the set of genes in which all reported H37Ra-specific variants are contradicted. We discuss how our results change the picture of virulence attenuation and the power of SMRT sequencing for producing high-quality reference genomes.


sppIDer: a species identification tool to investigate hybrid genomes with high-throughput sequencing

10.1101/333815 ◽

2018 ◽

Cited By ~ 1

Author(s):

Quinn K. Langdon ◽

David Peris ◽

Brian Kyle ◽

Chris Todd Hittinger

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Rapid Identification ◽

Sequencing Data ◽

Pure Species ◽

High Throughput Sequencing Data ◽

Interspecies Hybrids ◽

Evolutionary Trajectories ◽

Low Coverage ◽

Reference Genomes

Abstract: The genomics era has expanded our knowledge about the diversity of the living world, yet harnessing high-throughput sequencing data to investigate alternative evolutionary trajectories, such as hybridization, is still challenging. Here we present sppIDer, a pipeline for the characterization of interspecies hybrids and pure species, that illuminates the complete composition of genomes. sppIDer maps short-read sequencing data to a combination genome built from reference genomes of several species of interest and assesses the genomic contribution and relative ploidy of each parental species, producing a series of colorful graphical outputs ready for publication. As a proof-of-concept, we use the genus Saccharomyces to detect and visualize both interspecies hybrids and pure strains, even with missing parental reference genomes. Through simulation, we show that sppIDer is robust to variable reference genome qualities and performs well with low-coverage data. We further demonstrate the power of this approach in plants, animals, and other fungi. sppIDer is robust to many different inputs and provides visually intuitive insight into genome composition that enables the rapid identification of species and their interspecies hybrids. sppIDer exists as a Docker image, which is a reusable, reproducible, transparent, and simple-to-run package that automates the pipeline and installation of the required dependencies (https://github.com/GLBRC/sppIDer).
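The core idea of mapping reads to a combination genome and then summarizing each species' contribution can be sketched as follows. This is a hedged illustration, not sppIDer's code: the `species-contig` naming scheme, the MAPQ threshold of 3, the function name `species_fractions`, and the toy alignments are all assumptions.

```python
from collections import Counter


def species_fractions(alignments, min_mapq=3, sep="-"):
    """Fraction of confidently mapped reads assigned to each species.

    `alignments` is an iterable of (contig_name, mapq) pairs, where contig
    names in the combined reference are assumed to look like 'Species-contig'.
    Reads with MAPQ below `min_mapq` are ignored as ambiguous.
    """
    counts = Counter(
        contig.split(sep, 1)[0]
        for contig, mapq in alignments
        if mapq >= min_mapq
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}


# Toy alignments for a hypothetical two-species hybrid (made-up values).
alignments = [
    ("Scer-chrI", 60),
    ("Scer-chrII", 42),
    ("Seub-chrI", 60),
    ("Scer-chrIV", 0),  # multi-mapping read, dropped by the MAPQ filter
    ("Seub-chrV", 30),
]
fractions = species_fractions(alignments)
print(fractions)
```

In practice the (contig, MAPQ) pairs would come from the aligner's output; a roughly even split between two species, as in this toy input, is the pattern one would expect from a hybrid rather than a pure strain.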


Impact of human gene annotations on RNA-seq differential expression analysis

10.21203/rs.3.rs-301856/v1 ◽

2021 ◽

Author(s):

Yu Hamaguchi ◽

Chao Zeng ◽

Michiaki Hamada

Keyword(s):

Differential Expression ◽

High Throughput ◽

High Throughput Sequencing ◽

Human Gene ◽

Gene Annotation ◽

Differential Expression Analysis ◽

Rna Seq ◽

Gene Annotations ◽

Sequencing Technologies ◽

The Impact

Abstract Background: Differential expression (DE) analysis of RNA-seq data typically depends on gene annotations. Different sets of gene annotations are available for the human genome and are continually updated, a process complicated by the development and application of high-throughput sequencing technologies. However, the impact of the complexity of gene annotations on DE analysis remains unclear. Results: Using “mappability”, a metric of the complexity of gene annotation, we compared three distinct human gene annotations, GENCODE, RefSeq, and NONCODE, and evaluated how mappability affected DE analysis. We found that mappability was significantly different among the human gene annotations. We also found that increasing mappability improved the performance of DE analysis, and that the impact of mappability was mainly evident in the quantification step and propagated systematically to the downstream steps of DE analysis. Conclusions: We assessed how the complexity of gene annotations affects DE analysis using mappability. Our findings indicate that the growth and complexity of gene annotations negatively impact the performance of DE analysis, suggesting that an approach that excludes unnecessary gene models from gene annotations improves the performance of DE analysis.
