Using QC-Blind for quality control and contamination screening of bacteria DNA sequencing data without reference genome

ABSTRACT Quality control in next-generation sequencing has become increasingly important as the technique becomes widely used. Tools have been developed for filtering possible contaminants in the sequencing data of species with a known reference genome. Unfortunately, these tools require reference genomes for all the species involved, including the contaminants. This excludes many real-life samples for which no complete genome of the target species is available and which are contaminated with unknown microbial species. In this work we propose QC-Blind, a novel quality control pipeline for removing contaminants without any use of reference genomes. The pipeline requires only a small amount of information from the marker genes of the target species. The entire pipeline consists of unsupervised read assembly, contig binning, read clustering and marker gene assignment. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. Therefore, QC-Blind could serve well in situations where limited information is available for both target and contamination species.

IMPORTANCE At present, many sequencing projects are still performed on potentially contaminated samples, which calls their accuracy into question. However, current reference-based quality control methods are limited because they need the genome of either the target species or the contaminants. In this work we propose QC-Blind, a novel quality control pipeline for removing contaminants without any use of reference genomes. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. Therefore, QC-Blind is suitable for real-life samples where limited information is available for both target and contamination species.
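
The last stage named in the abstract, marker gene assignment, can be illustrated with a toy sketch. This is not the QC-Blind implementation: the function name `split_bins_by_marker`, the layout of the bins, and the use of exact substring matching in place of a real sequence search are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def split_bins_by_marker(
    bins: Dict[str, List[str]],
    target_markers: List[str],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split contig bins into (target, other) by marker-gene presence.

    Hypothetical helper: a bin counts as "target" if any of its contigs
    contains any target marker sequence as an exact substring. A real
    pipeline would use a proper sequence search instead.
    """
    target: Dict[str, List[str]] = {}
    other: Dict[str, List[str]] = {}
    for name, contigs in bins.items():
        has_marker = any(m in c for c in contigs for m in target_markers)
        (target if has_marker else other)[name] = contigs
    return target, other


# Synthetic example data, not taken from the paper.
bins = {
    "bin1": ["AAACCCGGGTTT", "GGGTTTAAACCC"],
    "bin2": ["TTTTTTTTTTTT"],
}
target, other = split_bins_by_marker(bins, ["CCCGGG"])
```

In this toy run, `bin1` carries the marker and is kept as target, while `bin2` is set aside as a possible contaminant.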

Reference flow: reducing reference bias using multiple population genomes

Genome Biology ◽

10.1186/s13059-020-02229-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
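
The idea of using several population references to reduce bias can be illustrated with a toy sketch. The abstract does not say how reads are routed between references, so this is not the reference flow algorithm itself: the sketch simply places a read against every reference by ungapped mismatch counting and keeps the best hit. All names (`best_placement`, `align_to_references`) and the sequences are hypothetical.

```python
from typing import Dict, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_placement(read: str, reference: str) -> Tuple[int, int]:
    """Return (mismatches, position) of the best ungapped placement.

    Returns (len(read) + 1, -1) if the reference is shorter than the read.
    """
    best = (len(read) + 1, -1)
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm < best[0]:
            best = (mm, pos)
    return best


def align_to_references(
    read: str, references: Dict[str, str]
) -> Tuple[str, int, int]:
    """Place the read on every reference; return (name, position, mismatches)
    for the reference with the fewest mismatches (first one wins on ties)."""
    results = {name: best_placement(read, seq) for name, seq in references.items()}
    name = min(results, key=lambda n: results[n][0])
    mm, pos = results[name]
    return name, pos, mm


# Synthetic sequences: the population reference carries a C where the
# linear reference has a G, and the read carries the C allele.
references = {
    "linear": "ACGTACGTTTGA",
    "population_A": "ACGTACCTTTGA",
}
hit = align_to_references("ACCTTT", references)
```

Here the read matches `population_A` exactly at position 4 but has one mismatch against `linear`, which is the kind of penalty that reference bias refers to.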

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity. However, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events than the competing method, with comparable or higher accuracy. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.

In Vivo Action of the HRD Ubiquitin Ligase Complex: Mechanisms of Endoplasmic Reticulum Quality Control and Sterol Regulation

Molecular and Cellular Biology ◽

10.1128/mcb.21.13.4276-4291.2001 ◽

2001 ◽

Vol 21 (13) ◽

pp. 4276-4291 ◽

Cited By ~ 93

Author(s):

Richard G. Gardner ◽

Alexander G. Shearer ◽

Randolph Y. Hampton

Keyword(s):

Quality Control ◽

Endoplasmic Reticulum ◽

Ubiquitin Ligase ◽

Sequence Similarity ◽

High Specificity ◽

Specific Sequence ◽

Ubiquitin Ligase Complex ◽

Ubiquitin Conjugating Enzyme ◽

Conjugating Enzyme

ABSTRACT Ubiquitination is used to target both normal proteins for specific regulated degradation and misfolded proteins for purposes of quality control destruction. Ubiquitin ligases, or E3 proteins, promote ubiquitination by effecting the specific transfer of ubiquitin from the correct ubiquitin-conjugating enzyme, or E2 protein, to the target substrate. Substrate specificity is usually determined by specific sequence determinants, or degrons, in the target substrate that are recognized by the ubiquitin ligase. In quality control, however, a potentially vast collection of proteins with characteristic hallmarks of misfolding or misassembly are targeted with high specificity despite the lack of any sequence similarity between substrates. In order to understand the mechanisms of quality control ubiquitination, we have focused our attention on the first characterized quality control ubiquitin ligase, the HRD complex, which is responsible for the endoplasmic reticulum (ER)-associated degradation (ERAD) of numerous ER-resident proteins. Using an in vivo cross-linking assay, we directly examined the association of the separate HRD complex components with various ERAD substrates. We have discovered that the HRD ubiquitin ligase complex associates with both ERAD substrates and stable proteins, but only mediates ubiquitin-conjugating enzyme association with ERAD substrates. Our studies with the sterol pathway-regulated ERAD substrate Hmg2p, an isozyme of the yeast cholesterol biosynthetic enzyme HMG-coenzyme A reductase (HMGR), indicated that the HRD complex discerns between a degradation-competent “misfolded” state and a stable, tightly folded state. Thus, it appears that the physiologically regulated, HRD-dependent degradation of HMGR is effected by a programmed structural transition from a stable protein to a quality control substrate.

Reducing reference bias using multiple population reference genomes

10.1101/2020.03.03.975219 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the “reference flow” alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance, but with 14% of the memory footprint and 5.5 times the speed.

Novel functional sequences uncovered through a bovine multiassembly graph

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2101056118 ◽

2021 ◽

Vol 118 (20) ◽

pp. e2101056118

Author(s):

Danang Crysnanto ◽

Alexander S. Leonard ◽

Zih-Hua Fang ◽

Hubert Pausch

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Bos Taurus ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Reference Allele ◽

Genomic Analyses ◽

Reference Quality ◽

Reference Genomes

Many genomic analyses start by aligning sequencing reads to a linear reference genome. However, linear reference genomes are imperfect: they lack millions of bases of unknown relevance and are unable to reflect the genetic diversity of populations. This makes reference-guided methods susceptible to reference-allele bias. To overcome such limitations, we build a pangenome from six reference-quality assemblies from taurine and indicine cattle as well as yak. The pangenome contains an additional 70,329,827 bases compared to the Bos taurus reference genome. Our multiassembly approach reveals 30 and 10.1 million bases private to yak and indicine cattle, respectively, and between 3.3 and 4.4 million bases unique to each taurine assembly. Utilizing transcriptomes from 56 cattle, we show that these nonreference sequences encode transcripts that hitherto remained undetected from the B. taurus reference genome. We uncover genes, primarily encoding proteins contributing to immune response and pathogen-mediated immunomodulation, differentially expressed between Mycobacterium bovis–infected and noninfected cattle that are also undetectable in the B. taurus reference genome. Using whole-genome sequencing data of cattle from five breeds, we show that reads which were previously misaligned against the Bos taurus reference genome now align accurately to the pangenome sequences. This enables us to discover 83,250 polymorphic sites that segregate within and between breeds of cattle and capture genetic differentiation across breeds. Our work makes a so-far unused source of variation amenable to genetic investigations and provides methods and a framework for establishing and exploiting a more diverse reference genome.

Effects of sodium houttuyfonate on transcriptome of Pseudomonas aeruginosa

BMC Research Notes ◽

10.1186/s13104-019-4721-2 ◽

2019 ◽

Vol 12 (1) ◽

Cited By ~ 1

Author(s):

Yeye Zhao ◽

Yuanqing Si ◽

Longfei Mei ◽

Jiadi Wu ◽

Jing Shao ◽

...

Keyword(s):

Pseudomonas Aeruginosa ◽

Quality Control ◽

Differentially Expressed Genes ◽

Reference Genome ◽

Negative Control ◽

Differentially Expressed ◽

RNA-Seq ◽

Sequencing Data ◽

Control Groups ◽

Data Description

Abstract Objectives: The purpose of this experiment is to analyze the changes in the transcriptome of Pseudomonas aeruginosa under the action of sodium houttuyfonate (SH), in order to reveal the possible mechanism by which SH inhibits P. aeruginosa. We analyzed these data to compare the transcriptomic differences of P. aeruginosa between SH treatment and blank control groups. Data description: In this project, RNA-seq on the BGISEQ-500 platform was used to sequence the transcriptome of P. aeruginosa, and sequencing data were generated for 8 samples of P. aeruginosa as follows: SH treatment (SH1, SH2, SH3, SH4) and negative control (Control 1, Control 2, Control 3, Control 4). Quality control was carried out on raw reads to determine whether the sequencing data were suitable for subsequent analysis. In total, 170.53 MB of transcriptome sequencing data was obtained. The filtered clean reads were then aligned to the reference genome for a second round of quality control. After completion, 5938 genes were assembled from the sequencing data. Further quantitative analysis of genes and screening of differentially expressed genes based on gene expression level revealed 2047 significantly differentially expressed genes under SH treatment, including 368 up-regulated genes and 1679 down-regulated genes.

Efficient detection of transposable element insertion polymorphisms between genomes using short-read sequencing data

10.1101/2020.06.09.142331 ◽

2020 ◽

Author(s):

P. Baduel ◽

L. Quadrana ◽

V. Colot

Keyword(s):

High Specificity ◽

Major Effect ◽

Sequencing Data ◽

Low Frequencies ◽

Short Read ◽

Short Read Sequencing ◽

Efficient Detection ◽

Specificity And Sensitivity ◽

A Minor ◽

Reference Genomes

Abstract Transposable elements (TEs) are powerful generators of major-effect mutations, most of which are deleterious at the species level and maintained at very low frequencies within populations. As reference genomes can only capture a minor fraction of such variants, methods were developed to detect TE insertion polymorphisms (TIPs) in non-reference genomes from short-read sequencing data, which are becoming increasingly available. We present here a bioinformatic framework combining an improved version of the SPLITREADER and TEPID pipelines to detect non-reference TE presence and reference TE absence variants, respectively. We benchmark our method on ten non-reference Arabidopsis thaliana genomes and demonstrate its high specificity and sensitivity in the detection of TIPs between genomes.

Reference genome and annotation updates lead to contradictory prognostic predictions in gene expression signatures: a case study of resected stage I lung adenocarcinoma

Briefings in Bioinformatics ◽

10.1093/bib/bbaa081 ◽

2020 ◽

Author(s):

Zheyang Zhang ◽

Sainan Zhang ◽

Xin Li ◽

Zhangxiang Zhao ◽

Changjing Chen ◽

...

Keyword(s):

Lung Adenocarcinoma ◽

RNA Sequencing ◽

Reference Genome ◽

Low Cost ◽

Risk Classification ◽

Stage I ◽

The Cancer Genome Atlas ◽

Sequencing Data ◽

Public Data ◽

Reference Genomes

Abstract RNA-sequencing enables accurate and low-cost transcriptome-wide detection. However, expression estimates vary as reference genomes and gene annotations are updated, confounding existing expression-based prognostic signatures. Herein, a prognostic 9-gene pair signature (GPS) was applied to 197 patients with stage I lung adenocarcinoma, using the previous and the latest data from The Cancer Genome Atlas (TCGA), which were processed with different reference genomes and annotations. For 9-GPS, 6.6% of patients exhibited discordant risk classifications between the two TCGA versions. Similar results were observed for other prognostic signatures, including IRGPI, 15-gene and ORACLE. We found that conflicting annotations for gene length and overlap were the major cause of their discordant risk classification. Therefore, we constructed a prognostic 40-GPS based on stable genes across GENCODE v20–v30 and validated it using public data of 471 stage I samples (log-rank P < 0.0010). Risk classification was still stable in RNA-sequencing data processed with the newest GENCODE v32 versus GENCODE v20–v30. Specifically, 40-GPS could predict survival for 30 stage I samples with formalin-fixed paraffin-embedded tissues (log-rank P = 0.0177). In conclusion, this method overcomes the vulnerability of existing prognostic signatures due to reference genome and annotation updates. 40-GPS may offer individualized clinical applications due to its prognostic accuracy and classification stability.
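
The discordance figure quoted above is a simple proportion, and a short sketch makes the arithmetic concrete. The function name `discordance_rate` and the label vectors are hypothetical; the labels are synthetic and only chosen so that 13 of 197 differ, which is the count that rounds to the reported 6.6%.

```python
from typing import Sequence


def discordance_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of samples whose risk label differs between two data versions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if not labels_a:
        raise ValueError("label vectors must not be empty")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Synthetic labels for 197 patients; the first 13 flip between versions.
old_version = ["high"] * 100 + ["low"] * 97
new_version = ["low"] * 13 + ["high"] * 87 + ["low"] * 97
rate = discordance_rate(old_version, new_version)
percent = round(rate * 100, 1)
```

With 197 patients, 12 discordant cases would give about 6.1% and 14 would give about 7.1%, so 6.6% corresponds to 13 patients.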

Accurate sequence variant genotyping in cattle using variation-aware genome graphs

10.1101/460345 ◽

2018 ◽

Cited By ~ 1

Author(s):

Danang Crysnanto ◽

Christine Wurmser ◽

Hubert Pausch

Keyword(s):

DNA Sequences ◽

Reference Genome ◽

Sequence Data ◽

Sequence Variant ◽

Sequence Variants ◽

Sequencing Data ◽

Reference Allele ◽

Reference Genomes ◽

Genome Graph ◽

Genotype Concordance

Background: The genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of sequence variation within a species, reference allele bias may occur at highly polymorphic or diverged regions of the genome. Graph-based methods facilitate the comparison of sequencing reads to a variation-aware genome graph that incorporates non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely used methods that rely on linear reference genomes, i.e., GATK and SAMtools, using whole-genome sequencing data of 49 Original Braunvieh cattle. Results: We discovered 21,140,196, 20,262,913 and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the lowest number of Mendelian inconsistencies for both SNPs and indels in nine sire-son pairs with sequence data. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all tools evaluated, particularly for animals that had been sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved the genotype concordance slightly but also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but fewer than GATK.
Conclusions: Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementations of state-of-the-art methods that rely on linear reference genomes.
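
Two of the metrics named in the results can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: genotypes are encoded as alternate-allele counts (0, 1, 2) with `None` for missing calls, and non-reference discrepancy follows the common convention of ignoring sites where both calls are homozygous reference. The function names and the example genotypes are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None if missing


def _paired(
    seq_gt: Sequence[Genotype], array_gt: Sequence[Genotype]
) -> List[Tuple[int, int]]:
    """Keep only sites where both genotypes are called."""
    return [(s, a) for s, a in zip(seq_gt, array_gt) if s is not None and a is not None]


def genotype_concordance(seq_gt: Sequence[Genotype], array_gt: Sequence[Genotype]) -> float:
    """Fraction of jointly called sites with identical genotypes."""
    pairs = _paired(seq_gt, array_gt)
    if not pairs:
        raise ValueError("no jointly called sites")
    return sum(s == a for s, a in pairs) / len(pairs)


def non_reference_discrepancy(seq_gt: Sequence[Genotype], array_gt: Sequence[Genotype]) -> float:
    """Fraction of discordant sites after dropping hom-ref/hom-ref pairs
    (assumed definition; check it against the evaluation tool in use)."""
    pairs = [(s, a) for s, a in _paired(seq_gt, array_gt) if not (s == 0 and a == 0)]
    if not pairs:
        raise ValueError("no sites with a non-reference call")
    return sum(s != a for s, a in pairs) / len(pairs)


# Synthetic genotypes for eight sites, not taken from the paper.
seq_calls = [0, 1, 2, 1, None, 0, 2, 0]
array_calls = [0, 1, 2, 2, 1, 0, 2, 1]
concordance = genotype_concordance(seq_calls, array_calls)
discrepancy = non_reference_discrepancy(seq_calls, array_calls)
```

In this example seven sites are jointly called and five agree, so concordance is 5/7; dropping the two hom-ref matches leaves five sites with two disagreements, so the discrepancy is 0.4.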

Host-pathogen dynamics in longitudinal clinical specimens from patients with COVID-19

10.1101/2021.04.27.21256149 ◽

2021 ◽

Author(s):

Michelle J. Lin ◽

Victoria M. Rachleff ◽

Hong Xie ◽

Lasata Shrestha ◽

Nicole A.P. Lieberman ◽

...

Keyword(s):

Bacterial Species ◽

Viral Evolution ◽

Sequencing Data ◽

Low Frequencies ◽

Consensus Sequences ◽

Viral Genomes ◽

Viral Loads ◽

Public Repositories ◽

Over Time

Abstract Background: Rapid dissemination of SARS-CoV-2 sequencing data to public repositories has enabled widespread study of viral genomes, but studies of longitudinal specimens from infected persons are relatively limited. Analysis of longitudinal specimens enables understanding of how host immune pressures drive viral evolution in vivo. Methods and findings: Here we performed sequencing of 49 longitudinal SARS-CoV-2-positive samples from 20 patients in Washington State collected between March and September of 2020. Viral loads declined over time with an average increase in RT-PCR cycle threshold (Ct) of 0.87 per day. We found that there was negligible change in SARS-CoV-2 consensus sequences over time, but identified a number of nonsynonymous variants at low frequencies across the genome. We observed enrichment for a relatively small number of these variants, all of which are now seen in consensus genomes across the globe at low prevalence. In one patient, we saw rapid emergence of various low-level deletion variants at the N-terminal domain of the spike glycoprotein, some of which have previously been shown to be associated with reduced neutralization potency from sera. In a subset of samples that were sequenced using metagenomic methods, differential gene expression analysis showed a downregulation of cytoskeletal genes that was consistent with a loss of ciliated epithelium during infection and recovery. We also identified co-occurrence of bacterial species in samples from multiple hospitalized individuals. Conclusions: These results demonstrate that the intrahost genetic composition of SARS-CoV-2 is dynamic during the course of COVID-19, and highlight the need for continued surveillance and deep sequencing of minor variants.
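
The reported Ct slope can be translated into an approximate fold change in viral template. The sketch below assumes idealized PCR in which each cycle multiplies the template by (1 + efficiency), with efficiency 1.0 meaning perfect doubling; that assumption, the function name `fold_decrease_per_day`, and its `efficiency` parameter are not from the abstract.

```python
def fold_decrease_per_day(ct_increase_per_day: float, efficiency: float = 1.0) -> float:
    """Fold decrease in template per day implied by a daily Ct increase.

    Assumes each PCR cycle multiplies the template by (1 + efficiency),
    so a Ct shift of d cycles corresponds to a (1 + efficiency) ** d fold change.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return (1.0 + efficiency) ** ct_increase_per_day


# Daily Ct increase reported in the abstract, under perfect doubling.
fold = fold_decrease_per_day(0.87)
```

Under perfect doubling, a Ct increase of 0.87 per day corresponds to roughly a 1.8-fold drop in template per day; lower amplification efficiency would imply a smaller fold change.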
