V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data

High-throughput sequencing technologies are used increasingly, not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence, and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations. To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. V-pipe is freely available at https://github.com/cbg-ethz/V-pipe.


The discrepancy among single nucleotide variants detected by DNA and RNA high throughput sequencing data

BMC Genomics ◽

10.1186/s12864-017-4022-x ◽

2017 ◽

Vol 18 (S6) ◽

Cited By ~ 16

Author(s):

Yan Guo ◽

Shilin Zhao ◽

Quanhu Sheng ◽

David C Samuels ◽

Yu Shyr

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Dna And Rna ◽

High Throughput Sequencing Data


Towards routine employment of computational tools for antimicrobial resistance determination via high-throughput sequencing

10.1101/2021.11.03.467126 ◽

2021 ◽

Author(s):

Simone Marini ◽

Rodrigo Mora ◽

Christina Boucher ◽

Noelle Noyes ◽

Mattia Prosperi

Keyword(s):

Antimicrobial Resistance ◽

High Throughput ◽

Bacterial Infections ◽

High Throughput Sequencing ◽

Markov Models ◽

Point Of Care ◽

Economic Loss ◽

Clinical Samples ◽

Sequencing Data ◽

Data Repositories

Antimicrobial resistance (AMR) is a growing threat to public health and farming at large. Without appropriate interventions, it can lead to millions of deaths per year and substantial economic loss worldwide. In clinical and veterinary practice, a timely characterization of the antibiotic susceptibility profile of bacterial infections is a crucial step in optimizing treatment. Fast turnaround of AMR testing is also needed in food safety and infection control surveillance (e.g., contamination of healthcare or long-term nursing facilities). High-throughput sequencing is a promising option for clinical point-of-care and ecological surveillance, opening the opportunity to develop genotyping-based AMR determination as a possibly faster alternative to phenotypic testing. In the present work, we compare the performance of state-of-the-art methods for detection of AMR from high-throughput sequencing data in healthcare settings. We consider five complementary computational approaches: alignment (AMRPlusPlus), deep learning (DeepARG), k-mer genomic signatures (KARGA, ResFinder), and hidden Markov models (Meta-MARC). We use an extensive collection of clinical studies never employed for model training. To do so, we assemble data from multiple, independent AMR high-throughput sequencing experiments collected in a variety of hospital settings, comprising 585 isolates with available AMR profiles determined by phenotypic tests across nine antibiotic classes. We show that the prediction landscape of AMR classifiers is highly heterogeneous, with balanced accuracy varying from 0.4 to 0.92. Although some algorithms (ResFinder, KARGA, and AMRPlusPlus) exhibit overall better balanced accuracy than others, the high per-AMR-class variance and related findings suggest that: (1) all algorithms might be subject to sampling bias present both in data repositories used for training and in experimental/clinical settings; and (2) a portion of clinical samples might contain uncharacterized AMR genes to which the algorithms, mostly trained on known AMR genes, fail to generalize. These results lead us to formulate practical advice for software configuration and application, as well as to give suggestions for future study design to further develop AMR prediction tools from proof-of-concept to bedside.
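The abstract above reports classifier performance as balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch using the standard binary definition (the mean of sensitivity and specificity). The function name and the toy labels are illustrative assumptions; they are not taken from the study, and the study's exact evaluation procedure is not described here.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = resistant, 0 = susceptible)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of resistant isolates detected
    specificity = tn / (tn + fp)  # fraction of susceptible isolates correctly cleared
    return (sensitivity + specificity) / 2


# Hypothetical toy data: four resistant and two susceptible isolates.
phenotype = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(phenotype, predicted)
```

Because each class contributes equally regardless of its size, a classifier that always predicts the majority class scores 0.5 under this metric, which is why it is often preferred to plain accuracy when resistant and susceptible isolates are imbalanced.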


A Bioinformatics Pipeline for the Analyses of Viral Escape Dynamics and Host Immune Responses during an Infection

BioMed Research International ◽

10.1155/2014/264519 ◽

2014 ◽

Vol 2014 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Preston Leung ◽

Rowena Bull ◽

Andrew Lloyd ◽

Fabio Luciani

Keyword(s):

Immune Response ◽

High Throughput Sequencing ◽

Viral Infections ◽

Immune Escape ◽

Hcv Infection ◽

Viral Escape ◽

Evolutionary Strategies ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Bioinformatics Pipeline

Rapidly mutating viruses, such as hepatitis C virus (HCV) and HIV, have adopted evolutionary strategies that allow escape from the host immune response via genomic mutations. Recent advances in high-throughput sequencing are reshaping the field of immuno-virology of viral infections, as these allow fast and cheap generation of genomic data. However, due to the large volumes of data generated, a thorough understanding of the biological and immunological significance of such information is often difficult. This paper proposes a pipeline that allows visualization and statistical analysis of viral mutations that are associated with immune escape. Taking next-generation sequencing data from longitudinal analysis of HCV viral genomes during a single HCV infection, along with antigen-specific T-cell responses detected from the same subject, we demonstrate the applicability of these tools in the context of primary HCV infection. We provide a statistical and visual explanation of the relationship between co-occurring mutations on the viral genome and the parallel adaptive immune response against HCV.


High-throughput sequencing data and the impact of plant gene annotation quality

Journal of Experimental Botany ◽

10.1093/jxb/ery434 ◽

2018 ◽

Vol 70 (4) ◽

pp. 1069-1076 ◽

Cited By ~ 4

Author(s):

Aleksia Vaattovaara ◽

Johanna Leppälä ◽

Jarkko Salojärvi ◽

Michael Wrzaczek

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Gene Annotation ◽

Sequencing Data ◽

Plant Gene ◽

High Throughput Sequencing Data ◽

The Impact ◽

Annotation Quality


Faculty Opinions recommendation of Coalescent Inference Using Serially Sampled, High-Throughput Sequencing Data from Intrahost HIV Infection.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.726132071.793531014 ◽

2017 ◽

Author(s):

Sarah Rowland-Jones ◽

Sophie Andrews

Keyword(s):

Hiv Infection ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data


BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution

Bioinformatics ◽

10.1093/bioinformatics/btu010 ◽

2014 ◽

Vol 30 (9) ◽

pp. 1214-1219 ◽

Cited By ~ 6

Author(s):

C. Ye ◽

C. Hsiao ◽

H. Corrada Bravo

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Blind Deconvolution ◽

Sequencing Data ◽

Base Calling ◽

High Throughput Sequencing Data


Improvement, identification, and target prediction for miRNAs in the porcine genome by using massive, public high-throughput sequencing data

Journal of Animal Science ◽

10.1093/jas/skab018 ◽

2021 ◽

Vol 99 (2) ◽

Author(s):

Yuhua Fu ◽

Pengyu Fan ◽

Lu Wang ◽

Ziqiang Shu ◽

Shilin Zhu ◽

...

Keyword(s):

High Throughput Sequencing ◽

Target Genes ◽

Target Prediction ◽

Large Data ◽

Sequencing Data ◽

Regulate Gene Expression ◽

High Throughput Sequencing Data ◽

Annotation Information ◽

Public Data ◽

Broad Variety

Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation for known miRNAs and identify novel miRNAs that have not been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic due to redundant/missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, which was twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs by using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. The correlation-based strategy also identified the regulatory pairs that were controlled by nonbinding sites with a particular pattern, which provided abundant complementarity for studying the mechanism of miRNAs that regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs by using massive public data. This large data-based strategy is also applicable for other nonmodel organisms with incomplete annotation information.


Population-specific genome graphs improve high-throughput sequencing data analysis: A case study on the Pan-African genome

10.1101/2021.03.19.436173 ◽

2021 ◽

Author(s):

H. Serhat Tetikol ◽

Kubra Narci ◽

Deniz Turgut ◽

Gungor Budak ◽

Ozem Kalay ◽

...

Keyword(s):

High Throughput Sequencing ◽

Information Overload ◽

African Ancestry ◽

Sample Selection ◽

Variant Calling ◽

Population Diversity ◽

Human Populations ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Graph Augmentation

Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on the whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read mapping errors, increased variant calling sensitivity and provide the improvements of joint variant calling without the need of computationally intensive post-processing steps.


Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

MycoKeys ◽

10.3897/mycokeys.39.28109 ◽

2018 ◽

Vol 39 ◽

pp. 29-40 ◽

Cited By ~ 21

Author(s):

Sten Anslan ◽

R. Henrik Nilsson ◽

Christian Wurzbacher ◽

Petr Baldrian ◽

Leho Tedersoo ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Computation Time ◽

Potential Effect ◽

Data Sets ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

High Throughput Sequencing Data ◽

Recent Developments

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units (OTUs), although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


MiDRMpol: A High-Throughput Multiplexed Amplicon Sequencing Workflow to Quantify HIV-1 Drug Resistance Mutations against Protease, Reverse Transcriptase, and Integrase Inhibitors

Viruses ◽

10.3390/v11090806 ◽

2019 ◽

Vol 11 (9) ◽

pp. 806

Author(s):

Shambhu G. Aralaguppe ◽

Anoop T. Ambikan ◽

Manickam Ashokkumar ◽

Milner M. Kumar ◽

Luke Elizabeth Hanna ◽

...

Keyword(s):

Drug Resistance ◽

High Throughput ◽

Large Scale ◽

High Throughput Sequencing ◽

Polymorphism Analysis ◽

Resistance Mutations ◽

Subtype C ◽

Bioinformatics Pipeline ◽

Drug Resistance Mutations ◽

Hiv 1

The detection of drug resistance mutations (DRMs) in minor viral populations is of potential clinical importance. However, sophisticated computational infrastructure and competence for analysis of high-throughput sequencing (HTS) data are lacking at most diagnostic laboratories. Thus, we have proposed a new pipeline, MiDRMpol, to quantify DRMs from the HIV-1 pol region. The gag-vpu region of 87 plasma samples from HIV-infected individuals from three cohorts was amplified and sequenced by Illumina HiSeq2500. The sequence reads were adapter-trimmed, followed by analysis using in-house scripts. Samples from Swedish and Ethiopian cohorts were also sequenced by Sanger sequencing. The pipeline was validated against the online tool PASeq (Polymorphism Analysis by Sequencing). Based on an error rate of <1%, a value of >1% was set as reliable to consider a minor variant. Both pipelines detected the mutations in the dominant viral populations, while discrepancies were observed in minor viral populations. In five HIV-1 subtype C samples, minor mutations were detected at the <5% level by MiDRMpol but not by PASeq. MiDRMpol is a computationally efficient and labor-efficient bioinformatics pipeline for the detection of DRMs from HTS data. It identifies minor viral populations (<20%) of DRMs. Our method can be incorporated into large-scale surveillance of HIV-1 DRMs.
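The frequency cut-offs quoted in the abstract above (variants above 1% are considered reliable, and minor populations are those below 20%) can be illustrated with a minimal sketch. This is not MiDRMpol's code: the function name, the input format (per-position base counts) and the example counts are hypothetical, and only the two thresholds come from the abstract.

```python
def minor_variants(base_counts, min_freq=0.01, max_freq=0.20):
    """Return (base, frequency) pairs for non-consensus bases whose frequency
    lies strictly between min_freq and max_freq at a single position.

    base_counts: dict mapping a base to its read count (hypothetical input format).
    """
    depth = sum(base_counts.values())
    if depth == 0:
        return []
    # Treat the most frequent base as the consensus (dominant population).
    consensus = max(base_counts, key=base_counts.get)
    calls = []
    for base, count in sorted(base_counts.items()):
        if base == consensus:
            continue
        freq = count / depth
        # Below min_freq a call is indistinguishable from sequencing error.
        if min_freq < freq < max_freq:
            calls.append((base, freq))
    return calls


# Hypothetical counts at one position with a read depth of 1000.
counts = {"A": 970, "G": 25, "T": 5}
calls = minor_variants(counts)
```

Here G (2.5%) passes the reliability threshold, whereas T (0.5%) falls below the assumed error rate and is discarded. A real pipeline would also need to account for base quality, strand bias and coverage, none of which this sketch models.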
