V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data

Author(s):  
Susana Posada-Céspedes ◽  
David Seifert ◽  
Ivan Topolsky ◽  
Karin J. Metzner ◽  
Niko Beerenwinkel

Abstract High-throughput sequencing technologies are used increasingly, not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence, and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations. To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. V-pipe is freely available at https://github.com/cbg-ethz/V-pipe.

2021 ◽  
Author(s):  
Simone Marini ◽  
Rodrigo Mora ◽  
Christina Boucher ◽  
Noelle Noyes ◽  
Mattia Prosperi

Antimicrobial resistance (AMR) is a growing threat to public health and farming at large. Without appropriate interventions, it can lead to millions of deaths per year and substantial economic loss worldwide. In clinical and veterinary practice, a timely characterization of the antibiotic susceptibility profile of bacterial infections is a crucial step in optimizing treatment. Fast turnaround of AMR testing is also needed in food safety and infection control surveillance (e.g., contamination of healthcare or long-term nursing facilities). High-throughput sequencing is a promising option for clinical point-of-care and ecological surveillance, opening the opportunity to develop genotyping-based AMR determination as a possibly faster alternative to phenotypic testing. In the present work, we compare the performance of state-of-the-art methods for detection of AMR from high-throughput sequencing data in healthcare settings. We consider five complementary computational approaches: alignment (AMRPlusPlus), deep learning (DeepARG), k-mer genomic signatures (KARGA, ResFinder), and hidden Markov models (Meta-MARC). We use an extensive collection of clinical studies never employed for model training. To do so, we assemble data from multiple, independent AMR high-throughput sequencing experiments collected in a variety of hospital settings, comprising 585 isolates with available AMR profiles determined by phenotypic tests across nine antibiotic classes. We show that the prediction landscape of AMR classifiers is highly heterogeneous, with balanced accuracy varying from 0.4 to 0.92. Although some algorithms (ResFinder, KARGA, and AMRPlusPlus) exhibit overall better balanced accuracy than others, the high per-AMR-class variance and related findings suggest that: (1) all algorithms might be subject to sampling bias present both in data repositories used for training and in experimental/clinical settings; and (2) a portion of clinical samples might contain uncharacterized AMR genes to which the algorithms, mostly trained on known AMR genes, fail to generalize. These results lead us to formulate practical advice for software configuration and application, as well as give suggestions for future study design to further develop AMR prediction tools from proof-of-concept to bedside.
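For readers unfamiliar with the metric quoted above, balanced accuracy for a binary resistant/susceptible call is commonly defined as the mean of sensitivity and specificity. The following is a minimal sketch of that standard definition; the function name and the toy labels are illustrative assumptions and are not taken from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = resistant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of resistant isolates detected
    specificity = tn / (tn + fp)  # fraction of susceptible isolates cleared
    return (sensitivity + specificity) / 2


# Toy example (hypothetical labels): phenotypic truth vs. genotypic prediction.
phenotype = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(phenotype, predicted)
```

Because the metric averages the per-class rates, a classifier that always predicts the majority class scores 0.5 regardless of class imbalance, which is why it suits isolate collections with uneven resistance prevalence.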


2014 ◽  
Vol 2014 ◽  
pp. 1-12 ◽  
Author(s):  
Preston Leung ◽  
Rowena Bull ◽  
Andrew Lloyd ◽  
Fabio Luciani

Rapidly mutating viruses, such as hepatitis C virus (HCV) and HIV, have adopted evolutionary strategies that allow escape from the host immune response via genomic mutations. Recent advances in high-throughput sequencing are reshaping the field of immuno-virology of viral infections, as these allow fast and cheap generation of genomic data. However, due to the large volumes of data generated, a thorough understanding of the biological and immunological significance of such information is often difficult. This paper proposes a pipeline that allows visualization and statistical analysis of viral mutations that are associated with immune escape. Taking next generation sequencing data from longitudinal analysis of HCV viral genomes during a single HCV infection, along with antigen specific T-cell responses detected from the same subject, we demonstrate the applicability of these tools in the context of primary HCV infection. We provide a statistical and visual explanation of the relationship between cooccurring mutations on the viral genome and the parallel adaptive immune response against HCV.


2018 ◽  
Vol 70 (4) ◽  
pp. 1069-1076 ◽  
Author(s):  
Aleksia Vaattovaara ◽  
Johanna Leppälä ◽  
Jarkko Salojärvi ◽  
Michael Wrzaczek

2021 ◽  
Vol 99 (2) ◽  
Author(s):  
Yuhua Fu ◽  
Pengyu Fan ◽  
Lu Wang ◽  
Ziqiang Shu ◽  
Shilin Zhu ◽  
...  

Abstract Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation for known miRNAs and identify novel miRNAs that have not been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic due to redundant/missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, which was twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs by using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. The correlation-based strategy also identified the regulatory pairs that were controlled by nonbinding sites with a particular pattern, which provided abundant complementarity for studying the mechanism of miRNAs that regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs by using massive public data. This large data-based strategy is also applicable for other nonmodel organisms with incomplete annotation information.


2021 ◽  
Author(s):  
H. Serhat Tetikol ◽  
Kubra Narci ◽  
Deniz Turgut ◽  
Gungor Budak ◽  
Ozem Kalay ◽  
...  

ABSTRACT Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on the whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read mapping errors and increased variant calling sensitivity, and can provide the improvements of joint variant calling without the need for computationally intensive post-processing steps.


MycoKeys ◽  
2018 ◽  
Vol 39 ◽  
pp. 29-40 ◽  
Author(s):  
Sten Anslan ◽  
R. Henrik Nilsson ◽  
Christian Wurzbacher ◽  
Petr Baldrian ◽  
Leho Tedersoo ◽  
...  

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


Viruses ◽  
2019 ◽  
Vol 11 (9) ◽  
pp. 806
Author(s):  
Shambhu G. Aralaguppe ◽  
Anoop T. Ambikan ◽  
Manickam Ashokkumar ◽  
Milner M. Kumar ◽  
Luke Elizabeth Hanna ◽  
...  

The detection of drug resistance mutations (DRMs) in minor viral populations is of potential clinical importance. However, sophisticated computational infrastructure and competence for analysis of high-throughput sequencing (HTS) data are lacking at most diagnostic laboratories. Thus, we have proposed a new pipeline, MiDRMpol, to quantify DRMs from the HIV-1 pol region. The gag-vpu region of 87 plasma samples from HIV-infected individuals from three cohorts was amplified and sequenced by Illumina HiSeq2500. The sequence reads were adapter-trimmed, followed by analysis using in-house scripts. Samples from Swedish and Ethiopian cohorts were also sequenced by Sanger sequencing. The pipeline was validated against the online tool PASeq (Polymorphism Analysis by Sequencing). Based on an error rate of <1%, a value of >1% was set as reliable to consider a minor variant. Both pipelines detected the mutations in the dominant viral populations, while discrepancies were observed in minor viral populations. In five HIV-1 subtype C samples, minor mutations were detected at the <5% level by MiDRMpol but not by PASeq. MiDRMpol is a computationally efficient as well as labor-efficient bioinformatics pipeline for the detection of DRMs from HTS data. It identifies minor viral populations (<20%) of DRMs. Our method can be incorporated into large-scale surveillance of HIV-1 DRMs.
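The frequency cutoffs mentioned in this abstract (variants above 1% treated as reliable, minor populations below 20%) can be illustrated with a small sketch. This is not the MiDRMpol implementation: the function name, the category labels, and the handling of values exactly at a boundary are assumptions made purely for illustration.

```python
def classify_variant(alt_reads, depth, error_cutoff=0.01, minor_cutoff=0.20):
    """Label a variant by its read frequency at one position.

    error_cutoff and minor_cutoff mirror the 1% and 20% figures in the
    abstract; treating the boundaries as strict is an assumption here.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    frequency = alt_reads / depth
    if frequency <= error_cutoff:
        return "below_threshold"  # indistinguishable from sequencing error
    if frequency < minor_cutoff:
        return "minor"            # reliable, but a minority population
    return "dominant"


# Hypothetical read counts at a single drug-resistance position.
labels = [classify_variant(n, 1000) for n in (5, 30, 400)]
```

In practice the reliable cutoff would be tied to the empirically measured error rate of the sequencing run, so both thresholds are exposed as parameters instead of being hard-coded.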

