A Bioinformatics Pipeline for the Analyses of Viral Escape Dynamics and Host Immune Responses during an Infection

Rapidly mutating viruses, such as hepatitis C virus (HCV) and HIV, have adopted evolutionary strategies that allow escape from the host immune response via genomic mutations. Recent advances in high-throughput sequencing are reshaping the field of immuno-virology of viral infections, as these allow fast and cheap generation of genomic data. However, due to the large volumes of data generated, a thorough understanding of the biological and immunological significance of such information is often difficult. This paper proposes a pipeline that allows visualization and statistical analysis of viral mutations that are associated with immune escape. Taking next generation sequencing data from longitudinal analysis of HCV viral genomes during a single HCV infection, along with antigen specific T-cell responses detected from the same subject, we demonstrate the applicability of these tools in the context of primary HCV infection. We provide a statistical and visual explanation of the relationship between cooccurring mutations on the viral genome and the parallel adaptive immune response against HCV.

Download Full-text

V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data

10.1101/2020.06.09.142919 ◽

2020 ◽

Cited By ~ 4

Author(s):

Susana Posada-Céspedes ◽

David Seifert ◽

Ivan Topolsky ◽

Karin J. Metzner ◽

Niko Beerenwinkel

Keyword(s):

Genetic Diversity ◽

High Throughput ◽

High Throughput Sequencing ◽

Viral Infections ◽

Markov Models ◽

Large Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Bioinformatics Pipeline ◽

The Impact

AbstractHigh-throughput sequencing technologies are used increasingly, not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence, and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations. To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. V-pipe is freely available at https://github.com/cbg-ethz/V-pipe.

Download Full-text

Population-specific genome graphs improve high-throughput sequencing data analysis: A case study on the Pan-African genome

10.1101/2021.03.19.436173 ◽

2021 ◽

Author(s):

H. Serhat Tetikol ◽

Kubra Narci ◽

Deniz Turgut ◽

Gungor Budak ◽

Ozem Kalay ◽

...

Keyword(s):

High Throughput Sequencing ◽

Information Overload ◽

African Ancestry ◽

Sample Selection ◽

Variant Calling ◽

Population Diversity ◽

Human Populations ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Graph Augmentation

ABSTRACTGraph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on the whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read mapping errors, increased variant calling sensitivity and provide the improvements of joint variant calling without the need of computationally intensive post-processing steps.

Download Full-text

Adaptive Immune Response against Hepatitis C Virus

International Journal of Molecular Sciences ◽

10.3390/ijms21165644 ◽

2020 ◽

Vol 21 (16) ◽

pp. 5644

Author(s):

Janine Kemming ◽

Robert Thimme ◽

Christoph Neumann-Haefelin

Keyword(s):

Immune Response ◽

Hepatitis C Virus ◽

Hepatitis C ◽

Persistent Infection ◽

Adaptive Immune Response ◽

Hcv Infection ◽

Viral Escape ◽

Continuous Exposure ◽

Adaptive Immune ◽

Adaptive Immune Cells

A functional adaptive immune response is the major determinant for clearance of hepatitis C virus (HCV) infection. However, in the majority of patients, this response fails and persistent infection evolves. Here, we dissect the HCV-specific key players of adaptive immunity, namely B cells and T cells, and describe factors that affect infection outcome. Once chronic infection is established, continuous exposure to HCV antigens affects functionality, phenotype, transcriptional program, metabolism, and the epigenetics of the adaptive immune cells. In addition, viral escape mutations contribute to the failure of adaptive antiviral immunity. Direct-acting antivirals (DAA) can mediate HCV clearance in almost all patients with chronic HCV infection, however, defects in adaptive immune cell populations remain, only limited functional memory is obtained and reinfection of cured individuals is possible. Thus, to avoid potential reinfection and achieve global elimination of HCV infections, a prophylactic vaccine is needed. Recent vaccine trials could induce HCV-specific immunity but failed to protect from persistent infection. Thus, lessons from natural protection from persistent infection, DAA-mediated cure, and non-protective vaccination trials might lead the way to successful vaccination strategies in the future.

Download Full-text

InteractomeSeq: a web server for the identification and profiling of domains and epitopes from phage display and next generation sequencing data

Nucleic Acids Research ◽

10.1093/nar/gkaa363 ◽

2020 ◽

Vol 48 (W1) ◽

pp. W200-W207

Author(s):

Simone Puccio ◽

Giorgio Grillo ◽

Arianna Consiglio ◽

Maria Felicia Soluri ◽

Daniele Sblattero ◽

...

Keyword(s):

Phage Display ◽

Large Scale ◽

High Throughput Sequencing ◽

Gene Annotation ◽

Web Server ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Phage Display Technology ◽

Essential Information ◽

Research Fields

Abstract High-Throughput Sequencing technologies are transforming many research fields, including the analysis of phage display libraries. The phage display technology coupled with deep sequencing was introduced more than a decade ago and holds the potential to circumvent the traditional laborious picking and testing of individual phage rescued clones. However, from a bioinformatics point of view, the analysis of this kind of data was always performed by adapting tools designed for other purposes, thus not considering the noise background typical of the ‘interactome sequencing’ approach and the heterogeneity of the data. InteractomeSeq is a web server allowing data analysis of protein domains (‘domainome’) or epitopes (‘epitome’) from either Eukaryotic or Prokaryotic genomic phage libraries generated and selected by following an Interactome sequencing approach. InteractomeSeq allows users to upload raw sequencing data and to obtain an accurate characterization of domainome/epitome profiles after setting the parameters required to tune the analysis. The release of this tool is relevant for the scientific and clinical community, because InteractomeSeq will fill an existing gap in the field of large-scale biomarkers profiling, reverse vaccinology, and structural/functional studies, thus contributing essential information for gene annotation or antigen identification. InteractomeSeq is freely available at https://InteractomeSeq.ba.itb.cnr.it/

Download Full-text

Higher Order Restrictions of the Immunoglobulin Repertoire in CLL: The Illustrative Case of Stereotyped Subsets #2 and #169

Blood ◽

10.1182/blood-2019-128017 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 5453-5453

Author(s):

Katerina Gemenetzi ◽

Andreas Agathangelidis ◽

Fotis Psomopoulos ◽

Karla Plevova ◽

Lesley-Ann Sutton ◽

...

Keyword(s):

Research Funding ◽

High Throughput Sequencing ◽

Somatic Hypermutation ◽

Illustrative Case ◽

Gene Rearrangements ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Mean Frequency ◽

Cdr3 Sequence ◽

Sample Range

Stereotyped subset #2 (IGHV3-21/IGLV3-21) is the largest subset in CLL (~3% of all patients). Membership in subset #2 is clinically relevant since these patients experience an aggressive disease irrespective of the somatic hypermutation (SHM) status of the clonotypic immunoglobulin heavy variable (IGHV) gene. Low-throughput evidence suggests that stereotyped subset #169, a minor CLL subset (~0.2% of all CLL), resembles subset #2 at the immunogenetic level. More specifically: (i) the clonotypic heavy chain (HC) of subset #169 is encoded by the IGHV3-48 gene which is closely related to the IGHV3-21 gene; (ii) both subsets carry VH CDR3s comprising 9-amino acids (aa) with a conserved aspartic acid (D) at VH CDR3 position 3; (iii) both subsets bear light chains (LC) encoded by the IGLV3-21 gene with a restricted VL CDR3; and, (iv) both subsets have borderline SHM status. Here we comprehensively assessed the ontogenetic relationship between CLL subsets #2 and #169 by analyzing their immunogenetic signatures. Utilizing next-generation sequencing (NGS) we studied the HC and LC gene rearrangements of 6 subset #169 patients and 20 subset #2 cases. In brief, IGHV-IGHD-IGHJ and IGLV-IGLJ gene rearrangements were RT-PCR amplified using subgroup-specific leader primers as well as IGHJ and IGLC primers, respectively. Libraries were sequenced on the MiSeq Illumina instrument. IG sequence annotation was performed with IMGT/HighV-QUEST and metadata analysis conducted using an in-house, validated bioinformatics pipeline. Rearrangements with identical CDR3 aa sequences were herein defined as clonotypes, whereas clonotypes with different aa substitutions within the V-domain were defined as subclones. For the HC analysis of subset #169, we obtained 894,849 productive sequences (mean: 127,836, range: 87,509-208,019). On average, each analyzed sample carried 54 clonotypes (range: 44-68); the dominant clonotype had a mean frequency of 99.1% (range: 98.8-99.2%) and displayed considerable intraclonal heterogeneity with a mean of 2,641 subclones/sample (range: 1,566-6,533). For the LCs of subset #169, we obtained 2,096,728 productive sequences (mean: 299,533, range: 186,637-389,258). LCs carried a higher number of distinct clonotypes/sample compared to their partner HCs (mean: 148, range: 110-205); the dominant clonotype had a mean frequency of 98.1% (range: 97.2-98.6%). Intraclonal heterogeneity was also observed in the LCs, with a mean of 6,325 subclones/sample (range: 4,651-11,444), hence more pronounced than in their partner HCs. Viewing each of the cumulative VH and VL CDR3 sequence datasets as a single entity branching through diversification enabled the identification of common sequences. In particular, 2 VH clonotypes were present in 3/6 cases, while a single VL clonotype was present in all 6 cases, albeit at varying frequencies; interestingly, this VL CDR3 sequence was also detected in all subset #2 cases, underscoring the molecular similarities between the two subsets. Focusing on SHM, the following observations were made: (i) the frequent 3-nucleotide (AGT) deletion evidenced in the VH CDR2 of subset #2 (leading to the deletion of one of 5 consecutive serine residues) was also detected in all subset #169 cases at subclonal level (average: 6% per sample, range: 0.1-10.8%); of note, the 5-serine stretch is also present in the germline VH CDR2 of the IGHV3-48 gene; (ii) the R-to-G substitution at the VL-CL linker, a ubiquitous SHM in subset #2 and previously reported as critical for IG self-association leading to cell autonomous signaling in this subset, was present in all subset #169 samples as a clonal event with a mean frequency of 98.3%; and, finally, (iii) the S-to-G substitution at position 6 of the VL CDR3, present in all subset #2 cases (mean : 44.2% ,range: 6.3-87%), was also found in all #169 samples, representing a clonal event in 1 case (97.2% of all clonotypes) and a subclonal event in the remaining 5 cases (mean: 0.6%, range: 0.4-1.1%). In conclusion, the present high-throughput sequencing data cements the immunogenetic relatedness of CLL stereotyped subsets #2 and #169, further highlighting the role of antigen selection throughout their natural history. These findings also argue for a similar pathophysiology for these subsets that could also be reflected in a similar clonal behavior, with implications for risk stratification. Disclosures Sutton: Abbvie: Honoraria; Gilead: Honoraria; Janssen: Honoraria. Stamatopoulos:Abbvie: Honoraria, Research Funding; Janssen: Honoraria, Research Funding. Chatzidimitriou:Janssen: Honoraria.

Download Full-text

PathoQC: Computationally Efficient Read Preprocessing and Quality Control for High-Throughput Sequencing Data Sets

Cancer Informatics ◽

10.4137/cin.s13890 ◽

2014 ◽

Vol 13s1 ◽

pp. CIN.S13890 ◽

Cited By ~ 1

Author(s):

Changjin Hong ◽

Solaiappan Manimaran ◽

William Evan Johnson

Keyword(s):

Quality Control ◽

High Throughput ◽

High Performance ◽

High Throughput Sequencing ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Sequencing Data ◽

Computationally Efficient ◽

High Throughput Sequencing Data ◽

Downstream Analysis

Quality control and read preprocessing are critical steps in the analysis of data sets generated from high-throughput genomic screens. In the most extreme cases, improper preprocessing can negatively affect downstream analyses and may lead to incorrect biological conclusions. Here, we present PathoQC, a streamlined toolkit that seamlessly combines the benefits of several popular quality control software approaches for preprocessing next-generation sequencing data. PathoQC provides a variety of quality control options appropriate for most high-throughput sequencing applications. PathoQC is primarily developed as a module in the PathoScope software suite for metagenomic analysis. However, PathoQC is also available as an open-source Python module that can run as a stand-alone application or can be easily integrated into any bioinformatics workflow. PathoQC achieves high performance by supporting parallel computation and is an effective tool that removes technical sequencing artifacts and facilitates robust downstream analysis. The PathoQC software package is available at http://sourceforge.net/projects/PathoScope/ .

Download Full-text

Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis

Advances in Intelligent Systems and Computing - Advanced Intelligent Systems for Sustainable Development (AI2SD’2019) ◽

10.1007/978-3-030-36664-3_43 ◽

2020 ◽

pp. 385-394

Author(s):

Razika Driouche

Keyword(s):

Cloud Computing ◽

Big Data ◽

Data Analysis ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Generation Sequencing ◽

Sequencing Data Analysis

Download Full-text

TEfinder: A Bioinformatics Pipeline for Detecting New Transposable Element Insertion Events in Next-Generation Sequencing Data

Genes ◽

10.3390/genes12020224 ◽

2021 ◽

Vol 12 (2) ◽

pp. 224

Author(s):

Vista Sohrab ◽

Cristina López-Díaz ◽

Antonio Di Pietro ◽

Li-Jun Ma ◽

Dilay Hazal Ayhan

Keyword(s):

Next Generation Sequencing Data ◽

Biological Processes ◽

Sequencing Data ◽

Short Term ◽

Bioinformatics Pipeline ◽

Genetic Changes ◽

Bioinformatics Software ◽

External Software ◽

Short Term Adaptation ◽

Generation Sequencing

Transposable elements (TEs) are mobile elements capable of introducing genetic changes rapidly. Their importance has been documented in many biological processes, such as introducing genetic instability, altering patterns of gene expression, and accelerating genome evolution. Increasing appreciation of TEs has resulted in a growing number of bioinformatics software to identify insertion events. However, the application of existing tools is limited by either narrow-focused design of the package, too many dependencies on other tools, or prior knowledge required as input files that may not be readily available to all users. Here, we reported a simple pipeline, TEfinder, developed for the detection of new TE insertions with minimal software and input file dependencies. The external software requirements are BEDTools, SAMtools, and Picard. Necessary input files include the reference genome sequence in FASTA format, an alignment file from paired-end reads, existing TEs in GTF format, and a text file of TE names. We tested TEfinder among several evolving populations of Fusarium oxysporum generated through a short-term adaptation study. Our results demonstrate that this easy-to-use tool can effectively detect new TE insertion events, making it accessible and practical for TE analysis.

Download Full-text

A bioinformatics pipeline for rare genetic diseases in South African patients

South African Journal of Science ◽

10.17159/sajs.2019/4876 ◽

2019 ◽

Vol 115 (3/4) ◽

Author(s):

Maryke Schoonen ◽

Albertus S. Seyffert ◽

Francois H. van Der Westhuizen ◽

Izelle Smuts

Keyword(s):

South Africa ◽

Next Generation Sequencing ◽

South African ◽

Rare Diseases ◽

Next Generation Sequencing Data ◽

Common Disease ◽

Next Generation ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Generation Sequencing

The research fields of bioinformatics and computational biology are growing rapidly in South Africa. Bioinformatics pipelines play an integral part in handling sequencing data, which are used to investigate the aetiology of common and rare diseases. Bioinformatics platforms for common disease aetiology are well supported and continuously being developed in South Africa. However, the same is not the case for rare diseases aetiology research. Investigations into the latter rely on international cloud-based tools for data analyses and ultimately confirmation of a genetic disease. However, these tools are not necessarily optimised for ethnically diverse population groups. We present an in-house developed bioinformatics pipeline to enable researchers to annotate and filter variants in either exome or amplicon next-generation sequencing data. This pipeline was developed using next-generation sequencing data of a predominantly African cohort of patients diagnosed with rare disease. Significance: We demonstrate the feasibility of in-country development of ethnicity-sensitive, automated bioinformatics pipelines using free software in a South African context. We provide a roadmap for development of similarly ethnicity-sensitive bioinformatics pipelines.

Download Full-text

ECCsplorer: a pipeline to detect extrachromosomal circular DNA (eccDNA) from next-generation sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04545-2 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Ludwig Mann ◽

Kathrin M. Seibt ◽

Beatrice Weber ◽

Tony Heitkam

Keyword(s):

Next Generation Sequencing ◽

Transposable Elements ◽

Data Availability ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Circular Dna ◽

Wide Range ◽

Generation Sequencing

Abstract Background Extrachromosomal circular DNAs (eccDNAs) are ring-like DNA structures physically separated from the chromosomes with 100 bp to several megabasepairs in size. Apart from carrying tandemly repeated DNA, eccDNAs may also harbor extra copies of genes or recently activated transposable elements. As eccDNAs occur in all eukaryotes investigated so far and likely play roles in stress, cancer, and aging, they have been prime targets in recent research—with their investigation limited by the scarcity of computational tools. Results Here, we present the ECCsplorer, a bioinformatics pipeline to detect eccDNAs in any kind of organism or tissue using next-generation sequencing techniques. Following Illumina-sequencing of amplified circular DNA (circSeq), the ECCsplorer enables an easy and automated discovery of eccDNA candidates. The data analysis encompasses two major procedures: first, read mapping to the reference genome allows the detection of informative read distributions including high coverage, discordant mapping, and split reads. Second, reference-free comparison of read clusters from amplified eccDNA against control sample data reveals specifically enriched DNA circles. Both software parts can be run separately or jointly, depending on the individual aim or data availability. To illustrate the wide applicability of our approach, we analyzed semi-artificial and published circSeq data from the model organisms Homo sapiens and Arabidopsis thaliana, and generated circSeq reads from the non-model crop plant Beta vulgaris. We clearly identified eccDNA candidates from all datasets, with and without reference genomes. The ECCsplorer pipeline specifically detected mitochondrial mini-circles and retrotransposon activation, showcasing the ECCsplorer’s sensitivity and specificity. Conclusion The ECCsplorer (available online at https://github.com/crimBubble/ECCsplorer) is a bioinformatics pipeline to detect eccDNAs in any kind of organism or tissue using next-generation sequencing data. The derived eccDNA targets are valuable for a wide range of downstream investigations—from analysis of cancer-related eccDNAs over organelle genomics to identification of active transposable elements.

Download Full-text