VERSO: a comprehensive framework for the inference of robust phylogenies and the quantification of intra-host genomic diversity of viral samples

Summary: We introduce VERSO, a two-step framework for the characterization of viral evolution from sequencing data of viral genomes, which improves over phylogenomic approaches for consensus sequences. VERSO exploits an efficient algorithmic strategy to return robust phylogenies from clonal variant profiles, even under sampling limitations. It then leverages variant frequency patterns to characterize the intra-host genomic diversity of samples, revealing undetected infection chains and pinpointing variants likely involved in homoplasies. On simulations, VERSO outperforms state-of-the-art tools for phylogenetic inference. Notably, the application to 6726 Amplicon and RNA-seq samples refines the estimation of SARS-CoV-2 evolution, while co-occurrence patterns of minor variants unveil undetected infection paths, which are validated with contact tracing data. Finally, the analysis of the SARS-CoV-2 mutational landscape uncovers a temporal increase of overall genomic diversity, and highlights variants transiting from minor to clonal state and homoplastic variants, some of which fall on the spike gene. Available at: https://github.com/BIMIB-DISCo/VERSO.

Host-pathogen dynamics in longitudinal clinical specimens from patients with COVID-19

10.1101/2021.04.27.21256149 ◽ 2021

Author(s): Michelle J. Lin ◽ Victoria M. Rachleff ◽ Hong Xie ◽ Lasata Shrestha ◽ Nicole A.P. Lieberman ◽ ...

Keyword(s): Bacterial Species ◽ Viral Evolution ◽ Sequencing Data ◽ Low Frequencies ◽ Consensus Sequences ◽ Viral Genomes ◽ Viral Loads ◽ Public Repositories ◽ Over Time

Abstract Background: Rapid dissemination of SARS-CoV-2 sequencing data to public repositories has enabled widespread study of viral genomes, but studies of longitudinal specimens from infected persons are relatively limited. Analysis of longitudinal specimens enables understanding of how host immune pressures drive viral evolution in vivo. Methods and findings: Here we performed sequencing of 49 longitudinal SARS-CoV-2-positive samples from 20 patients in Washington State collected between March and September of 2020. Viral loads declined over time with an average increase in RT-PCR cycle threshold (Ct) of 0.87 per day. We found that there was negligible change in SARS-CoV-2 consensus sequences over time, but identified a number of nonsynonymous variants at low frequencies across the genome. We observed enrichment for a relatively small number of these variants, all of which are now seen in consensus genomes across the globe at low prevalence. In one patient, we saw rapid emergence of various low-level deletion variants at the N-terminal domain of the spike glycoprotein, some of which have previously been shown to be associated with reduced neutralization potency from sera. In a subset of samples that were sequenced using metagenomic methods, differential gene expression analysis showed a downregulation of cytoskeletal genes that was consistent with a loss of ciliated epithelium during infection and recovery. We also identified co-occurrence of bacterial species in samples from multiple hospitalized individuals. Conclusions: These results demonstrate that the intrahost genetic composition of SARS-CoV-2 is dynamic during the course of COVID-19, and highlight the need for continued surveillance and deep sequencing of minor variants.

VCFCons: a versatile VCF-based consensus sequence generator for small genomes

10.1101/2021.02.26.433111 ◽ 2021

Author(s): Elizabeth Tseng ◽ Qiandong Zeng ◽ Lax Iyer

Keyword(s): Consensus Sequence ◽ Low Frequency ◽ Sequencing Data ◽ Consensus Sequences ◽ The Future ◽ Variant Frequency ◽ Viral Surveillance ◽ Sequence Generator ◽ Low Coverage ◽ Robust Consensus

Abstract We developed VCFCons to address the urgent need for a robust consensus sequence generator for SARS-CoV-2 viral surveillance, which presented several unique requirements, including: (a) low coverage areas should be marked with ‘N’s, and (b) low frequency or suspicious variant calls need to be filtered. We found that, while some existing tools such as bcftools can generate the desired consensus sequence, doing so required multiple filtering steps and additional scripting. VCFCons can generate consensus sequences from variant calls in VCF format with versatile filtering criteria based on coverage and estimated variant frequency. We applied VCFCons to the Labcorp SARS-CoV-2 sequencing data and showed that it generated correct consensus sequences that were successfully submitted to GISAID and NCBI. We hope the community will find value in this tool and aim to continue developing VCFCons to handle more complex viral data in the future.
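
The two requirements named in the abstract (masking low-coverage positions with 'N' and filtering low-frequency variant calls) can be illustrated with a minimal Python sketch. This is not the VCFCons interface: the function name `build_consensus`, the tuple layout of the variant calls, and the default thresholds are all hypothetical, and only single-nucleotide substitutions are handled.

```python
from typing import List, Tuple


def build_consensus(reference: str,
                    depth: List[int],
                    snvs: List[Tuple[int, str, str, float]],
                    min_depth: int = 10,    # hypothetical threshold
                    min_freq: float = 0.5   # hypothetical threshold
                    ) -> str:
    """Apply filtered SNVs to a reference and mask low-coverage sites.

    snvs holds (1-based position, ref base, alt base, variant frequency).
    """
    if len(depth) != len(reference):
        raise ValueError("depth must have one entry per reference base")

    consensus = list(reference)

    # Keep only variant calls whose estimated frequency passes the filter.
    for pos, ref, alt, freq in snvs:
        idx = pos - 1
        if reference[idx] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        if freq >= min_freq:
            consensus[idx] = alt

    # Mask every position whose coverage is too low to trust.
    for idx, d in enumerate(depth):
        if d < min_depth:
            consensus[idx] = "N"

    return "".join(consensus)


example = build_consensus(
    reference="ACGTACGT",
    depth=[50, 50, 3, 50, 50, 50, 50, 50],
    snvs=[(2, "C", "T", 0.9), (5, "A", "G", 0.1)],
)
print(example)
```

In this sketch masking is applied after variant substitution, so a variant called at a low-coverage site is still reported as 'N'; whether a real tool makes the same choice is an assumption.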

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqab002 ◽ 2021 ◽ Vol 3 (1)

Author(s): Gundula Povysil ◽ Monika Heinzl ◽ Renato Salazar ◽ Nicholas Stoler ◽ Anton Nekrutenko ◽ ...

Keyword(s): Low Frequency ◽ Variant Calling ◽ Data Loss ◽ Sequencing Data ◽ Bioinformatics Pipeline ◽ Consensus Sequences ◽ Sequencing Errors ◽ Data Output ◽ Reverse Strand ◽ Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
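
The family grouping described above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' toolset: the names `majority_consensus` and `duplex_consensus` and the `min_family_size` threshold are hypothetical, tags are assumed to be already normalized so that both strands of a molecule share one tag, and all reads are assumed to be equal-length and in reference orientation.

```python
from collections import Counter, defaultdict


def majority_consensus(seqs):
    """Most common base at each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def duplex_consensus(reads, min_family_size=3):
    """Group reads by tag and strand, then build duplex consensus sequences.

    reads is an iterable of (tag, strand, sequence) with strand "+" or "-".
    Returns (dcs, unpaired): dcs maps tag -> duplex consensus; unpaired lists
    tags that lack a large enough family on one or both strands.
    """
    families = defaultdict(lambda: {"+": [], "-": []})
    for tag, strand, seq in reads:
        families[tag][strand].append(seq)

    dcs, unpaired = {}, []
    for tag, fam in families.items():
        fwd, rev = fam["+"], fam["-"]
        if len(fwd) >= min_family_size and len(rev) >= min_family_size:
            sscs_fwd = majority_consensus(fwd)  # single-strand consensus
            sscs_rev = majority_consensus(rev)
            # A base is kept only if both strands agree; otherwise mask it.
            dcs[tag] = "".join(a if a == b else "N"
                               for a, b in zip(sscs_fwd, sscs_rev))
        else:
            unpaired.append(tag)  # the reads that a strict pipeline discards
    return dcs, unpaired


reads = [
    ("AAT", "+", "ACGT"), ("AAT", "+", "ACGT"), ("AAT", "+", "ACGA"),
    ("AAT", "-", "ACGT"), ("AAT", "-", "ACGT"), ("AAT", "-", "ACGT"),
    ("GGC", "+", "ACGT"),
]
dcs, unpaired = duplex_consensus(reads)
print(dcs, unpaired)
```

The `unpaired` list makes the data loss visible: every tag in it corresponds to reads that never contribute to a duplex consensus, which is the category the abstract proposes to re-examine.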

SARS-CoV-2 Sequence Characteristics of COVID-19 Persistence and Reinfection

Clinical Infectious Diseases ◽ 10.1093/cid/ciab380 ◽ 2021

Author(s): Manish C Choudhary ◽ Charles R Crain ◽ Xueting Qiu ◽ William Hanage ◽ Jonathan Z Li

Keyword(s): Persistent Infection ◽ Viral Evolution ◽ Phylogenetic Reconstruction ◽ Geographic Region ◽ Antibody Treatment ◽ Viral Genomes ◽ Monoclonal Antibody Treatment ◽ Viral Sequences ◽ Sequence Characteristics ◽ Baseline Health

Abstract Background: Both SARS-CoV-2 reinfection and persistent infection have been reported, but sequence characteristics in these scenarios have not been described. We assessed published cases of SARS-CoV-2 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection. Methods: A systematic review of PubMed was conducted to identify cases of SARS-CoV-2 reinfection and persistence with available sequences. Nucleotide and amino acid changes in the reinfecting sequence were compared to both the initial and contemporaneous community variants. Time-measured phylogenetic reconstruction was performed to compare intra-host viral evolution in persistent SARS-CoV-2 to community-driven evolution. Results: Twenty reinfection and nine persistent infection cases were identified. Reinfection cases spanned a broad distribution of ages, baseline health status, and reinfection severity, and occurred from as early as 1.5 months to more than 8 months after the initial infection. The reinfecting viral sequences had a median of 17.5 nucleotide changes with enrichment in the ORF8 and N genes. The number of changes did not differ by the severity of reinfection, and reinfecting variants were similar to the contemporaneous sequences circulating in the community. Patients with persistent COVID-19 demonstrated more rapid accumulation of sequence changes than seen with community-driven evolution, with continued evolution during convalescent plasma or monoclonal antibody treatment. Conclusions: Reinfecting SARS-CoV-2 viral genomes largely mirror contemporaneous circulating sequences in that geographic region, while persistent COVID-19 has been largely described in immunosuppressed individuals and is associated with accelerated viral evolution.

Whole-Genome Sequence Analysis of Pseudorabies Virus Clinical Isolates from Pigs in China between 2012 and 2017 in China

Viruses ◽ 10.3390/v13071322 ◽ 2021 ◽ Vol 13 (7) ◽ pp. 1322

Author(s): Ruiming Hu ◽ Leyi Wang ◽ Qingyun Liu ◽ Lin Hua ◽ Xi Huang ◽ ...

Keyword(s): Selection Pressure ◽ Pseudorabies Virus ◽ Viral Evolution ◽ Infectious Agent ◽ Genomic Diversity ◽ Whole Genome Sequence ◽ Evolutionary Constraint ◽ The Novel ◽ Novel Variants ◽ Pressure Analysis

Pseudorabies virus (PRV) is an economically significant swine infectious agent. A PRV outbreak involving novel virulent variants took place in China in 2011. The association of viral genomic variability with pathogenicity is not fully confirmed, and knowledge of PRV genomic diversity and evolution is still limited. Here, we sequenced 54 genomes of novel PRV variants isolated in China from 2012 to 2017. Phylogenetic analysis revealed that China strains and US/Europe strains were classified into two separate genotypes. PRV strains isolated from 2012 to 2017 in China are highly related to each other and genetically close to classic China strains such as Ea, Fa, and SC. RDP analysis revealed 23 recombination events within novel PRV variants, indicating that recombination contributes significantly to viral evolution. The selection pressure analysis indicated that most ORFs were under evolutionary constraint, and 19 amino acid residue sites in 15 ORFs were identified as being under positive selection. Additionally, 37 unique mutations were identified in 19 ORFs, which distinguish the novel variants from classic strains. Overall, our study suggests that novel PRV variants might have evolved from classical PRV strains through point mutation and recombination mechanisms.

Phylogenomics reveals viral sources, transmission, and potential superinfection in early-stage COVID-19 patients in Ontario, Canada

Scientific Reports ◽ 10.1038/s41598-021-83355-1 ◽ 2021 ◽ Vol 11 (1)

Author(s): Calvin P. Sjaarda ◽ Nazneen Rustom ◽ Gerald A. Evans ◽ David Huang ◽ Santiago Perez-Patrigeon ◽ ...

Keyword(s): Phylogenetic Analysis ◽ Disease Surveillance ◽ Early Stage ◽ Emerging Diseases ◽ Contact Tracing ◽ Travel History ◽ Infectious Disease Surveillance ◽ Viral Genomes ◽ Heterozygous Variant ◽ Sequencing Platforms

Abstract The emergence and rapid global spread of SARS-CoV-2 demonstrates the importance of infectious disease surveillance, particularly during the early stages. Viral genomes can provide key insights into transmission chains and pathogenicity. Nasopharyngeal swabs were obtained from thirty-two of the first SARS-CoV-2 positive cases (March 18–30) in Kingston, Ontario, Canada. Viral genomes were sequenced using Ion Torrent (n = 24) and MinION (n = 27) sequencing platforms. SARS-CoV-2 genomes carried forty-six polymorphic sites including two missense and three synonymous variants in the spike protein gene. The D614G point mutation characterized the predominant viral strain in our cohort (92.6%). A heterozygous variant (C9994A) was detected by both sequencing platforms but filtered by the ARTIC network bioinformatic pipeline, suggesting that heterozygous variants may be underreported in the SARS-CoV-2 literature. Phylogenetic analysis with 87,738 genomes in the GISAID database identified global origins and transmission events including multiple international introductions as well as community spread. Reported travel history validated viral introduction and transmission inferred by phylogenetic analysis. Molecular epidemiology and evolutionary phylogenetics may complement contact tracing and help reconstruct transmission chains of emerging diseases. Earlier detection and screening in this way could improve the effectiveness of regional public health interventions to limit future pandemics.

Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data

Bioinformatics ◽ 10.1093/bioinformatics/btq343 ◽ 2010 ◽ Vol 26 (17) ◽ pp. 2101-2108 ◽ Cited By ~ 27

Author(s): Jiří Macas ◽ Pavel Neumann ◽ Petr Novák ◽ Jiming Jiang

Keyword(s): Large Scale ◽ Rice Genome ◽ Supplementary Information ◽ Sequencing Data ◽ Satellite Repeat ◽ Frequency Spectra ◽ Consensus Sequences ◽ Chip Sequencing ◽ Conserved Sequence ◽ Centromeric Satellite

Abstract Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Supplementary information: Supplementary data are available at Bioinformatics online.
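
As an illustration of the alignment-free idea, the following Python sketch computes a normalized k-mer frequency spectrum from a set of reads and compares two spectra. It is a generic sketch, not the authors' software: the function names are hypothetical, and the total variation distance used here is an assumed stand-in, since the abstract does not state which divergence measure was used.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return relative frequencies of all k-mers found in the reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def spectrum_distance(a, b):
    """Total variation distance between two spectra (0 = identical)."""
    kmers = set(a) | set(b)
    return 0.5 * sum(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in kmers)


spec = kmer_spectrum(["ACGAC"], k=2)
print(spec)
print(spectrum_distance(kmer_spectrum(["AAAA"], 2), kmer_spectrum(["CCCC"], 2)))
```

Because no alignment is needed, the same two functions apply unchanged to short next-generation reads and to longer Sanger reads, which is the practical advantage the abstract points to.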

Control of artefactual variation in reported inter-sample relatedness during clinical use of a Mycobacterium tuberculosis sequencing pipeline

10.1101/252460 ◽ 2018

Author(s): David H Wyllie ◽ Nicholas Sanderson ◽ Richard Myers ◽ Tim Peto ◽ Esther Robinson ◽ ...

Keyword(s): Consensus Sequence ◽ Read Depth ◽ Pairwise Distance ◽ Contact Tracing ◽ Clinical Samples ◽ Bacterial DNA ◽ Consensus Sequences ◽ Minor Variant ◽ Validation Set ◽ Genomic Regions

Abstract Contact tracing requires reliable identification of closely related bacterial isolates. When we noticed the reporting of artefactual variation between M. tuberculosis isolates during routine next generation sequencing of Mycobacterium spp., we investigated its basis in 2,018 consecutive M. tuberculosis isolates. In the routine process used, clinical samples were decontaminated and inoculated into broth cultures; from positive broth cultures DNA was extracted, sequenced, reads mapped, and consensus sequences determined. We investigated the process of consensus sequence determination, which selects the most common nucleotide at each position. Having determined the high-quality read depth and depth of minor variants across 8,006 M. tuberculosis genomic regions, we quantified the relationship between the minor variant depth and the amount of non-Mycobacterial bacterial DNA, which originates from commensal microbes killed during sample decontamination. In the presence of non-Mycobacterial bacterial DNA, we found significant increases in minor variant frequencies of more than 1.5 fold in 242 regions covering 5.1% of the M. tuberculosis genome. Included within these were four high variation regions strongly influenced by the amount of non-Mycobacterial bacterial DNA. Excluding these four regions from pairwise distance comparisons reduced biologically implausible variation from 5.2% to 0% in an independent validation set derived from 226 individuals. Thus, we have demonstrated an approach identifying critical genomic regions contributing to clinically relevant artefactual variation in bacterial similarity searches. The approach described monitors the outputs of the complex multi-step laboratory and bioinformatics process, allows periodic process adjustments, and will have application to quality control of routine bacterial genomics.
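
The consensus rule mentioned above (select the most common nucleotide at each position) and the accompanying minor variant frequency can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline described in the abstract: the function name `consensus_and_minor` and the input layout (one dictionary of base counts per position) are hypothetical, and uncovered positions are reported as 'N' by assumption.

```python
def consensus_and_minor(pileup):
    """Call a consensus base and minor variant frequency per position.

    pileup is a list with one dict per position mapping base -> read count.
    The minor variant frequency is the fraction of reads that do not
    support the consensus base.
    """
    consensus = []
    minor_freqs = []
    for counts in pileup:
        depth = sum(counts.values())
        if depth == 0:
            # No reads at this position: report it as unknown.
            consensus.append("N")
            minor_freqs.append(0.0)
            continue
        base, top = max(counts.items(), key=lambda kv: kv[1])
        consensus.append(base)
        minor_freqs.append((depth - top) / depth)
    return "".join(consensus), minor_freqs


seq, minor = consensus_and_minor([
    {"A": 90, "C": 10},  # 10% of reads disagree with the consensus
    {"G": 50},           # unanimous position
    {},                  # uncovered position
])
print(seq, minor)
```

A position such as the first one still yields consensus 'A', but its elevated minor variant frequency is the kind of signal the abstract relates to contaminating non-Mycobacterial DNA.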

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽ 10.3390/v13102006 ◽ 2021 ◽ Vol 13 (10) ◽ pp. 2006

Author(s): Anna Y Budkina ◽ Elena V Korneenko ◽ Ivan A Kotov ◽ Daniil A Kiselev ◽ Ilya V Artyushin ◽ ...

Keyword(s): Large Scale ◽ High Throughput Sequencing ◽ Metagenomic Data ◽ Sequencing Data ◽ Viral Pathogens ◽ Genomic Databases ◽ Bioinformatic Pipeline ◽ Viral Genomes ◽ Sequencing Technologies ◽ Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and naturally even fewer are represented in the genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening, due to the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.

Swiss public health measures associated with reduced SARS-CoV-2 transmission using genome data

10.1101/2021.11.11.21266107 ◽ 2021

Author(s): Sarah A. Nadeau ◽ Timothy G. Vaughan ◽ Christiane Beckmann ◽ Ivan Topolsky ◽ Chaoran Chen ◽ ...

Keyword(s): Transmission Dynamics ◽ Added Value ◽ Outbreak Detection ◽ Contact Tracing ◽ Sequencing Data ◽ Genome Sequences ◽ Health Measures ◽ Genome Data ◽ Transmission Chain ◽ Local Transmission

Genome sequences allow quantification of changes in case introductions from abroad and local transmission dynamics. We sequenced 11,357 SARS-CoV-2 genomes from Switzerland in 2020, the 6th largest effort globally. Using these data, we estimated introductions and their persistence throughout 2020. By contrasting estimates with null models, we estimate that at least 83% of introductions were averted during Switzerland's border closures. Further, transmission chain persistence roughly doubled after the partial lockdown was lifted. Then, using a novel phylodynamic method, we suggest that transmission in newly introduced outbreaks slowed by 36–64% upon outbreak detection in summer 2020, but not in fall. This could indicate successful contact tracing over summer before overburdening in fall. The study highlights the added value of genome sequencing data for understanding transmission dynamics.
