WgLink: reconstructing whole-genome viral haplotypes using L0 + L1-regularization

Author(s):  
Chen Cao ◽  
Matthew Greenberg ◽  
Quan Long

Abstract Summary Many tools can reconstruct viral sequences based on next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0+L1-regularized regression, synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real datasets while using significantly less memory (RAM) and fewer CPU hours. Availability and implementation Source code and binaries are freely available at https://github.com/theLongLab/wglink. Supplementary information Supplementary data are available at Bioinformatics online.
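The idea of an L0+L1-regularized regression over candidate strains can be illustrated with a toy example. The sketch below is not the WgLink implementation: it assumes a handful of hypothetical candidate haplotypes (0/1 alleles at three variant sites), an observed vector of variant allele frequencies, and made-up penalty weights `lam1` and `lam0`. It enumerates every subset of candidates (the L0 part, feasible only for tiny inputs), fits non-negative frequencies on each subset by coordinate descent with an L1 penalty, and keeps the subset with the lowest penalized error. Physical linkage information, which the abstract says WgLink also uses, is not modeled here.

```python
from itertools import combinations


def fit_l1_nonneg(columns, y, lam1, sweeps=500):
    """Coordinate descent for 0.5*||y - H f||^2 + lam1*sum(f), subject to f >= 0."""
    f = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, hj in enumerate(columns):
            # Residual with haplotype j's own contribution left out.
            partial = [
                yi - sum(f[k] * columns[k][i] for k in range(len(columns)) if k != j)
                for i, yi in enumerate(y)
            ]
            rho = sum(h * r for h, r in zip(hj, partial))
            norm = sum(h * h for h in hj)
            # Soft-threshold by lam1 and clip at zero (frequencies cannot be negative).
            f[j] = max(0.0, (rho - lam1) / norm) if norm > 0 else 0.0
    return f


def objective(columns, f, y, lam1, lam0):
    """Squared error plus L1 penalty on frequencies plus L0 penalty on model size."""
    fitted = [sum(f[k] * columns[k][i] for k in range(len(columns))) for i in range(len(y))]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 0.5 * rss + lam1 * sum(f) + lam0 * len(columns)


def select_haplotypes(candidates, y, lam1=0.01, lam0=0.02):
    """Brute-force search over supports; returns (chosen indices, fitted frequencies)."""
    best = (objective([], [], y, lam1, lam0), (), [])
    for size in range(1, len(candidates) + 1):
        for support in combinations(range(len(candidates)), size):
            cols = [candidates[j] for j in support]
            f = fit_l1_nonneg(cols, y, lam1)
            score = objective(cols, f, y, lam1, lam0)
            if score < best[0]:
                best = (score, support, f)
    return best[1], best[2]


# Hypothetical data: four candidate strains, three variant sites.
CANDIDATES = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
# Allele frequencies produced by mixing 70% of strain 0 with 30% of strain 1.
OBSERVED = [0.7, 0.3, 1.0]

support, freqs = select_haplotypes(CANDIDATES, OBSERVED)
```

In this toy setting several mixtures of the four candidates reproduce the observed frequencies exactly, and the two penalties together favor the explanation with the fewest strains. The L1 term also shrinks the fitted frequencies slightly below the true values.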

2020 ◽  
Author(s):  
Chen Cao ◽  
Matthew Greenberg ◽  
Quan Long

Abstract Many tools can reconstruct viral sequences based on next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0 + L1-regularized regression synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours. Source code and binaries are freely available at https://github.com/theLongLab/wglink.


2019 ◽  
Author(s):  
Christina J. Castro ◽  
Rachel L. Marine ◽  
Edward Ramos ◽  
Terry Fei Fan Ng

Abstract Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. The next-generation sequencing (NGS) approach has surpassed Sanger sequencing for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. Our results from >15,000 simulated experiments showed that the presence of variants can turn an assembly of one genome into tens to thousands of contigs. This “variant interference” (VI) is highly consistent and reproducible across the ten most used de novo assemblers, and occurs independently of genome length, read length, and GC content. The main driver of VI is the pairwise identity between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allows the “rescue” of full viral genomes from fragmented contigs. These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
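Pairwise identity, the quantity the abstract names as the main driver of variant interference, can be computed in several ways. The sketch below shows one common convention and is only an assumption about how it might be measured: the two variant sequences are taken to be already aligned to the same length, `-` marks a gap, and columns where both sequences have a gap are skipped. The function name `pairwise_identity` is hypothetical.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of aligned columns where two pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # a gap in both sequences carries no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return matches / compared


# Two hypothetical variants that differ at one of eight positions.
identity = pairwise_identity("ACGTACGT", "ACGTACGA")
```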


2019 ◽  
Vol 14 (7) ◽  
pp. 453-460
Author(s):  
Cheng Xu ◽  
Jiehao Xu ◽  
Jiating Liu ◽  
Yu Chen ◽  
Øystein Evensen ◽  
...  

The Chinese soft-shelled turtle (Pelodiscus sinensis) has become one of the leading cultured organisms in China and South East Asia. The objective of the present study was to use next-generation sequencing to identify viral genomes present in liver tissues from Chinese soft-shelled turtles in China. BLAST analysis of viral sequences from liver samples showed high homology with the human adenovirus (HAdV) penton base and encapsidation proteins. This homology points to the possible existence of HAdV in freshwater environments used for the culture of soft-shelled turtles. Therefore, our findings merit further investigation to determine possible contamination with HAdV in aquaculture environments and the possible role of the Chinese soft-shelled turtle in transmitting HAdV to humans.


Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online.
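The principle described above, that loci replicating early are over-represented in reads from proliferating cells, can be illustrated with a small sketch. This is not TIGER's pipeline: it assumes read counts have already been tallied in hypothetical fixed windows along one chromosome, and it applies only a moving average and a z-score. The corrections for other sources of coverage variation that the abstract mentions are omitted.

```python
from statistics import mean, pstdev


def timing_profile(window_counts, smooth=3):
    """Smooth per-window read counts and standardize them to mean 0, sd 1.

    Higher values suggest relatively earlier replication under the assumption
    that copy number in proliferating cells tracks replication timing.
    """
    half = smooth // 2
    smoothed = []
    for i in range(len(window_counts)):
        lo = max(0, i - half)
        hi = min(len(window_counts), i + half + 1)
        smoothed.append(mean(window_counts[lo:hi]))
    mu = mean(smoothed)
    sd = pstdev(smoothed)
    if sd == 0:
        return [0.0] * len(smoothed)  # flat coverage carries no timing signal
    return [(value - mu) / sd for value in smoothed]


# Hypothetical read counts for seven consecutive windows.
profile = timing_profile([130, 125, 120, 100, 80, 75, 70])
```

With these made-up counts the first windows receive positive scores and the last windows negative ones, which is the pattern an early-to-late replication gradient would produce.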


Gut Pathogens ◽  
2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Andreas Papoutsis ◽  
Thomas Borody ◽  
Siba Dolai ◽  
Jordan Daniels ◽  
Skylar Steinberg ◽  
...  

Abstract Background SARS-CoV-2 has been detected not only in respiratory secretions, but also in stool collections. Here we sought to identify SARS-CoV-2 by enrichment next-generation sequencing (NGS) from fecal samples, and to utilize whole genome analysis to characterize SARS-CoV-2 mutational variations in COVID-19 patients. Results Study participants underwent testing for SARS-CoV-2 from fecal samples by whole genome enrichment NGS (n = 14), and RT-PCR nasopharyngeal swab analysis (n = 12). The concordance of SARS-CoV-2 detection by enrichment NGS from stools with RT-PCR nasopharyngeal analysis was 100%. Unique variants were identified in four patients, with a total of 33 different mutations among those in which SARS-CoV-2 was detected by whole genome enrichment NGS. Conclusion These results highlight the potential viability of SARS-CoV-2 in feces, its ongoing mutational accumulation, and its possible role in fecal–oral transmission. This study also elucidates the advantages of SARS-CoV-2 enrichment NGS, which may be a key methodology to document complete viral eradication. Trial registration ClinicalTrials.gov, NCT04359836, Registered 24 April 2020, https://clinicaltrials.gov/ct2/show/NCT04359836?term=NCT04359836&draw=2&rank=1.


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
M. C. Zanella ◽  
S. Cordey ◽  
F. Laubscher ◽  
M. Docquier ◽  
G. Vieille ◽  
...  

Abstract Background Viral infections are common complications following allogeneic hematopoietic stem cell transplantation (allo-HSCT). Allo-HSCT recipients with steroid-refractory/dependent graft-versus-host disease (GvHD) are highly immunosuppressed and are more vulnerable to infections with weakly pathogenic or commensal viruses. Here, twenty-five adult allo-HSCT recipients from 2016 to 2019 with acute or chronic steroid-refractory/dependent GvHD were enrolled in a prospective cohort at Geneva University Hospitals. We performed metagenomics next-generation sequencing (mNGS) analysis using a validated pipeline and de novo analysis on pooled routine plasma samples collected throughout the period of intensive steroid treatment or second-line GvHD therapy to identify weakly pathogenic, commensal, and unexpected viruses. Results Median duration of intensive immunosuppression was 5.1 months (IQR 5.5). GvHD-related mortality rate was 36%. mNGS analysis detected viral nucleotide sequences in 24/25 patients. Sequences of ≥ 3 distinct viruses were detected in 16/25 patients; Anelloviridae (24/25) and human pegivirus-1 (9/25) were the most prevalent. In 7 patients with fatal outcomes, viral sequences not assessed by routine investigations were identified with mNGS and confirmed by RT-PCR. These cases included Usutu virus (1), rubella virus (1 vaccine strain and 1 wild-type), novel human astrovirus (HAstV) MLB2 (1), classic HAstV (1), human polyomavirus 6 and 7 (2), cutavirus (1), and bufavirus (1). Conclusions Clinically unrecognized viral infections were identified in 28% of highly immunocompromised allo-HSCT recipients with steroid-refractory/dependent GvHD in consecutive samples. These identified viruses have all been previously described in humans, but have poorly understood clinical significance. Rubella virus identification raises the possibility of re-emergence from past infections or vaccinations, or re-infection.


Author(s):  
Manish C Choudhary ◽  
Charles R Crain ◽  
Xueting Qiu ◽  
William Hanage ◽  
Jonathan Z Li

Abstract Background Both SARS-CoV-2 reinfection and persistent infection have been reported, but sequence characteristics in these scenarios have not been described. We assessed published cases of SARS-CoV-2 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection. Methods A systematic review of PubMed was conducted to identify cases of SARS-CoV-2 reinfection and persistence with available sequences. Nucleotide and amino acid changes in the reinfecting sequence were compared to both the initial and contemporaneous community variants. Time-measured phylogenetic reconstruction was performed to compare intra-host viral evolution in persistent SARS-CoV-2 to community-driven evolution. Results Twenty reinfection and nine persistent infection cases were identified. Reports of reinfection cases spanned a broad distribution of ages, baseline health status, and reinfection severity, and reinfection occurred as early as 1.5 months or >8 months after the initial infection. The reinfecting viral sequences had a median of 17.5 nucleotide changes with enrichment in the ORF8 and N genes. The number of changes did not differ by the severity of reinfection, and reinfecting variants were similar to the contemporaneous sequences circulating in the community. Patients with persistent COVID-19 demonstrated more rapid accumulation of sequence changes than seen with community-driven evolution, with continued evolution during convalescent plasma or monoclonal antibody treatment. Conclusions Reinfecting SARS-CoV-2 viral genomes largely mirror contemporaneous circulating sequences in that geographic region, while persistent COVID-19 has been largely described in immunosuppressed individuals and is associated with accelerated viral evolution.


2020 ◽  
Vol 36 (12) ◽  
pp. 3669-3679 ◽  
Author(s):  
Can Firtina ◽  
Jeremie S Kim ◽  
Mohammed Alser ◽  
Damla Senol Cali ◽  
A Ercument Cicek ◽  
...  

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.
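The decoding step named in the abstract, running the Viterbi algorithm over a trained hidden Markov model, can be sketched on a much smaller model than a profile HMM. The example below is a generic two-state HMM, not Apollo's pHMM: the states `H` and `L` (loosely, GC-rich and GC-poor regions) and all probabilities are illustrative assumptions, and no Forward–Backward training is performed.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a sequence of observations."""
    # Log-space scores avoid numerical underflow on long sequences.
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_score = {}
        pointers = {}
        for s in states:
            # Best predecessor state for arriving in state s at this position.
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative parameters only; a real model would be trained from alignments.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

decoded = "".join(viterbi("GGCACTGAA", STATES, START, TRANS, EMIT))
```

In a profile HMM the states would instead be match, insertion, and deletion states tied to positions of the assembly, and the decoded path would spell out the corrected sequence. The dynamic-programming recurrence is the same.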

