WgLink: reconstructing whole-genome viral haplotypes using L0 + L1-regularization

Abstract Summary Many tools can reconstruct viral sequences based on next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing the whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0+L1-regularized regression, synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools both on simulated and on real datasets while using significantly less memory (RAM) and fewer CPU hours. Availability and implementation Source code and binaries are freely available at https://github.com/theLongLab/wglink. Supplementary information Supplementary data are available at Bioinformatics online.
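
The L0 + L1-regularized regression mentioned in this abstract can be illustrated with a small toy problem. The sketch below is not the WgLink implementation: the objective, the brute-force search over supports, the function names (`nonneg_lasso`, `fit_l0_l1`) and the example data are all assumptions made for illustration. It picks a sparse set of candidate haplotypes whose non-negative frequencies best explain observed variant allele frequencies, charging `lam0` per selected haplotype (the L0 term) and `lam1` per unit of frequency (the L1 term).

```python
from itertools import combinations


def nonneg_lasso(cols, y, lam1, sweeps=500):
    """Coordinate descent for min 0.5*||y - sum_j f_j*cols[j]||^2 + lam1*sum(f), f >= 0."""
    f = [0.0] * len(cols)
    resid = list(y)  # residual y - H f; equals y while f is all zeros
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            norm = sum(c * c for c in col)
            if norm == 0:
                continue
            # Correlate column j with the residual that excludes its own contribution.
            rho = sum(c * (r + c * f[j]) for c, r in zip(col, resid))
            new = max(0.0, (rho - lam1) / norm)
            delta = new - f[j]
            if delta:
                resid = [r - c * delta for c, r in zip(col, resid)]
                f[j] = new
    return f, resid


def fit_l0_l1(candidates, y, lam0, lam1):
    """Exhaustive best-subset search (toy sizes only); L0 penalty = lam0 * subset size."""
    best = ((), [], 0.5 * sum(v * v for v in y))  # empty support as the baseline
    for k in range(1, len(candidates) + 1):
        for support in combinations(range(len(candidates)), k):
            cols = [candidates[j] for j in support]
            f, resid = nonneg_lasso(cols, y, lam1)
            obj = 0.5 * sum(r * r for r in resid) + lam1 * sum(f) + lam0 * k
            if obj < best[2]:
                best = (support, f, obj)
    return best


# Hypothetical data: four candidate haplotypes over four variant sites (1 = alternate allele).
candidates = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
# Allele frequencies produced by mixing haplotype 0 (70%) and haplotype 1 (30%).
observed_vaf = [0.7, 0.3, 1.0, 0.0]

support, freqs, objective = fit_l0_l1(candidates, observed_vaf, lam0=0.01, lam1=0.001)
print(support, [round(x, 3) for x in freqs])
```

Enumerating every support is exponential in the number of candidates, so this only works at toy scale; it is meant to show how the two penalties interact, not how a practical solver would be built.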

WgLink: reconstructing whole-genome viral haplotypes using L0 + L1-regularization

10.1101/2020.08.14.251835 ◽

2020 ◽

Author(s):

Chen Cao ◽

Matthew Greenberg ◽

Quan Long

Keyword(s):

Real Data ◽

Data Sets ◽

Whole Genome ◽

Regularized Regression ◽

Viral Genomes ◽

Physical Linkage ◽

Multiple Regions ◽

Viral Sequences ◽

Multiple Variants ◽

Generation Sequencing

Abstract Many tools can reconstruct viral sequences based on next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing the whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0 + L1-regularized regression, synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours. Source code and binaries are freely available at https://github.com/theLongLab/wglink.

The effect of variant interference on de novo assembly for viral deep sequencing

10.1101/815480 ◽

2019 ◽

Cited By ~ 1

Author(s):

Christina J. Castro ◽

Rachel L. Marine ◽

Edward Ramos ◽

Terry Fei Fan Ng

Keyword(s):

Deep Sequencing ◽

De Novo ◽

Gc Content ◽

Read Length ◽

Viral Genomes ◽

Minor Variant ◽

Main Driver ◽

Next Generation Sequencing Ngs ◽

Viral Sequences ◽

Generation Sequencing

Abstract Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. The next-generation sequencing (NGS) approach has surpassed Sanger sequencing for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. Our results from >15,000 simulated experiments showed that the presence of variants can turn an assembly of one genome into tens to thousands of contigs. This “variant interference” (VI) is highly consistent and reproducible across the ten most used de novo assemblers, and occurs independent of genome length, read length, and GC content. The main driver of VI is pairwise identities between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allows the “rescue” of full viral genomes from fragmented contigs. These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
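
The abstract above names pairwise identity between variants as the main driver of variant interference. As a minimal sketch of that quantity, assuming two already-aligned sequences of equal length with `-` as the gap character (the function name and the gap-handling rule are illustrative choices, not taken from the paper):

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of matching positions among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only alignment columns in which both sequences carry a base.
    columns = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return matches / len(columns)


# Hypothetical variants differing at one of eight aligned positions.
identity = pairwise_identity("ACGTACGT", "ACGTTCGT")
print(identity)
```

Other conventions exist, for example counting gap columns as mismatches, and they give different values for the same alignment, so the convention should be stated whenever identities are compared.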

Human adenovirus penton base and encapsidation sequences detected in Pelodiscus sinensis by next generation sequencing

Future Virology ◽

10.2217/fvl-2019-0056 ◽

2019 ◽

Vol 14 (7) ◽

pp. 453-460

Author(s):

Cheng Xu ◽

Jiehao Xu ◽

Jiating Liu ◽

Yu Chen ◽

Øystein Evensen ◽

...

Keyword(s):

Next Generation Sequencing ◽

Human Adenovirus ◽

South East Asia ◽

Next Generation ◽

Penton Base ◽

Pelodiscus Sinensis ◽

Viral Genomes ◽

Viral Sequences ◽

Generation Sequencing

The Chinese soft-shelled turtle (Pelodiscus sinensis) has become one of the leading cultured organisms in China and South East Asia. The objective of the present study was to use next-generation sequencing to identify viral genomes present in liver tissues from Chinese soft-shelled turtles in China. BLAST analysis of viral sequences from liver samples showed high homology with the human adenovirus (HAdV) penton base and encapsidation proteins. This homology points to the possible existence of HAdV in freshwater environments used for the culture of soft-shelled turtles. Our findings therefore merit further investigation to determine possible HAdV contamination of aquaculture environments and the possible role of the Chinese soft-shelled turtle in transmitting HAdV to humans.

Comparative whole genome analysis of nucleopolyhedroviruses infecting saturniid silkworms by next-generation sequencing

10.1603/ice.2016.114235 ◽

2016 ◽

Author(s):

Jun Kobayashi

Keyword(s):

Next Generation Sequencing ◽

Genome Analysis ◽

Whole Genome ◽

Next Generation ◽

Whole Genome Analysis ◽

Generation Sequencing

Laboratory implementation of Next Generation Sequencing for medium and high throughput whole genome analysis of clinical M. tuberculosis complex strains

10.26226/morressier.5991c409d462b80292388d15 ◽

2017 ◽

Author(s):

Christian Utpatel

Keyword(s):

Next Generation Sequencing ◽

High Throughput ◽

Genome Analysis ◽

Whole Genome ◽

Next Generation ◽

Tuberculosis Complex ◽

Whole Genome Analysis ◽

Generation Sequencing

TIGER: inferring DNA replication timing from whole-genome sequence data

Bioinformatics ◽

10.1093/bioinformatics/btab166 ◽

2021 ◽

Cited By ~ 1

Author(s):

Amnon Koren ◽

Dashiell J Massey ◽

Alexa N Bracci

Keyword(s):

Dna Replication ◽

Genome Sequence ◽

Genomic Dna ◽

Sequence Data ◽

Replication Timing ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Genome Sequence Data ◽

Dna Replication Timing

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online

Detection of SARS-CoV-2 from patient fecal samples by whole genome sequencing

Gut Pathogens ◽

10.1186/s13099-021-00398-5 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Andreas Papoutsis ◽

Thomas Borody ◽

Siba Dolai ◽

Jordan Daniels ◽

Skylar Steinberg ◽

...

Keyword(s):

Next Generation Sequencing ◽

Nasopharyngeal Swab ◽

Whole Genome ◽

Rt Pcr ◽

Whole Genome Analysis ◽

Fecal Samples ◽

Oral Transmission ◽

Study Participants ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Background SARS-CoV-2 has been detected not only in respiratory secretions, but also in stool collections. Here we sought to identify SARS-CoV-2 by enrichment next-generation sequencing (NGS) from fecal samples, and to utilize whole genome analysis to characterize SARS-CoV-2 mutational variations in COVID-19 patients. Results Study participants underwent testing for SARS-CoV-2 from fecal samples by whole genome enrichment NGS (n = 14), and RT-PCR nasopharyngeal swab analysis (n = 12). The concordance of SARS-CoV-2 detection by enrichment NGS from stools with RT-PCR nasopharyngeal analysis was 100%. Unique variants were identified in four patients, with a total of 33 different mutations among those in which SARS-CoV-2 was detected by whole genome enrichment NGS. Conclusion These results highlight the potential viability of SARS-CoV-2 in feces, its ongoing mutational accumulation, and its possible role in fecal–oral transmission. This study also elucidates the advantages of SARS-CoV-2 enrichment NGS, which may be a key methodology to document complete viral eradication. Trial registration ClinicalTrials.gov, NCT04359836, Registered 24 April 2020, https://clinicaltrials.gov/ct2/show/NCT04359836?term=NCT04359836&draw=2&rank=1.

Unmasking viral sequences by metagenomic next-generation sequencing in adult human blood samples during steroid-refractory/dependent graft-versus-host disease

Microbiome ◽

10.1186/s40168-020-00953-3 ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

M. C. Zanella ◽

S. Cordey ◽

F. Laubscher ◽

M. Docquier ◽

G. Vieille ◽

...

Keyword(s):

Next Generation Sequencing ◽

Graft Versus Host Disease ◽

Rubella Virus ◽

Viral Infections ◽

Next Generation ◽

Host Disease ◽

Graft Versus Host ◽

Viral Sequences ◽

Generation Sequencing ◽

Steroid Refractory

Abstract Background Viral infections are common complications following allogeneic hematopoietic stem cell transplantation (allo-HSCT). Allo-HSCT recipients with steroid-refractory/dependent graft-versus-host disease (GvHD) are highly immunosuppressed and are more vulnerable to infections with weakly pathogenic or commensal viruses. Here, twenty-five adult allo-HSCT recipients from 2016 to 2019 with acute or chronic steroid-refractory/dependent GvHD were enrolled in a prospective cohort at Geneva University Hospitals. We performed metagenomics next-generation sequencing (mNGS) analysis using a validated pipeline and de novo analysis on pooled routine plasma samples collected throughout the period of intensive steroid treatment or second-line GvHD therapy to identify weakly pathogenic, commensal, and unexpected viruses. Results Median duration of intensive immunosuppression was 5.1 months (IQR 5.5). GvHD-related mortality rate was 36%. mNGS analysis detected viral nucleotide sequences in 24/25 patients. Sequences of ≥ 3 distinct viruses were detected in 16/25 patients; Anelloviridae (24/25) and human pegivirus-1 (9/25) were the most prevalent. In 7 patients with fatal outcomes, viral sequences not assessed by routine investigations were identified with mNGS and confirmed by RT-PCR. These cases included Usutu virus (1), rubella virus (1 vaccine strain and 1 wild-type), novel human astrovirus (HAstV) MLB2 (1), classic HAstV (1), human polyomavirus 6 and 7 (2), cutavirus (1), and bufavirus (1). Conclusions Clinically unrecognized viral infections were identified in 28% of highly immunocompromised allo-HSCT recipients with steroid-refractory/dependent GvHD in consecutive samples. These identified viruses have all been previously described in humans, but have poorly understood clinical significance. Rubella virus identification raises the possibility of re-emergence from past infections or vaccinations, or re-infection.

SARS-CoV-2 Sequence Characteristics of COVID-19 Persistence and Reinfection

Clinical Infectious Diseases ◽

10.1093/cid/ciab380 ◽

2021 ◽

Author(s):

Manish C Choudhary ◽

Charles R Crain ◽

Xueting Qiu ◽

William Hanage ◽

Jonathan Z Li

Keyword(s):

Persistent Infection ◽

Viral Evolution ◽

Phylogenetic Reconstruction ◽

Geographic Region ◽

Antibody Treatment ◽

Viral Genomes ◽

Monoclonal Antibody Treatment ◽

Viral Sequences ◽

Sequence Characteristics ◽

Baseline Health

Abstract Background Both SARS-CoV-2 reinfection and persistent infection have been reported, but sequence characteristics in these scenarios have not been described. We assessed published cases of SARS-CoV-2 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection. Methods A systematic review of PubMed was conducted to identify cases of SARS-CoV-2 reinfection and persistence with available sequences. Nucleotide and amino acid changes in the reinfecting sequence were compared to both the initial and contemporaneous community variants. Time-measured phylogenetic reconstruction was performed to compare intra-host viral evolution in persistent SARS-CoV-2 to community-driven evolution. Results Twenty reinfection and nine persistent infection cases were identified. Reports of reinfection cases spanned a broad distribution of ages, baseline health status, reinfection severity, and occurred as early as 1.5 months or >8 months after the initial infection. The reinfecting viral sequences had a median of 17.5 nucleotide changes with enrichment in the ORF8 and N genes. The number of changes did not differ by the severity of reinfection and reinfecting variants were similar to the contemporaneous sequences circulating in the community. Patients with persistent COVID-19 demonstrated more rapid accumulation of sequence changes than seen with community-driven evolution with continued evolution during convalescent plasma or monoclonal antibody treatment. Conclusions Reinfecting SARS-CoV-2 viral genomes largely mirror contemporaneous circulating sequences in that geographic region, while persistent COVID-19 has been largely described in immunosuppressed individuals and is associated with accelerated viral evolution.

Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm

Bioinformatics ◽

10.1093/bioinformatics/btaa179 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3669-3679 ◽

Cited By ~ 3

Author(s):

Can Firtina ◽

Jeremie S Kim ◽

Mohammed Alser ◽

Damla Senol Cali ◽

A Ercument Cicek ◽

...

Keyword(s):

Genome Analysis ◽

Supplementary Information ◽

Third Generation ◽

Sequencing Technology ◽

Base Pairs ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing ◽

Large Genomes

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.
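
The decoding step described above, the Viterbi algorithm applied to a trained model, can be sketched on a plain two-state hidden Markov model. This is not Apollo's code and not a profile HMM: the states (`correct`, `error`), the observation symbols and every probability below are hypothetical values chosen only to make the example concrete.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations (log space)."""
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor for state s at this step.
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][obs]))
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Hypothetical model: is an assembly position correct, given how well reads support it?
states = ["correct", "error"]
start_p = {"correct": 0.6, "error": 0.4}
trans_p = {"correct": {"correct": 0.7, "error": 0.3},
           "error": {"correct": 0.4, "error": 0.6}}
emit_p = {"correct": {"agree": 0.5, "mixed": 0.4, "disagree": 0.1},
          "error": {"agree": 0.1, "mixed": 0.3, "disagree": 0.6}}

path = viterbi(["agree", "mixed", "disagree"], states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences; it assumes every probability used is strictly positive, which holds for the toy values here.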
