VCFCons: a versatile VCF-based consensus sequence generator for small genomes

Abstract We developed VCFCons to address the urgent need for a robust consensus sequence generator for SARS-CoV-2 viral surveillance, which presented several unique requirements, including: (a) low-coverage areas should be marked with ‘N’s, and (b) low-frequency or suspicious variant calls need to be filtered. We found that, while some existing tools such as bcftools can generate the desired consensus sequence, doing so requires multiple filtering steps and additional scripting. VCFCons generates consensus sequences from variant calls in VCF format, with versatile filtering criteria based on coverage and estimated variant frequency. We applied VCFCons to the Labcorp SARS-CoV-2 sequencing data and showed that it generated correct consensus sequences that were successfully submitted to GISAID and NCBI. We hope the community will find value in this tool and aim to continue developing VCFCons to handle more complex viral data in the future.
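
The two requirements named in the abstract (masking low-coverage positions with 'N' and filtering low-frequency variant calls) can be illustrated with a minimal Python sketch. This is not the VCFCons implementation: the `Variant` record, the `build_consensus` function, and the `min_depth` and `min_freq` thresholds are hypothetical names and values chosen for illustration, and only single-nucleotide substitutions are handled.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical, simplified stand-in for one VCF record."""
    pos: int     # 1-based position, as in VCF
    ref: str     # reference allele
    alt: str     # alternate allele
    freq: float  # estimated alternate-allele frequency (0..1)


def build_consensus(reference: str, depths: List[int], variants: List[Variant],
                    min_depth: int = 10, min_freq: float = 0.5) -> str:
    """Apply frequency-filtered SNVs, then mask low-coverage positions with 'N'.

    min_depth and min_freq are illustrative defaults, not values from the paper.
    """
    if len(reference) != len(depths):
        raise ValueError("reference and depths must have the same length")

    seq = list(reference)
    for v in variants:
        # Indels are out of scope for this sketch; only SNVs are applied.
        if len(v.ref) != 1 or len(v.alt) != 1:
            continue
        i = v.pos - 1
        if reference[i] != v.ref:
            raise ValueError(f"REF mismatch at position {v.pos}")
        # Requirement (b): drop low-frequency calls.
        if v.freq >= min_freq:
            seq[i] = v.alt

    # Requirement (a): positions without enough coverage become 'N',
    # even if a variant was called there.
    for i, depth in enumerate(depths):
        if depth < min_depth:
            seq[i] = "N"
    return "".join(seq)


example = build_consensus(
    "ACGTACGT",
    [20, 20, 5, 20, 20, 20, 0, 20],
    [Variant(2, "C", "T", 0.9), Variant(5, "A", "G", 0.1), Variant(3, "G", "A", 0.95)],
)
print(example)
```

In this example the call at position 5 is dropped for low frequency, and positions 3 and 7 are masked for low depth regardless of any variant call at those sites.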

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call using a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
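
The family-grouping idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' toolset: it assumes reads arrive as hypothetical `(tag, strand, sequence)` tuples with strands already labelled "ab" or "ba", that all sequences in a family have equal length, and that the `min_family_size` and `min_fraction` thresholds are arbitrary illustrative values.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def majority_consensus(seqs: List[str], min_fraction: float = 0.7) -> str:
    """Single-strand consensus: per-column majority base, 'N' if support is weak."""
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


def duplex_consensus(reads: Iterable[Tuple[str, str, str]],
                     min_family_size: int = 3) -> Dict[str, str]:
    """Group reads into (tag, strand) families and combine both strands per tag."""
    families = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)

    # Families below the size threshold are dropped, which is one place
    # where reads are lost in a pipeline of this kind.
    sscs = {key: majority_consensus(seqs)
            for key, seqs in families.items()
            if len(seqs) >= min_family_size}

    dcs = {}
    for (tag, strand), forward in sscs.items():
        if strand != "ab":
            continue
        reverse = sscs.get((tag, "ba"))
        if reverse is None:
            continue  # no partner strand, so no duplex consensus
        # Keep a base only where both strand consensuses agree.
        dcs[tag] = "".join(a if a == b else "N" for a, b in zip(forward, reverse))
    return dcs


reads = (
    [("AAT", "ab", "ACGT")] * 3 + [("AAT", "ab", "ACGA")] + [("AAT", "ba", "ACGT")] * 3
    + [("TTC", "ab", "ACGT")] * 3 + [("TTC", "ba", "ACCT")] * 3
    + [("CCG", "ab", "ACGT")] * 3          # partner strand missing
    + [("GGA", "ab", "ACGT"), ("GGA", "ba", "ACGT")]  # families too small
)
dcs = duplex_consensus(reads)
print(dcs)
```

Tags "CCG" and "GGA" yield no duplex consensus here, which mirrors the kind of data loss the abstract describes for reads without a complete family.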

VERSO: a comprehensive framework for the inference of robust phylogenies and the quantification of intra-host genomic diversity of viral samples

10.1101/2020.04.22.044404 ◽

2020 ◽

Cited By ~ 5

Author(s):

Daniele Ramazzotti ◽

Fabrizio Angaroni ◽

Davide Maspero ◽

Carlo Gambacorti-Passerini ◽

Marco Antoniotti ◽

...

Keyword(s):

Viral Evolution ◽

Genomic Diversity ◽

Contact Tracing ◽

Sequencing Data ◽

Spike Gene ◽

Consensus Sequences ◽

Viral Genomes ◽

Variant Frequency ◽

Algorithmic Strategy ◽

Comprehensive Framework

Summary We introduce VERSO, a two-step framework for the characterization of viral evolution from sequencing data of viral genomes, which improves over phylogenomic approaches for consensus sequences. VERSO exploits an efficient algorithmic strategy to return robust phylogenies from clonal variant profiles, even under sampling limitations. It then leverages variant frequency patterns to characterize the intra-host genomic diversity of samples, revealing undetected infection chains and pinpointing variants likely involved in homoplasies. On simulations, VERSO outperforms state-of-the-art tools for phylogenetic inference. Notably, the application to 6726 Amplicon and RNA-seq samples refines the estimation of SARS-CoV-2 evolution, while co-occurrence patterns of minor variants unveil undetected infection paths, which are validated with contact tracing data. Finally, the analysis of the SARS-CoV-2 mutational landscape uncovers a temporal increase of overall genomic diversity, and highlights variants transiting from minor to clonal state and homoplastic variants, some of which fall on the spike gene. Available at: https://github.com/BIMIB-DISCo/VERSO.

Genomic characterization and phylogenetic analysis of the first SARS-CoV-2 variants introduced in Lebanon

PeerJ ◽

10.7717/peerj.11015 ◽

2021 ◽

Vol 9 ◽

pp. e11015

Author(s):

Rita Feghali ◽

Georgi Merhi ◽

Aurelia Kwasiborski ◽

Veronique Hourdel ◽

Nada Ghosn ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Consensus Sequence ◽

Unknown Origin ◽

International Travel ◽

Molecular Diagnostic ◽

Diagnostic Tools ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Genetic Lineages ◽

Consensus Sequences

Background In December 2019, the COVID-19 pandemic initially erupted from a cluster of pneumonia cases of unknown origin in the city of Wuhan, China. Presently, it has almost reached 94 million cases worldwide. Lebanon, on the brink of economic collapse and with its healthcare system thrown into turmoil, has previously managed to cope with the initial SARS-CoV-2 wave. In this study, we sequenced 11 viral genomes from positive cases isolated between 2 February 2020 and 15 March 2020. Methods Sequencing data were quality controlled, consensus sequences were generated, and a maximum-likelihood tree was built with IQTREE v2. Genetic lineages were assigned with Pangolin v1.1.14. Single nucleotide variants (SNVs) were called from read files and manually curated from the consensus sequence alignment through JalView v2.11, and the interference of genomic mutations with molecular diagnostic tools was assessed with the CoV-GLUE pipeline. Phylogenetic analysis of whole genome sequences confirmed a multiple introduction scenario due to international travel. Results Three major lineages were identified as circulating in Lebanon in the studied period. The B.1 lineage (20A clade) was the most prominent, followed by the B.4 lineage (19A clade) and the B.1.1 lineage (20B clade). SNV analysis showed 15 novel mutations, of which only one was observed in the spike region.

Early detection and improved genomic surveillance of SARS-CoV-2 variants from deep sequencing data

10.1101/2021.12.14.21267810 ◽

2021 ◽

Author(s):

Daniele Ramazzotti ◽

Davide Maspero ◽

Fabrizio Angaroni ◽

Marco Antoniotti ◽

Rocco Piazza ◽

...

Keyword(s):

Early Detection ◽

Deep Sequencing ◽

Low Frequency ◽

Sequencing Data ◽

Robust Identification ◽

Consensus Sequences ◽

Deep Sequencing Data ◽

Automated Support ◽

Definition Of ◽

Mutational Processes

In defining fruitful strategies to counter the worldwide spread of SARS-CoV-2, maximum effort must be devoted to the early detection of dangerous variants. Effective help to this end comes from the analysis of deep sequencing data of viral samples, which are typically discarded after the creation of consensus sequences. Indeed, only with deep sequencing data is it possible to identify intra-host low-frequency mutations, which are a direct footprint of mutational processes that may eventually lead to the origination of functionally advantageous variants. Accordingly, a timely and statistically robust identification of such mutations might inform political decision-making significantly earlier than standard analyses based on consensus sequences. To support our claim, we here present the largest study to date of SARS-CoV-2 deep sequencing data, which involves 220,788 high quality samples, collected over 20 months from 137 distinct studies. Importantly, we show that a relevant number of spike and nucleocapsid mutations of interest associated with the most widely circulating variants, including Beta, Delta and Omicron, might have been intercepted several months in advance, possibly leading to different public-health decisions. In addition, we show that a refined genomic surveillance system involving high- and low-frequency mutations might allow one to pinpoint possibly dangerous emerging mutation patterns, providing data-driven automated support to epidemiologists and virologists.

Accucopy: accurate and fast inference of allele-specific copy number alterations from low-coverage low-purity tumor sequencing data

BMC Bioinformatics ◽

10.1186/s12859-020-03924-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xinping Fan ◽

Guanghao Luo ◽

Yu S. Huang

Keyword(s):

Copy Number ◽

Bayesian Learning ◽

Kernel Smoothing ◽

Gaussian Mixture ◽

Copy Number Alterations ◽

Sequencing Data ◽

Copy Numbers ◽

Allele Specific ◽

Tumor Sequencing ◽

Low Coverage

Abstract Background Copy number alterations (CNAs), due to their large impact on the genome, have been an important contributing factor to oncogenesis and metastasis. Detecting genomic alterations from the shallow-sequencing data of a low-purity tumor sample remains a challenging task. Results We introduce Accucopy, a method to infer total copy numbers (TCNs) and allele-specific copy numbers (ASCNs) from challenging low-purity and low-coverage tumor samples. Accucopy adopts many robust statistical techniques such as kernel smoothing of coverage differentiation information to discern signals from noise and combines ideas from time-series analysis and the signal-processing field to derive a range of estimates for the period in a histogram of coverage differentiation information. Statistical learning models such as the tiered Gaussian mixture model, the expectation–maximization algorithm, and sparse Bayesian learning were customized and built into the model. Accucopy is implemented in C++/Rust, packaged in a docker image, and supports non-human samples; more at http://www.yfish.org/software/. Conclusions We describe Accucopy, a method that can predict both TCNs and ASCNs from low-coverage low-purity tumor sequencing data. Through comparative analyses in both simulated and real-sequencing samples, we demonstrate that Accucopy is more accurate than Sclust, ABSOLUTE, and Sequenza.

Publisher Correction: Efficient phasing and imputation of low-coverage sequencing data using large reference panels

Nature Genetics ◽

10.1038/s41588-021-00788-0 ◽

2021 ◽

Author(s):

Simone Rubinacci ◽

Diogo M. Ribeiro ◽

Robin J. Hofmeister ◽

Olivier Delaneau

Keyword(s):

Sequencing Data ◽

Low Coverage

Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btq343 ◽

2010 ◽

Vol 26 (17) ◽

pp. 2101-2108 ◽

Cited By ~ 27

Author(s):

Jiří Macas ◽

Pavel Neumann ◽

Petr Novák ◽

Jiming Jiang

Keyword(s):

Large Scale ◽

Rice Genome ◽

Supplementary Information ◽

Sequencing Data ◽

Satellite Repeat ◽

Frequency Spectra ◽

Consensus Sequences ◽

Chip Sequencing ◽

Conserved Sequence ◽

Centromeric Satellite

Abstract Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
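
As a rough illustration of alignment-free k-mer frequency statistics, the Python sketch below counts k-mers across a set of reads, normalizes them into a frequency spectrum, and compares two spectra. The function names are hypothetical, and the distance used (total variation distance) is chosen here only for illustration; the abstract does not state which divergence measure the authors used.

```python
from collections import Counter
from typing import Dict, Iterable

VALID_BASES = set("ACGT")


def kmer_counts(seqs: Iterable[str], k: int) -> Counter:
    """Count every overlapping k-mer in the reads, skipping ambiguous bases."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def kmer_frequencies(counts: Counter) -> Dict[str, float]:
    """Normalize counts so that the spectrum sums to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def spectrum_distance(f1: Dict[str, float], f2: Dict[str, float]) -> float:
    """Total variation distance between two spectra (0 = identical, 1 = disjoint)."""
    kmers = set(f1) | set(f2)
    return 0.5 * sum(abs(f1.get(k, 0.0) - f2.get(k, 0.0)) for k in kmers)


counts = kmer_counts(["ACGTAC"], k=2)
freqs = kmer_frequencies(counts)
print(counts.most_common(1))
print(spectrum_distance(freqs, freqs))
```

Because no alignment is involved, the same counting pass works on short reads from any platform, which is the property the abstract highlights for large satellite repeat datasets.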

Shear stress induces hepatocyte PAI-1 gene expression through cooperative Sp1/Ets-1 activation of transcription

AJP Gastrointestinal and Liver Physiology ◽

10.1152/ajpgi.00467.2005 ◽

2006 ◽

Vol 291 (1) ◽

pp. G26-G34 ◽

Cited By ~ 30

Author(s):

Hideki Nakatsuka ◽

Takaaki Sokabe ◽

Kimiko Yamamoto ◽

Yoshinobu Sato ◽

Katsuyoshi Hatakeyama ◽

...

Keyword(s):

Gene Expression ◽

Shear Stress ◽

Consensus Sequence ◽

Early Gene ◽

Immediate Early ◽

Mrna Levels ◽

Consensus Sequences ◽

Static Conditions ◽

Stress Dependent ◽

Pai 1

Partial hepatectomy causes hemodynamic changes that increase portal blood flow in the remaining lobe, where the expression of immediate-early genes, including plasminogen activator inhibitor-1 (PAI-1), is induced. We hypothesized that a hyperdynamic circulatory state occurring in the remaining lobe induces immediate-early gene expression. In this study, we investigated whether the mechanical force generated by flowing blood, shear stress, induces PAI-1 expression in hepatocytes. When cultured rat hepatocytes were exposed to flow, PAI-1 mRNA levels began to increase within 3 h, peaked at levels significantly higher than the static control levels, and then gradually decreased. The flow-induced PAI-1 expression was shear stress dependent rather than shear rate dependent and accompanied by increased hepatocyte production of PAI-1 protein. Shear stress increased PAI-1 transcription but did not affect PAI-1 mRNA stability. Functional analysis of the 2.1-kb PAI-1 5′-promoter indicated that a 278-bp segment containing transcription factor Sp1 and Ets-1 consensus sequences was critical to the shear stress-dependent increase of PAI-1 transcription. Mutations of both the Sp1 and Ets-1 consensus sequences, but not of either one alone, markedly prevented basal PAI-1 transcription and abolished the response of the PAI-1 promoter to shear stress. EMSA and chromatin immunoprecipitation assays showed binding of Sp1 and Ets-1 to each consensus sequence under static conditions, which increased in response to shear stress. In conclusion, hepatocyte PAI-1 expression is flow sensitive and transcriptionally regulated by shear stress via cooperative interactions between Sp1 and Ets-1.

Control of artefactual variation in reported inter-sample relatedness during clinical use of a Mycobacterium tuberculosis sequencing pipeline

10.1101/252460 ◽

2018 ◽

Author(s):

David H Wyllie ◽

Nicholas Sanderson ◽

Richard Myers ◽

Tim Peto ◽

Esther Robinson ◽

...

Keyword(s):

Consensus Sequence ◽

Read Depth ◽

Pairwise Distance ◽

Contact Tracing ◽

Clinical Samples ◽

Bacterial Dna ◽

Consensus Sequences ◽

Minor Variant ◽

Validation Set ◽

Genomic Regions

Abstract Contact tracing requires reliable identification of closely related bacterial isolates. When we noticed the reporting of artefactual variation between M. tuberculosis isolates during routine next generation sequencing of Mycobacterium spp, we investigated its basis in 2,018 consecutive M. tuberculosis isolates. In the routine process used, clinical samples were decontaminated and inoculated into broth cultures; from positive broth cultures DNA was extracted, sequenced, reads mapped, and consensus sequences determined. We investigated the process of consensus sequence determination, which selects the most common nucleotide at each position. Having determined the high-quality read depth and depth of minor variants across 8,006 M. tuberculosis genomic regions, we quantified the relationship between the minor variant depth and the amount of non-Mycobacterial bacterial DNA, which originates from commensal microbes killed during sample decontamination. In the presence of non-Mycobacterial bacterial DNA, we found significant increases in minor variant frequencies of more than 1.5 fold in 242 regions covering 5.1% of the M. tuberculosis genome. Included within these were four high variation regions strongly influenced by the amount of non-Mycobacterial bacterial DNA. Excluding these four regions from pairwise distance comparisons reduced biologically implausible variation from 5.2% to 0% in an independent validation set derived from 226 individuals. Thus, we have demonstrated an approach identifying critical genomic regions contributing to clinically relevant artefactual variation in bacterial similarity searches. The approach described monitors the outputs of the complex multi-step laboratory and bioinformatics process, allows periodic process adjustments, and will have application to quality control of routine bacterial genomics.
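
The consensus rule mentioned above (select the most common nucleotide at each position) and the minor variant frequency it leaves behind can be sketched as follows. This is an illustrative Python example, not the pipeline described in the preprint: the `call_position` function, the per-position count dictionaries, and the `min_depth` threshold are all assumptions made for the sketch.

```python
from typing import Dict, List, Tuple


def call_position(counts: Dict[str, int], min_depth: int = 5) -> Tuple[str, int, float]:
    """Return (consensus base, depth, minor variant frequency) for one position.

    The consensus base is the most common nucleotide; everything else at the
    position counts as minor variant depth. Ties are broken by dict order.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N", depth, 0.0
    base, top = max(counts.items(), key=lambda item: item[1])
    return base, depth, (depth - top) / depth


def call_consensus(pileup: List[Dict[str, int]], min_depth: int = 5) -> str:
    """Concatenate per-position consensus calls into a sequence."""
    return "".join(call_position(counts, min_depth)[0] for counts in pileup)


pileup = [
    {"A": 18, "C": 2},   # clean call with a small minor variant fraction
    {"G": 11, "T": 9},   # consensus is G, but nearly half the reads disagree
    {"C": 2},            # too shallow to call
]
print(call_consensus(pileup))
print([round(call_position(c)[2], 2) for c in pileup])
```

The second position shows why minor variant depth matters: a majority rule still emits a confident-looking base even when a large fraction of reads, for instance from contaminating DNA, supports a different nucleotide.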

Sensitive detection of tumor mutations from blood and its application to immunotherapy prognosis

10.1101/2019.12.31.19016253 ◽

2020 ◽

Author(s):

Shuo Li ◽

Zorawar Noor ◽

Weihua Zeng ◽

Xiaohui Ni ◽

Zuyang Yuan ◽

...

Keyword(s):

Low Frequency ◽

Sequencing Data ◽

Real Patient ◽

Single Nucleotide ◽

Lung Cancer Patients ◽

Wide Range ◽

Single Nucleotide Variations ◽

Innovative Techniques ◽

Error Suppression ◽

Mutation Profiling

Abstract Liquid biopsy using cell-free DNA (cfDNA) is attractive for a wide range of clinical applications, including cancer detection, localization, and monitoring. However, developing these applications requires precise and sensitive calling of somatic single nucleotide variations (SNVs) from cfDNA sequencing data. To date, no SNV caller addresses all the special challenges of cfDNA to provide reliable results. Here we present cfSNV, a revolutionary somatic SNV caller with five innovative techniques to overcome and exploit the unique properties of cfDNA. cfSNV provides hierarchical mutation profiling, thanks to cfDNA’s complete coverage of the clonal landscape, and multi-layer error suppression. In both simulated datasets and real patient data, we demonstrate that cfSNV is superior to existing tools, especially for low-frequency somatic SNVs. We also show how the five novel techniques contribute to its performance. Further, we demonstrate a clinical application using cfSNV to select non-small-cell lung cancer patients for immunotherapy treatment.
