A Novel SARS-CoV-2 Viral Sequence Bioinformatic Pipeline Has Found Genetic Evidence That the Viral 3′ Untranslated Region (UTR) Is Evolving and Generating Increased Viral Diversity

An unprecedented amount of SARS-CoV-2 sequencing has been performed; however, novel bioinformatic tools to cope with and process these large datasets are needed. Here, we have devised a bioinformatic pipeline that takes SARS-CoV-2 genome sequencing data in FASTA/FASTQ format as input and outputs a single Variant Call Format (VCF) file that can be processed to obtain variant annotations and perform downstream population genetic testing. As proof of concept, we have analyzed over 229,000 SARS-CoV-2 viral sequences collected up until November 30, 2020. We have identified over 39,000 variants worldwide, with increased polymorphisms spanning the ORF3a gene as well as the 3′ untranslated region (UTR), specifically in the conserved stem loop region of SARS-CoV-2, which is accumulating greater observed viral diversity relative to chance variation. Our analysis pipeline has also discovered the existence of SARS-CoV-2 hypermutation at low frequency (in less than 2% of genomes), likely arising through host immune responses and not due to sequencing errors. Among annotated nonsense variants with a population frequency over 1%, recurrent inactivation of the ORF8 gene was found. This inactivation is present in the newly identified B.1.1.7 SARS-CoV-2 lineage that originated in the United Kingdom. Almost all VOC-containing genomes possess one stop codon in the ORF8 gene (Q27∗); however, 13% of these genomes also contain another stop codon (K68∗), suggesting that ORF8 loss does not interfere with SARS-CoV-2 spread and may play a role in its increased virulence. We have developed this computational pipeline to assist researchers in the rapid analysis and characterization of SARS-CoV-2 variation.
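
The abstract above reports nonsense variants such as Q27∗ and K68∗, where a single base change turns an amino acid codon into a stop codon. As a minimal sketch of how such a change could be classified, the Python snippet below uses the standard genetic code. It is not the authors' annotation step: the function name `classify_codon_change` is hypothetical, and the example codons were chosen only because they encode Q and K (the abstract does not give the actual reference codons).

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon substitution (hypothetical helper, DNA alphabet)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # amino acid replaced by a stop codon
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Illustrative codons only: CAA encodes Q, AAA encodes K, TAA is a stop.
print(classify_codon_change("CAA", "TAA"))
print(classify_codon_change("AAA", "TAA"))
```

In a real pipeline the reference and alternate codons would come from the annotated VCF rather than being typed in by hand.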

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides different summary data that categorize the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
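
The family grouping described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' toolset: it assumes reads are already oriented and of equal length, that tags have been normalized so both strands of a molecule share one tag, and that strands are labelled "ab" and "ba". The function names and the `min_family_size` default are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Per-position majority vote; emits 'N' when the top two bases tie."""
    out = []
    for column in zip(*seqs):
        top = Counter(column).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            out.append("N")
        else:
            out.append(top[0][0])
    return "".join(out)


def duplex_consensus(reads, min_family_size=3):
    """reads: iterable of (tag, strand, sequence) tuples.

    Returns (sscs, dcs): single-strand consensus per (tag, strand) family
    and duplex consensus per tag where both strand families exist.
    """
    families = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)

    # Families below the size threshold are dropped; this is one place
    # where reads "without a family" are lost.
    sscs = {
        key: consensus(seqs)
        for key, seqs in families.items()
        if len(seqs) >= min_family_size
    }

    dcs = {}
    for (tag, strand), forward in sscs.items():
        if strand != "ab":
            continue
        reverse = sscs.get((tag, "ba"))
        if reverse is None:
            continue  # unpaired family: no duplex consensus
        # Keep a base only where both strands agree.
        dcs[tag] = "".join(a if a == b else "N" for a, b in zip(forward, reverse))
    return sscs, dcs


reads = (
    [("AAT", "ab", "ACGT")] * 3
    + [("AAT", "ba", "ACGT"), ("AAT", "ba", "ACGT"), ("AAT", "ba", "ACTT")]
    + [("CCG", "ab", "ACGT")]  # singleton, discarded by the size filter
)
sscs, dcs = duplex_consensus(reads)
print(dcs)
```

Counting how many reads fall out at the size filter and at the pairing step gives a rough picture of the data loss the abstract refers to.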

smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers

10.1101/281659 ◽

2018 ◽

Cited By ~ 4

Author(s):

Chang Xu ◽

Xiujing Gu ◽

Raghavendra Padmanabhan ◽

Zhong Wu ◽

Quan Peng ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Targeted Sequencing ◽

Superior Performance ◽

Sequencing Data ◽

Background Error ◽

Fundamental Limits ◽

Sequencing Errors ◽

Coding Regions ◽

Improved Accuracy

Motivation: Low-frequency DNA mutations are often confounded with technical artifacts from sample preparation and sequencing. With unique molecular identifiers (UMIs), most of the sequencing errors can be corrected. However, errors before UMI tagging, such as DNA polymerase errors during end-repair and the first PCR cycle, cannot be corrected with single-strand UMIs and impose fundamental limits to UMI-based variant calling. Results: We developed smCounter2, a UMI-based variant caller for targeted sequencing data and an upgrade from the current version of smCounter. Compared to smCounter, smCounter2 features a lower detection limit of 0.5%, better overall accuracy (particularly in non-coding regions), a consistent threshold that can be applied to both deep and shallow sequencing runs, and easier use via a Docker image and code for read pre-processing. We benchmarked smCounter2 against several state-of-the-art UMI-based variant calling methods using multiple datasets and demonstrated smCounter2’s superior performance in detecting somatic variants. At the core of smCounter2 is a statistical test to determine whether the allele frequency of the putative variant is significantly above the background error rate, which was carefully modeled using an independent dataset. The improved accuracy in non-coding regions was mainly achieved using novel repetitive region filters that were specifically designed for UMI data. Availability: The entire pipeline is available at https://github.com/qiaseq/qiaseq-dna under MIT license.
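
The idea of testing whether an allele frequency is significantly above a background error rate can be illustrated with a one-sided binomial test. This is a swapped-in, simplified stand-in, not smCounter2's actual statistical model, which the abstract does not specify. The function names, the example background rate, and the `alpha` threshold are all assumptions made for illustration.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


def is_above_background(alt_umis, total_umis, background_rate, alpha=1e-6):
    """Return (call, p_value); alpha is a hypothetical threshold."""
    p_value = binomial_tail(alt_umis, total_umis, background_rate)
    return p_value < alpha, p_value


# 10 of 2000 UMIs support the variant (0.5% allele frequency)
# against an assumed background error rate of 1e-4 per UMI.
print(is_above_background(10, 2000, 1e-4))
# A single supporting UMI is compatible with background noise.
print(is_above_background(1, 2000, 1e-4))
```

Counting UMIs rather than raw reads is what lets a test of this kind ignore most errors introduced after tagging.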

Increased yields of duplex sequencing data by a series of quality control tools

10.1101/864835 ◽

2019 ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Tool Set ◽

Duplex Sequencing

Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics tool-set that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain PCR and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which result in unpaired families that do not form DCS. Finally, we also developed a tool called Variant Analyzer that re-examines variant calls from raw reads and provides different summary data that categorize the confidence level of a variant call by a tier-based system. We demonstrate that this tool identified false positive variants tagged by the tier-based classification. Furthermore, with this tool we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.

MAC5, an RNA-binding protein, protects pri-miRNAs from SERRATE-dependent exoribonuclease activities

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2008283117 ◽

2020 ◽

Vol 117 (38) ◽

pp. 23982-23990 ◽

Cited By ~ 1

Author(s):

Shengjun Li ◽

Mu Li ◽

Kan Liu ◽

Huimin Zhang ◽

Shuxin Zhang ◽

...

Keyword(s):

Rna Binding ◽

Rna Binding Protein ◽

Dual Role ◽

Mirna Biogenesis ◽

Core Component ◽

Stem Loop ◽

Loop Region ◽

Nuclear Rna ◽

Efficient Processing ◽

The Stability

MAC5 is a component of the conserved MOS4-associated complex. It plays critical roles in development and immunity. Here we report that MAC5 is required for microRNA (miRNA) biogenesis. MAC5 interacts with Serrate (SE), which is a core component of the microprocessor that processes primary miRNA transcripts (pri-miRNAs) into miRNAs and binds the stem-loop region of pri-miRNAs. MAC5 is essential for both the efficient processing and the stability of pri-miRNAs. Interestingly, the reduction of pri-miRNA levels in mac5 is partially caused by XRN2/XRN3, the nuclear-localized 5′-to-3′ exoribonucleases, and depends on SE. These results reveal that MAC5 plays a dual role in promoting pri-miRNA processing and stability through its interaction with SE and/or pri-miRNAs. This study also uncovers that pri-miRNAs need to be protected from nuclear RNA decay machinery, which is connected to the microprocessor.

DEEPGEN™: A Novel Variant Calling Assay for Low Frequency Variants

Genes ◽

10.3390/genes12040507 ◽

2021 ◽

Vol 12 (4) ◽

pp. 507

Author(s):

Bernd Timo Hermann ◽

Sebastian Pfeil ◽

Nicole Groenke ◽

Samuel Schaible ◽

Robert Kunze ◽

...

Keyword(s):

Cancer Detection ◽

Genetic Variants ◽

Liquid Biopsy ◽

Hot Spot ◽

Treatment Success ◽

Low Frequency ◽

Variant Calling ◽

Subsequent Treatment ◽

Precision Oncology ◽

Orthogonal Comparison

Detection of genetic variants in clinically relevant genomic hot-spot regions has become a promising application of next-generation sequencing technology in precision oncology. Effective personalized diagnostics requires the detection of variants with often very low frequencies. This can be achieved by targeted, short-read sequencing that provides high sequencing depths. Although rare genetic variants can contain crucial information for early cancer detection and subsequent treatment success, an inevitable level of background noise usually limits the accuracy of low frequency variant calling assays. To address this challenge, we developed DEEPGEN™, a variant calling assay intended for the detection of low frequency variants within liquid biopsy samples. We processed reference samples with validated mutations of known frequencies (0%–0.5%) to determine DEEPGEN™’s performance and minimal input requirements. Our findings confirm DEEPGEN™’s effectiveness in discriminating between signal and noise down to 0.09% variant allele frequency and an LOD(90) at 0.18%. Superior sensitivity was also confirmed by orthogonal comparison to a commercially available liquid biopsy-based assay for cancer detection.

A small deletion and an adjacent base exchange in a potential stem-loop region of the neurofibromatosis 1 gene

Human Genetics ◽

10.1007/bf00201726 ◽

1991 ◽

Vol 87 (6) ◽

Cited By ~ 14

Author(s):

Markus Stark ◽

Günter Assum ◽

Winfrid Krone

Keyword(s):

Neurofibromatosis 1 ◽

Stem Loop ◽

Base Exchange ◽

Small Deletion ◽

Loop Region

Assessment of SARS-CoV-2 genome sequencing: quality criteria and low frequency variants

Journal of Clinical Microbiology ◽

10.1128/jcm.00944-21 ◽

2021 ◽

Author(s):

Damien Jacot ◽

Trestan Pillonel ◽

Gilbert Greub ◽

Claire Bertelli

Keyword(s):

Sample Selection ◽

Low Frequency ◽

Pcr Amplification ◽

Quality Criterion ◽

Quality Criteria ◽

Sequencing Errors ◽

Sequencing Quality ◽

Quality Control Criteria ◽

Control Criteria ◽

Sequence Quality

Although many laboratories worldwide have developed their sequencing capacities in response to the need for SARS-CoV-2 genome-based surveillance of variants, only a few have reported quality criteria to ensure sequence quality before lineage assignment and submission to public databases. Hence, we aimed here to provide simple quality control criteria for SARS-CoV-2 sequencing to prevent erroneous interpretation of low quality or contaminated data. We retrospectively investigated 647 SARS-CoV-2 genomes obtained over ten tiled-amplicon sequencing runs. We extracted 26 potentially relevant metrics covering the entire workflow from sample selection to bioinformatics analysis. Based on data distribution, critical values were established for eleven selected metrics to prompt further quality investigations for problematic samples, in particular those with a low viral RNA quantity. Low frequency variants (<70% of supporting reads) can result from PCR amplification errors, sample cross-contamination or the presence of distinct SARS-CoV-2 genomes in the sequenced sample. The number and the prevalence of low frequency variants can be used as a robust quality criterion to identify possible sequencing errors or contaminations. Overall, we propose eleven metrics with fixed cutoff values as a simple tool to evaluate the quality of SARS-CoV-2 genomes, including cycle thresholds, mean depth, the proportion of the genome covered at least 10x, and the number of low frequency variants combined with mutation prevalence data.
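
The low frequency variant criterion above lends itself to a short sketch. The 70% supporting-read threshold and the 10x depth figure come from the abstract; everything else is an assumption made for illustration, including the function names, the tuple layout, and in particular the `max_low_frequency` cutoff, which is hypothetical and not one of the paper's fixed values.

```python
def low_frequency_variants(variants, min_fraction=0.70, min_depth=10):
    """Return (position, fraction) for variants supported by < min_fraction of reads.

    variants: iterable of (position, alt_reads, depth) tuples.
    Positions below min_depth are skipped as uninformative.
    """
    low = []
    for position, alt_reads, depth in variants:
        if depth < min_depth:
            continue
        fraction = alt_reads / depth
        if fraction < min_fraction:
            low.append((position, fraction))
    return low


def needs_review(variants, max_low_frequency=3):
    """Flag a sample whose low frequency variant count exceeds a hypothetical cutoff."""
    return len(low_frequency_variants(variants)) > max_low_frequency


# Made-up example sample: (position, reads supporting the variant, depth).
sample = [
    (241, 980, 1000),    # near-fixed variant
    (3037, 450, 1000),   # 45% support: low frequency
    (14408, 120, 400),   # 30% support: low frequency
    (23403, 4, 8),       # below 10x depth: skipped
]
print(low_frequency_variants(sample))
print(needs_review(sample))
```

A flagged sample is only a prompt for further investigation, since mixed infections can produce the same signal as contamination.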

Decona: From demultiplexing to consensus for Nanopore amplicon data

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65029 ◽

2021 ◽

Vol 4 ◽

Author(s):

Saskia Oosterbroek ◽

Karlijn Doorenspleet ◽

Reindert Nijland ◽

Lara Jansen

Keyword(s):

Sequence Data ◽

Variant Calling ◽

Environmental Dna ◽

Laptop Computer ◽

Consensus Sequences ◽

Sequencing Errors ◽

Blast Output ◽

Command Line Tool ◽

Microbial Symbionts ◽

User Friendly

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those produced by Illumina platforms. One of the major challenges for the analysis of these long Nanopore reads is the relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This is still a challenge for mixed samples such as metabarcoding environmental DNA, bulk DNA, mixed amplicon PCRs and contaminated samples, because sequence data have to be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-HIT algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon) and finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but also produces BLAST output if desired. The program can be used on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7500 nucleotides were successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. This included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and the separation of sponge sequences from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processes, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified. Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.

Polypyrimidine tract-binding protein interacts with the 3′ stem-loop region of Japanese encephalitis virus negative-strand RNA

Virus Research ◽

10.1016/j.virusres.2005.07.013 ◽

2006 ◽

Vol 115 (2) ◽

pp. 131-140 ◽

Cited By ~ 20

Author(s):

Seong Man Kim ◽

Yong Seok Jeong

Keyword(s):

Japanese Encephalitis Virus ◽

Japanese Encephalitis ◽

Binding Protein ◽

Encephalitis Virus ◽

Polypyrimidine Tract ◽

Stem Loop ◽

Negative Strand ◽

Polypyrimidine Tract Binding ◽

Polypyrimidine Tract Binding Protein ◽

Loop Region

Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽

10.1042/etls20190007 ◽

2019 ◽

Vol 3 (4) ◽

pp. 399-409 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Jae Hoon Sul

Keyword(s):

Quality Control ◽

Genome Sequencing ◽

Genetic Variants ◽

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Computational Approaches ◽

Sequencing Errors ◽

Human Genome Sequencing ◽

Number Of Individuals

Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
