BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing

Abstract: The future of human genomics is one that seeks to resolve the entirety of genetic variation through sequencing. Using genomics for medical purposes requires cost-efficient and accurate base calling, long-range haplotyping capability, and reliable calling of structural variants. Short-read sequencing has led the development towards such a future but has struggled to meet the latter two of these needs. To address this limitation, we developed a technology that preserves the molecular origin of short sequencing reads, with an insignificant increase to sequencing costs. We demonstrate a novel library preparation method for high-throughput barcoding of short reads where millions of random barcodes can be used to reconstruct megabase-scale phase blocks.

High-throughput Interpretation of Killer-cell Immunoglobulin-like Receptor Short-read Sequencing Data with PING

DOI: 10.1101/2021.03.24.436770 (2021)

Author(s): Wesley Marin, Ravi Dandekar, Danillo G. Augusto, Tasneem Yusufali, Bianca Heyn, ...

Keyword(s): High Resolution, High Throughput, Copy Number, Killer Cell, Gene Copy Number, Gene Copy, Sequencing Data, Short Read, Short Read Sequencing, KIR Genes

The killer-cell immunoglobulin-like receptor (KIR) complex on chromosome 19 encodes receptors that modulate the activity of natural killer cells, and variation in these genes has been linked to infectious and autoimmune disease, as well as having bearing on pregnancy and transplant outcomes. The medical relevance and high variability of KIR genes make short-read sequencing an attractive technology for interrogating the region, providing a high-throughput, high-fidelity sequencing method that is cost-effective. However, because this gene complex is characterized by extensive nucleotide polymorphism, structural variation including gene fusions and deletions, and a high level of homology between genes, its interrogation at high resolution has been thwarted by bioinformatic challenges, with most studies limited to examining presence or absence of specific genes. Here, we present the PING (Pushing Immunogenetics to the Next Generation) pipeline, which incorporates empirical data, novel alignment strategies and a custom alignment processing workflow to enable high-throughput KIR sequence analysis from short-read data. PING provides KIR gene copy number classification functionality for all KIR genes through use of a comprehensive alignment reference. The gene copy number determined per individual enables an innovative genotype determination workflow using genotype-matched references. Together, these methods address the challenges imposed by the structural complexity and overall homology of the KIR complex. To assess the accuracy of copy number and genotype determination, we applied PING to European and African validation cohorts and a synthetic dataset. PING demonstrated exceptional copy number determination performance across all datasets and robust genotype determination performance. Finally, an investigation into discordant genotypes for the synthetic dataset provides insight into misaligned reads, advancing our understanding of the interpretation of short-read sequencing data in complex genomic regions. PING promises to support a new era of studies of KIR polymorphism, delivering high-resolution KIR genotypes that are highly accurate, enabling high-quality, high-throughput KIR genotyping for disease and population studies.
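
As a rough illustration of what copy number classification from short-read data can involve, the sketch below estimates a gene's copy number from the ratio of its mean read depth to that of a reference locus. This is a generic read-depth heuristic and not the method PING uses; the abstract only states that PING relies on a comprehensive alignment reference. The function name, the reference locus, and the assumption that the reference is present in two copies are all hypothetical.

```python
def estimate_copy_number(gene_depth, reference_depth, reference_copies=2):
    """Estimate gene copy number from a read-depth ratio.

    Assumption (not from the PING abstract): the reference locus is
    present in `reference_copies` copies and is sequenced with the same
    efficiency as the gene of interest.
    """
    if reference_depth <= 0:
        raise ValueError("reference_depth must be positive")
    ratio = gene_depth / reference_depth
    # Scale by the known copy number of the reference, then round to
    # the nearest whole number of copies.
    return round(ratio * reference_copies)


# Hypothetical mean depths for three genes against a two-copy reference.
depths = {"geneA": 31.0, "geneB": 92.0, "geneC": 0.0}
copies = {gene: estimate_copy_number(d, 60.0) for gene, d in depths.items()}
```

A real pipeline would also need to handle reads that align ambiguously between homologous genes, which is the difficulty the abstract highlights.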

A Family-Based Probabilistic Method for Capturing De Novo Mutations from High-Throughput Short-Read Sequencing Data

Statistical Applications in Genetics and Molecular Biology, Vol 11 (2), 2012. DOI: 10.2202/1544-6115.1713. Cited By ~ 11

Author(s): Reed A. Cartwright, Julie Hussin, Jonathan E. M. Keebler, Eric A. Stone, Philip Awadalla

Keyword(s): High Throughput, De Novo, Probabilistic Method, Sequencing Data, De Novo Mutations, Short Read, Short Read Sequencing, Family Based

Highly accurate barcode and UMI error correction using dual nucleotide dimer blocks allows direct single-cell nanopore transcriptome sequencing

DOI: 10.1101/2021.01.18.427145 (2021)

Author(s): Martin Philpott, Jonathan Watson, Anjan Thakurta, Tom Brown, ...

Keyword(s): Single Cell, Nanopore Sequencing, Short Read, Short Read Sequencing, Single Cell Sequencing, Base Calling, Novel Approach, Long Read, First Time, Insight Into

Abstract: Droplet-based single-cell sequencing techniques have provided unprecedented insight into cellular heterogeneities within tissues. However, these approaches only allow for the measurement of the distal parts of a transcript following short-read sequencing. Therefore, splicing and sequence diversity information is lost for the majority of the transcript. The application of long-read Nanopore sequencing to droplet-based methods is challenging because of the low base-calling accuracy currently associated with Nanopore sequencing. Although several approaches that use additional short-read sequencing to error-correct the barcode and UMI sequences have been developed, these techniques are limited by the requirement to sequence a library using both short- and long-read sequencing. Here we introduce a novel approach termed single-cell Barcode UMI Correction sequencing (scBUC-seq) to efficiently error-correct barcode and UMI oligonucleotide sequences synthesized by using blocks of dimeric nucleotides. The method can be applied to correct either short-read or long-read sequencing, allowing users to recover more reads per cell and permitting direct single-cell Nanopore sequencing for the first time. We illustrate our method by using species-mixing experiments to evaluate barcode assignment accuracy, and we evaluate differential isoform usage and fusion transcripts using myeloma and sarcoma cell line models.
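
To make the idea of dimer-block error correction concrete, here is a minimal sketch under one reading of the abstract: each barcode base is assumed to be synthesized twice in a row (so `ACGT` is read as `AACCGGTT`), and a mismatch within a pair marks that position as uncertain. The encoding details, the function names, and the use of a known barcode whitelist are assumptions for illustration and are not taken from the scBUC-seq paper. The sketch handles substitutions only, not the insertions and deletions that are also common in Nanopore reads.

```python
def collapse_dimers(seq):
    """Turn a dimer-encoded read into per-position sets of candidate bases.

    Assumes each original base was written as two identical bases.
    A pair that disagrees yields both bases as candidates.
    """
    if len(seq) % 2 != 0:
        raise ValueError("dimer-encoded sequence must have even length")
    candidates = []
    for i in range(0, len(seq), 2):
        first, second = seq[i], seq[i + 1]
        candidates.append({first} if first == second else {first, second})
    return candidates


def correct_barcode(seq, whitelist):
    """Return the unique whitelist barcode compatible with the read, else None."""
    candidates = collapse_dimers(seq)
    matches = [
        barcode
        for barcode in whitelist
        if len(barcode) == len(candidates)
        and all(base in cand for base, cand in zip(barcode, candidates))
    ]
    # Only accept an unambiguous assignment.
    return matches[0] if len(matches) == 1 else None


whitelist = {"ACGT", "TTGA"}          # hypothetical known barcodes
clean = correct_barcode("AACCGGTT", whitelist)      # no errors
repaired = correct_barcode("AACTGGTT", whitelist)   # one substitution in pair 2
rejected = correct_barcode("AATTGGTT", whitelist)   # consistent pairs, unknown barcode
```

Requiring a single compatible whitelist entry is a conservative design choice: a read that could match two barcodes is discarded instead of being assigned arbitrarily.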

Characterization and mitigation of fragmentation enzyme-induced dual stranded artifacts

NAR Genomics and Bioinformatics, Vol 2 (4), 2020. DOI: 10.1093/nargab/lqaa070

Author(s): Thomas Gregory, Apollinaire Ngankeu, Shelley Orwick, Esko A Kautto, Jennifer A Woyach, ...

Keyword(s): Nucleic Acid, Rare Variant, High Throughput, Allele Frequencies, Optimal Sampling, Artifact Detection, Short Read, Short Read Sequencing, Variant Discovery, Downstream Analysis

Abstract: High-throughput short-read sequencing relies on fragmented DNA for optimal sampling of input nucleic acid. Several vendors now offer proprietary enzyme cocktails as a cheaper and more streamlined method of fragmentation when compared to acoustic shearing. We have discovered that these enzymes induce the formation of library molecules containing regions of nearby DNA from opposite strands. Sequencing reads derived from these molecules can lead to artifact-derived variant calls appearing at variant allele frequencies <5%. We present Fragmentation Artifact Detection and Elimination (FADE), software to remove these artifacts from mapped reads and mitigate artifact-related effects on downstream analysis. We find that the artifacts principally affect downstream analyses that are sensitive to a 1–3% artifact bias in the sequencing reads, such as targeted resequencing and rare variant discovery.
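
The arithmetic behind the "<5%" figure can be sketched as follows: variant allele frequency (VAF) is the fraction of reads at a site that support the alternate allele, so a few percent of artifact reads is enough to produce a low-frequency call. The snippet is only an illustration of that calculation and does not reproduce FADE, which operates on mapped reads; the record layout, function names, and example counts are hypothetical.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """Fraction of reads at a site that support the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def flag_low_vaf(calls, threshold=0.05):
    """Return calls whose VAF falls below the threshold.

    Such calls lie in the range where, per the abstract, artifact-derived
    variants appear, so they may deserve closer inspection.
    """
    return [
        call
        for call in calls
        if variant_allele_frequency(call["alt"], call["depth"]) < threshold
    ]


# Hypothetical calls: 12 of 400 reads (3%) versus 180 of 400 reads (45%).
calls = [
    {"id": "v1", "alt": 12, "depth": 400},
    {"id": "v2", "alt": 180, "depth": 400},
]
suspect = flag_low_vaf(calls)
```

A low VAF alone does not show that a call is an artifact; genuine rare variants occupy the same range, which is why the abstract singles out rare variant discovery as the most affected analysis.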
