Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications

2022 ◽  
Author(s):  
Kumeren Nadaraj Govender ◽  
David W Eyre

Culture-independent metagenomic detection of microbial species has the potential to provide rapid and precise real-time diagnostic results. However, it is potentially limited by sequencing and classification errors. We use simulated and real-world data to benchmark rates of species misclassification using 100 reference genomes for each of ten common bloodstream pathogens and six frequent blood culture contaminants (n=1600). Simulating with and without sequencing error for both the Illumina and Oxford Nanopore platforms, we evaluated commonly used classification tools, including Kraken2, Bracken, and Centrifuge, using mini (8 GB) and standard (30-50 GB) databases. Bracken with the standard database performed best: the median percentage of reads across both sequencing platforms correctly identified to the species level was 98.46% (IQR 93.0:99.3) [range 57.1:100]. For Kraken2 with a mini database, a commonly used combination, median species-level identification was 79.3% (IQR 39.1:88.8) [range 11.2:100]. Classification performance varied by species, with E. coli being more challenging to classify correctly (59.4% to 96.4% of reads assigned the correct species, varying by tool used). By filtering out shorter Nanopore reads (<3500 bp), we found performance similar or superior to Illumina sequencing, despite higher sequencing error rates. Misclassification was more common when the misclassified species had a higher average nucleotide identity to the true species. Our findings highlight that taxonomic misclassification of sequencing data occurs and varies by sequencing and analysis workflow. This "bioinformatic contamination" should be accounted for in metagenomic pipelines to ensure accurate results that can support clinical decision making.
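The two quantities this abstract benchmarks, read-length filtering and the percentage of reads classified to the correct species, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline; the function names and toy inputs are hypothetical.

```python
def filter_reads(reads, min_len=3500):
    """Keep only reads at least min_len bases long, mirroring the
    <3500 bp Nanopore filter described in the abstract."""
    return [r for r in reads if len(r) >= min_len]

def pct_correct(assignments, true_species):
    """Percentage of classified reads assigned to the true species."""
    if not assignments:
        return 0.0
    hits = sum(1 for s in assignments if s == true_species)
    return 100.0 * hits / len(assignments)

# Toy example: one short read is dropped by the length filter,
# and one of three classifier calls is a misclassification.
reads = ["A" * 5000, "C" * 1200, "G" * 4000]
kept = filter_reads(reads)  # drops the 1200 bp read
calls = ["E. coli", "E. coli", "Shigella flexneri"]
print(len(kept), pct_correct(calls, "E. coli"))
```

In a real workflow the length filter would run on FASTQ records and the assignments would come from a classifier report (e.g. Kraken2 output), but the accounting is the same.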

2020 ◽  
Author(s):  
Niko Popitsch ◽  
Sandra Preuner ◽  
Thomas Lion

Clinical decision making is increasingly guided by accurate and recurrent determination of the presence and frequency of (somatic) variants and their haplotype through panel sequencing of disease-relevant genomic regions. Haplotype calling (phasing), however, is difficult and error-prone unless variants are located on the same read, which limits the ability of short-read sequencing to detect, e.g., co-occurrence of drug-resistance variants. Long-read panel sequencing enables direct phasing of amplicon variants, among multiple other benefits; however, the high error rates of current technologies have prevented its applicability in the past. We have developed nanopanel2 (np2), a variant caller for Nanopore panel sequencing data. Np2 works directly on base-called FAST5 files and uses allele probability distributions and several other filters to robustly separate true from false positive calls. It effectively calls SNVs and INDELs with variant allele frequencies (VAF) as low as 1% and 5%, respectively, and produces only a few low-frequency false-positive calls. Haplotype compositions are then determined by direct phasing. Np2 is the first somatic variant caller for Nanopore data, enabling accurate, fast (turnaround <48 h), and cheap (sequencing costs ~$10/sample) diagnostic workflows.
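The VAF detection limits quoted above (1% for SNVs, 5% for INDELs) amount to a per-type frequency threshold applied to allele counts. The sketch below illustrates only that thresholding step, with hypothetical names and toy counts; np2's actual filtering additionally uses allele probability distributions and other filters.

```python
def call_variants(counts, depth, snv_vaf=0.01, indel_vaf=0.05):
    """Toy VAF-threshold calling: keep alleles whose variant allele
    frequency clears the per-type detection limit (1% SNV / 5% INDEL,
    as reported for np2)."""
    calls = []
    for allele, (n_reads, vtype) in counts.items():
        vaf = n_reads / depth
        threshold = snv_vaf if vtype == "SNV" else indel_vaf
        if vaf >= threshold:
            calls.append((allele, round(vaf, 4)))
    return calls

# An SNV at 1.5% VAF passes; an INDEL at 3% falls below its 5% limit.
counts = {"chr1:100A>T": (150, "SNV"), "chr1:200del": (300, "INDEL")}
print(call_variants(counts, depth=10000))
```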


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelley Paskov ◽  
Jae-Yoon Jung ◽  
Brianna Chrisman ◽  
Nate T. Stockham ◽  
Peter Washington ◽  
...  

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
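The observable signal underlying this method is the Mendelian error: a child genotype that cannot be produced by any combination of parental alleles. A minimal consistency check can be sketched as follows; the function names and trio encoding (genotypes as allele pairs) are illustrative assumptions, and the authors' method additionally scales this observed fraction up to genome-wide precision/recall estimates.

```python
from itertools import product

def is_mendelian_error(child, mother, father):
    """True if the child's genotype (an allele pair, e.g. ("A", "G"))
    cannot be formed by taking one allele from each parent."""
    a, b = child
    for m, f in product(mother, father):
        if (a, b) == (m, f) or (a, b) == (f, m):
            return False
    return True

def error_rate(trios):
    """Fraction of trio genotype calls flagged as Mendelian errors --
    the directly observable subset of sequencing errors."""
    errors = sum(is_mendelian_error(*trio) for trio in trios)
    return errors / len(trios)

# ("A","A") child is impossible for an ("A","G") x ("G","G") mating
# only if no parental combination yields it; here the mother can
# contribute "A" but the father cannot, so it is flagged.
print(is_mendelian_error(("A", "A"), ("A", "G"), ("G", "G")))
```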


2021 ◽  
Author(s):  
Gregory M Miller ◽  
Austin J Ellis ◽  
Rangaprasad Sarangarajan ◽  
Amay Parikh ◽  
Leonardo O Rodrigues ◽  
...  

Objective: The COVID-19 pandemic generated a massive amount of clinical data, which potentially holds yet undiscovered answers related to COVID-19 morbidity, mortality, long-term effects, and therapeutic solutions. The objective of this study was to generate insights on COVID-19 mortality-associated factors and identify potential new therapeutic options for COVID-19 patients by employing artificial intelligence analytics on real-world data. Materials and Methods: A Bayesian statistics-based artificial intelligence data analytics tool (bAIcis®) within the Interrogative Biology® platform was used for network learning, causal inference, and hypothesis generation to analyze 16,277 PCR-positive patients from a database of 279,281 inpatients and outpatients tested for SARS-CoV-2 infection by antigen, antibody, or PCR methods during the first pandemic year in Central Florida. This approach generated causal networks that enabled unbiased identification of significant predictors of mortality for specific COVID-19 patient populations. These findings were validated by logistic regression, regression by least absolute shrinkage and selection operator, and bootstrapping. Results: We found that in the SARS-CoV-2 PCR-positive patient cohort, early use of the antiemetic agent ondansetron was associated with increased survival in mechanically ventilated patients. Conclusions: The results demonstrate how real-world COVID-19-focused data analysis using artificial intelligence can generate valid insights that could possibly support clinical decision-making and minimize the future loss of lives and resources.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yao Wang ◽  
Yan Wang ◽  
Chunjie Guo ◽  
Shuangquan Zhang ◽  
Lili Yang

Glioma is the main type of malignant brain tumor in adults, and the status of isocitrate dehydrogenase (IDH) mutation highly affects the diagnosis, treatment, and prognosis of gliomas. Radiographic medical imaging provides a noninvasive platform for sampling both inter- and intralesion heterogeneity of gliomas, and previous research has shown that the IDH genotype can be predicted from the fusion of multimodality radiology images. The features of medical images and the IDH genotype are vital for medical treatment; however, a multitask framework for segmenting the lesion areas of gliomas and predicting IDH genotype is still lacking. In this paper, we propose a novel three-dimensional (3D) multitask deep learning model for segmentation and genotype prediction (SGPNet). Residual units are also introduced into SGPNet, allowing the output blocks to extract hierarchical features for different tasks and facilitating information propagation. Our model reduces classification error rates by 26.6% compared with previous models on the Multimodal Brain Tumor Segmentation Challenge (BRATS) 2020 and The Cancer Genome Atlas (TCGA) glioma datasets. Furthermore, we are the first to practically investigate the influence of lesion areas on the performance of IDH genotype prediction by setting different groups of learning targets. The experimental results indicate that information from lesion areas is more important for IDH genotype prediction. Our framework is effective and generalizable, and can serve as a highly automated tool to be applied in clinical decision making.


2021 ◽  
Author(s):  
Barış Ekim ◽  
Bonnie Berger ◽  
Rayan Chikhi

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvements in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory, and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly enable a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
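The projection from nucleotide-space to minimizer-space can be illustrated concretely: select the smallest m-mer in each sliding window, then form k-mers over the resulting minimizer sequence. The sketch below uses lexicographic window minimizers for simplicity; the paper's mdBG uses a universe-sampling minimizer scheme, so treat this as an assumption-laden simplification with hypothetical function names.

```python
def minimizers(seq, w=5, m=3):
    """Window minimizers: the lexicographically smallest m-mer in each
    window of w consecutive m-mers, deduplicated by position.
    (mdBG samples minimizers differently; this is a simplification.)"""
    kmers = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    out = []
    for i in range(len(kmers) - w + 1):
        best = min(range(i, i + w), key=lambda j: kmers[j])
        if not out or out[-1][0] != best:
            out.append((best, kmers[best]))
    return [km for _, km in out]

def k_min_mers(mins, k=3):
    """k-min-mers: k-mers over the minimizer alphabet, the atomic
    tokens of the minimizer-space de Bruijn graph."""
    return [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]

print(minimizers("ACGTACGTTG"))
```

Because only a small fraction of positions survive as minimizers, the downstream de Bruijn graph is built over a far shorter token sequence, which is where the speed and memory gains come from.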


2020 ◽  
Author(s):  
Ryan Ehrlich ◽  
Larisa Kamga ◽  
Anna Gil ◽  
Katherine Luzuriaga ◽  
Liisa Selin ◽  
...  

Abstract Motivation: Computationally predicting the specificity of T cell receptors can be a powerful tool to shed light on the immune response against infectious diseases and cancers, autoimmunity, cancer immunotherapy, and immunopathology. With more T cell receptor sequence data becoming available, the need for bioinformatics approaches to tackle this problem is even more pressing. Here we present SwarmTCR, a method that uses labeled sequence data to predict the specificity of T cell receptors using a nearest-neighbor approach. SwarmTCR works by optimizing the weights of the individual CDR regions to maximize classification performance. Results: We compared the performance of SwarmTCR against a state-of-the-art method (TCRdist) and showed that SwarmTCR performed significantly better on epitopes EBV-BRLF1300, EBV-BRLF1109, NS4B214–222 with single cell data and epitopes EBV-BRLF1300, EBV-BRLF1109, IAV-M158 with bulk sequencing data (α and β chains). In addition, we show that the weights returned by SwarmTCR are biologically interpretable. Availability: SwarmTCR is distributed freely under the terms of the GPL-3 license. The source code and all sequencing data are available at GitHub (https://github.com/thecodingdoc/SwarmTCR). Contact: [email protected]
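The core idea, a nearest-neighbor classifier over a weighted sum of per-CDR-region distances, can be sketched briefly. The per-region Hamming distance, the dictionary encoding of receptors, and the function names here are illustrative assumptions; SwarmTCR optimizes the region weights (e.g. via particle swarm-style search) rather than fixing them.

```python
def weighted_distance(tcr_a, tcr_b, weights):
    """Sum of per-CDR-region distances scaled by learned weights.
    A simple Hamming distance stands in for the per-region metric."""
    def hamming(x, y):
        return sum(c1 != c2 for c1, c2 in zip(x, y)) + abs(len(x) - len(y))
    return sum(w * hamming(tcr_a[region], tcr_b[region])
               for region, w in weights.items())

def predict(query, labeled, weights):
    """1-nearest-neighbor epitope prediction under the weighted metric.
    `labeled` is a list of (receptor, epitope_label) pairs."""
    return min(labeled,
               key=lambda item: weighted_distance(query, item[0], weights))[1]

labeled = [({"CDR3": "CASSL"}, "epitope1"), ({"CDR3": "CAWWW"}, "epitope2")]
print(predict({"CDR3": "CASSV"}, labeled, weights={"CDR3": 1.0}))
```

Upweighting the regions that carry the most specificity signal (typically CDR3) is what makes the learned weights biologically interpretable.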


Cancers ◽  
2022 ◽  
Vol 14 (2) ◽  
pp. 266
Author(s):  
Beth Russell ◽  
Charlotte Moss ◽  
Eirini Tsotra ◽  
Charalampos Gousis ◽  
Debra Josephs ◽  
...  

Background: This study aimed to assess the outcome of cancer patients undergoing systemic anti-cancer treatment (SACT) at our centre to help inform future clinical decision-making around SACT during the COVID-19 pandemic. Methods: Patients receiving at least one episode of SACT for solid tumours at Guy’s Cancer Centre between 1 March and 31 May 2020 and the same period in 2019 were included in the study. Data were collected on demographics, tumour type/stage, treatment type (chemotherapy, immunotherapy, biological-targeted) and SARS-CoV-2 infection. Results: A total of 2120 patients received SACT in 2020, compared to 2449 in 2019 (13% decrease). From 2019 to 2020, there was an increase in stage IV disease (62% vs. 72%), a decrease in chemotherapy (42% vs. 34%), an increase in immunotherapy (6% vs. 10%), but similar rates of biologically targeted treatments (37% vs. 38%). There was a significant increase in 1st and 2nd line treatments in 2020 (68% vs. 81%; p < 0.0001) and a reduction in 3rd and subsequent lines (26% vs. 15%; p = 0.004) compared to 2019. Of the 2020 cohort, 2% of patients developed SARS-CoV-2 infections. Conclusions: These real-world data from a tertiary Cancer Centre suggest that despite the challenges faced due to the COVID-19 pandemic, SACT was able to be continued without any significant effects on the mortality of solid-tumour patients. There was a low rate (2%) of SARS-CoV-2 infection, which is comparable to the 1.4% point prevalence in our total cancer population.


2017 ◽  
Author(s):  
Łukasz Roguski ◽  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Sebastian Deorowicz

Abstract The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed, and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. The proposed algorithm does not use any reference sequences for compression, and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. We demonstrate through extensive simulations that FaStore achieves a significant improvement in compression ratio with respect to previously proposed algorithms for this task. In addition, we perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance.
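A common family of lossy modes for FASTQ compressors is quality-score binning: collapsing the ~40-symbol Phred alphabet into a few representative values so the entropy coder has less to encode. The sketch below shows the general idea with illustrative bin edges; it is not FaStore's actual quantization scheme, and the function name is hypothetical.

```python
def bin_qualities(quals, bins=((0, 19, 10), (20, 29, 25), (30, 93, 35))):
    """Map each Phred quality score to a representative value for its
    bin, shrinking the quality alphabet from dozens of symbols to three.
    Each bin is (low, high, representative); edges are illustrative."""
    out = []
    for q in quals:
        for lo, hi, rep in bins:
            if lo <= q <= hi:
                out.append(rep)
                break
    return out

# Three reads' worth of scores collapse onto three representative values.
print(bin_qualities([5, 22, 38]))
```

Fewer distinct symbols means longer runs and lower entropy, which is where the "significant compression gains" of lossy modes come from; the empirical question the authors study is whether this coarsening perturbs downstream variant calls.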


Author(s):  
Davide Barbieri ◽  
Nitesh Chawla ◽  
Luciana Zaccagni ◽  
Tonći Grgurinović ◽  
Jelena Šarac ◽  
...  

Cardiovascular diseases are the main cause of death worldwide. The aim of the present study is to verify the performance of a data mining methodology in the evaluation of cardiovascular risk in athletes, and whether the results may be used to support clinical decision making. Anthropometric (height and weight), demographic (age and sex) and biomedical (blood pressure and pulse rate) data of 26,002 athletes were collected in 2012 during routine sport medical examinations, which included electrocardiography at rest. Subjects were involved in competitive sport practice, for which medical clearance was needed. Outcomes were negative for the large majority, as expected in an active population. Resampling was applied to balance the positive/negative class ratio. A decision tree and logistic regression were used to classify individuals as either at risk or not. The receiver operating characteristic curve was used to assess classification performance. Data mining and resampling improved cardiovascular risk assessment in terms of increased area under the curve. The proposed methodology can be effectively applied to biomedical data in order to optimize clinical decision making, and, at the same time, minimize the amount of unnecessary examinations.
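The evaluation metric used here, the area under the ROC curve, has a compact rank-based formulation: it equals the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative one. A minimal sketch (hypothetical function name, toy scores, O(P·N) for clarity):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney identity:
    the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that ranks both positives above both negatives
# achieves a perfect AUC of 1.0.
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]))
```

AUC is insensitive to class imbalance in the ranking sense, which is why it pairs naturally with the resampling step the study applies to its mostly negative screening population.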


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e19262-e19262
Author(s):  
Alyssa Antonopoulos ◽  
Elizabeth Eldridge ◽  
George Managadze ◽  
Elia Stupka ◽  
Hakim Lakhani ◽  
...  

e19262 Background: While Next-Generation Sequencing (NGS) tests are becoming increasingly common for diagnosis, molecular characterization, and treatment, a significant amount of molecular data derives from single-gene or analyte tests. Single-gene test information is stored in disparate sources, including the electronic medical record (EMR), and data access for clinical use remains a challenge. A solution that harmonizes biomarker data beyond standard NGS-centric data and links it to rich clinical data is required for a complete patient picture. Methods: Health Catalyst’s extended real-world database, Touchstone, includes a molecular data mart that integrates data from provider and life sciences proprietary NGS panels, Laboratory Information Systems, and other repositories. A portion of the data is derived from single-gene tests documented in the EMR. Biomarker data from EMRs was extracted from six health systems via a proprietary pipeline for extracting biomarker data. The algorithm relies on a curated ontology for molecular terms and publicly available terminologies for human genetics. Minor transformations extract pertinent variant information where available to harmonize with NGS-level data. Results: Over 44 thousand molecular lab results from over 24 thousand patients were identified with this method. The oncology classes for which molecular data was identified in the greatest number of patients include skin, hematological, breast, digestive, and lung cancers (Table). PRTN3, EGFR, BRAF, JAK2, ERBB2, and KRAS are among the most commonly tested genes. Conclusions: Integrated real-world clinical and biomarker data from single-gene tests can inform clinical decision-making and support clinical trial recruitment across a broader patient population. [Table: see text]
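The harmonization step described, mapping free-text gene mentions from EMR single-gene tests onto canonical symbols via a curated ontology, reduces at its simplest to synonym-table lookup with normalization. The table and function below are a tiny illustrative stand-in; the actual pipeline and ontology are proprietary.

```python
# Illustrative synonym table: free-text EMR mentions -> canonical symbol.
GENE_SYNONYMS = {
    "HER2": "ERBB2",
    "HER-2": "ERBB2",
    "ERBB2": "ERBB2",
    "KRAS": "KRAS",
    "JAK2": "JAK2",
}

def harmonize(raw_term):
    """Normalize a free-text gene mention (case, whitespace) and map it
    to a canonical symbol; returns None for unrecognized terms, which a
    real pipeline would route to curation."""
    return GENE_SYNONYMS.get(raw_term.strip().upper())

print(harmonize("  her-2 "))
```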

