GenTB: A user-friendly genome-based predictor for tuberculosis resistance powered by machine learning

ABSTRACT Introduction Multidrug-resistant Mycobacterium tuberculosis (Mtb) is a significant global public health threat. Genotypic resistance prediction from Mtb DNA sequences offers an alternative to laboratory-based drug-susceptibility testing. User-friendly and accurate resistance prediction tools are needed to enable public health and clinical practitioners to rapidly diagnose resistance and inform treatment regimens. Methods We present Translational Genomics platform for Tuberculosis (GenTB), a web-based application to predict antibiotic resistance from next-generation sequence data. The user can choose between two potential predictors, a Random Forest (RF) classifier and a Wide and Deep Neural Network (WDNN) to predict phenotypic resistance to 13 and 10 anti-tuberculosis drugs, respectively. We benchmark GenTB’s predictive performance along with leading TB resistance prediction tools (Mykrobe and TB-Profiler) using a ground truth dataset of 20,408 isolates with laboratory-based drug susceptibility data. Results All four tools reliably predicted resistance to first-line tuberculosis drugs but had varying performance for second-line drugs. The mean sensitivities for GenTB-RF and GenTB-WDNN across the nine shared drugs were 77.6% (95% CI 76.6 - 78.5%) and 75.4% (95% CI 74.5 - 76.4%) respectively, and marginally higher than the sensitivities of TB-Profiler at 74.4% (95% CI 73.4 - 75.3%) and Mykrobe at 71.9% (95% CI 70.9 - 72.9%). The higher sensitivities came at the expense of ≤1.5% lower specificity: Mykrobe 97.6% (95% CI 97.5 - 97.7%), TB-Profiler 96.9% (95% CI 96.7 to 97.0%), GenTB-WDNN 96.2% (95% CI 96.0 to 96.4%), and GenTB-RF 96.1% (95% CI 96.0 to 96.3%). Genotypic resistance sensitivity was 11% and 9% lower for isoniazid and rifampicin respectively, on isolates sequenced at low depth (<10x across 95% of the genome), emphasizing the need to quality control input sequence data before prediction.
We discuss differences between tools in reporting results to the user, including variants underlying the resistance calls and any novel or indeterminate variants. Conclusion GenTB is an easy-to-use online tool to rapidly and accurately predict resistance to anti-tuberculosis drugs. GenTB can be accessed online at https://gentb.hms.harvard.edu, and the source code is available at https://github.com/farhat-lab/gentb-site.
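
The low-depth criterion mentioned in the abstract can be illustrated with a small check. This is only a sketch and not GenTB's own quality-control code: it assumes that "<10x across 95% of the genome" means fewer than 95% of positions reach a depth of 10, and the function name `passes_depth_qc` and the toy depth values are hypothetical.

```python
from typing import Sequence


def passes_depth_qc(
    depths: Sequence[int],
    min_depth: int = 10,
    min_fraction: float = 0.95,
) -> bool:
    """Return True if at least `min_fraction` of positions have depth >= `min_depth`.

    Assumed reading of the abstract's threshold; the thresholds are parameters
    so that a different interpretation can be plugged in.
    """
    if not depths:
        return False  # no coverage information: treat as failing QC
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction


# Toy per-position depths (hypothetical values, not real sequencing data).
good_isolate = [30] * 96 + [2] * 4   # 96% of positions at >= 10x
poor_isolate = [30] * 90 + [2] * 10  # only 90% of positions at >= 10x

print(passes_depth_qc(good_isolate))  # True
print(passes_depth_qc(poor_isolate))  # False
```

In practice the per-position depths would come from an alignment tool's depth report; that step is left out here because the abstract does not describe it.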


GenTB: A user-friendly genome-based predictor for tuberculosis resistance powered by machine learning

Genome Medicine ◽

10.1186/s13073-021-00953-4 ◽

2021 ◽

Vol 13 (1) ◽

Cited By ~ 1

Author(s):

Matthias I. Gröschel ◽

Martin Owens ◽

Luca Freschi ◽

Roger Vargas ◽

Maximilian G. Marin ◽

...

Keyword(s):

Public Health ◽

Dna Sequences ◽

Sequence Data ◽

Drug Susceptibility ◽

Control Input ◽

Genotypic Resistance ◽

Prediction Tools ◽

Link Type ◽

Resistance Prediction ◽

User Friendly

Abstract Background Multidrug-resistant Mycobacterium tuberculosis (Mtb) is a significant global public health threat. Genotypic resistance prediction from Mtb DNA sequences offers an alternative to laboratory-based drug-susceptibility testing. User-friendly and accurate resistance prediction tools are needed to enable public health and clinical practitioners to rapidly diagnose resistance and inform treatment regimens. Results We present Translational Genomics platform for Tuberculosis (GenTB), a free and open web-based application to predict antibiotic resistance from next-generation sequence data. The user can choose between two potential predictors, a Random Forest (RF) classifier and a Wide and Deep Neural Network (WDNN) to predict phenotypic resistance to 13 and 10 anti-tuberculosis drugs, respectively. We benchmark GenTB’s predictive performance along with leading TB resistance prediction tools (Mykrobe and TB-Profiler) using a ground truth dataset of 20,408 isolates with laboratory-based drug susceptibility data. All four tools reliably predicted resistance to first-line tuberculosis drugs but had varying performance for second-line drugs. The mean sensitivities for GenTB-RF and GenTB-WDNN across the nine shared drugs were 77.6% (95% CI 76.6–78.5%) and 75.4% (95% CI 74.5–76.4%), respectively, and marginally higher than the sensitivities of TB-Profiler at 74.4% (95% CI 73.4–75.3%) and Mykrobe at 71.9% (95% CI 70.9–72.9%). The higher sensitivities were at an expense of ≤ 1.5% lower specificity: Mykrobe 97.6% (95% CI 97.5–97.7%), TB-Profiler 96.9% (95% CI 96.7 to 97.0%), GenTB-WDNN 96.2% (95% CI 96.0 to 96.4%), and GenTB-RF 96.1% (95% CI 96.0 to 96.3%). Averaged across the four tools, genotypic resistance sensitivity was 11% and 9% lower for isoniazid and rifampicin respectively, on isolates sequenced at low depth (< 10× across 95% of the genome) emphasizing the need to quality control input sequence data before prediction. 
We discuss differences between tools in reporting results to the user, including variants underlying the resistance calls and any novel or indeterminate variants. Conclusions GenTB is an easy-to-use online tool to rapidly and accurately predict resistance to anti-tuberculosis drugs. GenTB can be accessed online at https://gentb.hms.harvard.edu, and the source code is available at https://github.com/farhat-lab/gentb-site.
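
The sensitivity and specificity figures above are proportions with 95% confidence intervals. The sketch below shows one common way to compute such values from confusion-matrix counts. It is an illustration only: the abstract does not say how the intervals were derived, so the Wilson score interval is an assumption here, and the counts are made up.

```python
import math
from typing import Tuple


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of phenotypically resistant isolates predicted resistant."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of phenotypically susceptible isolates predicted susceptible."""
    return tn / (tn + fp)


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%).

    Assumed method; the source does not state which interval was used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical counts for one drug (not taken from the benchmark).
tp, fn, tn, fp = 80, 20, 950, 50

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
sens_lo, sens_hi = wilson_interval(tp, tp + fn)
spec_lo, spec_hi = wilson_interval(tn, tn + fp)

print(f"sensitivity {sens:.3f} (95% CI {sens_lo:.3f} to {sens_hi:.3f})")
print(f"specificity {spec:.3f} (95% CI {spec_lo:.3f} to {spec_hi:.3f})")
```

Because the specificity denominator (susceptible isolates) is usually much larger than the sensitivity denominator, its interval is correspondingly narrower, which matches the pattern in the reported figures.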


SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control

mSystems ◽

10.1128/msystems.00202-17 ◽

2018 ◽

Vol 3 (3) ◽

Cited By ~ 15

Author(s):

Gabriel A. Al-Ghalith ◽

Benjamin Hillmann ◽

Kaiwei Ang ◽

Robin Shields-Cutler ◽

Dan Knights

Keyword(s):

Quality Control ◽

Dna Sequences ◽

Sequence Data ◽

Background Knowledge ◽

Sequencing Technology ◽

Data Set ◽

Short Read ◽

Dna Quality ◽

Public Data ◽

User Friendly

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or type of adaptors used to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). 
Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.


Taxonomic identification from metagenomic and metabarcoding data using any genetic marker

10.1101/253377 ◽

2018 ◽

Author(s):

Johan Bengtsson-Palme ◽

Rodney T. Richardson ◽

Marco Meola ◽

Christian Wurzbacher ◽

Émilie D. Tremblay ◽

...

Keyword(s):

Genetic Marker ◽

Dna Sequences ◽

Sequence Data ◽

Taxonomic Diversity ◽

Taxonomic Classification ◽

Taxonomic Identification ◽

Link Type

Correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. However, there is no genetic marker that gives sufficient performance across all the biological kingdoms, hampering studies of taxonomic diversity in many groups of organisms. We here present a major update to Metaxa2 (http://microbiology.se/software/metaxa2/) that enables the use of any genetic marker for taxonomic classification of metagenome and amplicon sequence data.


EMBL2checklists: A Python package to facilitate the user-friendly submission of plant DNA barcoding sequences to ENA

10.1101/435644 ◽

2018 ◽

Author(s):

Michael Gruenstaeudl ◽

Yannick Hartmaring

Keyword(s):

Dna Barcoding ◽

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Software Tool ◽

Plant Dna ◽

Dna Sequence Data ◽

User Friendly ◽

Common Plant ◽

Python Package

Abstract Background The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, the range of available software tools that facilitate the preparation of sequence data for database submissions is low, especially for sequences generated via plant DNA barcoding. Current submission procedures can be complex and prohibitively time expensive for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant DNA barcoding. Methods A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called “checklists”) for a subsequent upload to the public sequence database of the European Nucleotide Archive (ENA). The software tool, titled “EMBL2checklists”, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates output that can be uploaded via the interactive Webin submission system of ENA. Results EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common plant DNA barcoding sequences into easily editable spreadsheets that require no further processing but their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical as well as an efficient command-line interface for its operation.
The utility of the software is illustrated by its application in the submission of DNA sequences of two recent plant phylogenetic investigations and one fungal metagenomic study. Discussion EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant biologists without bioinformatics expertise to generate submission-ready checklists from common plant DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.
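
The core idea described above, turning sequence records into a tab-delimited spreadsheet, can be sketched with the Python standard library. This is not EMBL2checklists' code or API. The function `records_to_tsv`, the column names, and the sample record are hypothetical placeholders, and they do not reproduce the actual fields of any ENA checklist; parsing of EMBL or GenBank flat files is also left out.

```python
import csv
import io
from typing import Dict, List, Sequence


def records_to_tsv(records: Sequence[Dict[str, str]], columns: List[str]) -> str:
    """Serialize already-parsed sequence records as tab-delimited text.

    Missing values become empty cells; keys not listed in `columns` are dropped.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


# Placeholder columns and data, chosen only for illustration.
columns = ["entry_id", "organism", "marker", "sequence"]
records = [
    {"entry_id": "1", "organism": "Example species", "marker": "example_marker",
     "sequence": "ACGTACGT"},
    {"entry_id": "2", "organism": "Example species", "sequence": "TTGACA"},
]

print(records_to_tsv(records, columns), end="")
```

Writing through `csv.DictWriter` rather than joining strings by hand keeps quoting correct if a metadata value itself contains a tab or a newline.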


WeFaceNano: a user-friendly pipeline for complete ONT sequence assembly and detection of antibiotic resistance in multi-plasmid bacterial isolates

BMC Microbiology ◽

10.1186/s12866-021-02225-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Astrid P. Heikema ◽

Rick Jansen ◽

Saskia D. Hiltemann ◽

John P. Hays ◽

Andrew P. Stubbs

Keyword(s):

Antibiotic Resistance ◽

Resistance Genes ◽

Dna Sequences ◽

Sequence Data ◽

Antibiotic Resistance Genes ◽

Clinical Samples ◽

Microbial Resistance ◽

Sequencing Platform ◽

Long Read ◽

User Friendly

Abstract Background Bacterial plasmids often carry antibiotic resistance genes and are a significant factor in the spread of antibiotic resistance. The ability to completely assemble plasmid sequences would facilitate the localization of antibiotic resistance genes, the identification of genes that promote plasmid transmission and the accurate tracking of plasmid mobility. However, the complete assembly of plasmid sequences using the currently most widely used sequencing platform (Illumina-based sequencing) is restricted due to the generation of short sequence lengths. The long-read Oxford Nanopore Technologies (ONT) sequencing platform overcomes this limitation. Still, the assembly of plasmid sequence data remains challenging due to software incompatibility with long-reads and the error rate generated using ONT sequencing. Bioinformatics pipelines have been developed for ONT-generated sequencing but require computational skills that frequently are beyond the abilities of scientific researchers. To overcome this challenge, the authors developed ‘WeFaceNano’, a user-friendly Web interFace for rapid assembly and analysis of plasmid DNA sequences generated using the ONT platform. WeFaceNano includes: a read statistics report; two assemblers (Miniasm and Flye); BLAST searching; the detection of antibiotic resistance- and replicon genes and several plasmid visualizations. A user-friendly interface displays the main features of WeFaceNano and gives access to the analysis tools. Results Publicly available ONT sequence data of 21 plasmids were used to validate WeFaceNano, with plasmid assemblages and anti-microbial resistance gene detection being concordant with the published results. Interestingly, the “Flye” assembler with “meta” settings generated the most complete plasmids. 
Conclusions WeFaceNano is a user-friendly open-source software pipeline suitable for accurate plasmid assembly and the detection of anti-microbial resistance genes in (clinical) samples where multiple plasmids can be present.


GalaxyTrakr: a distributed analysis tool for public health whole genome sequence data accessible to non-bioinformaticians

BMC Genomics ◽

10.1186/s12864-021-07405-8 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 1

Author(s):

Jayanthi Gangiredla ◽

Hugh Rand ◽

Daniel Benisatto ◽

Justin Payne ◽

Charles Strittmatter ◽

...

Keyword(s):

Public Health ◽

Food Safety ◽

Data Storage ◽

Sequence Data ◽

Whole Genome Sequence ◽

Analysis Tool ◽

Whole Genome ◽

User Friendliness ◽

Computing Power ◽

Link Type

Abstract Background Processing and analyzing whole genome sequencing (WGS) is computationally intense: a single Illumina MiSeq WGS run produces ~ 1 million 250-base-pair reads for each of 24 samples. This poses significant obstacles for smaller laboratories, or laboratories not affiliated with larger projects, which may not have dedicated bioinformatics staff or computing power to effectively use genomic data to protect public health. Building on the success of the cloud-based Galaxy bioinformatics platform (http://galaxyproject.org), already known for its user-friendliness and powerful WGS analytical tools, the Center for Food Safety and Applied Nutrition (CFSAN) at the U.S. Food and Drug Administration (FDA) created a customized ‘instance’ of the Galaxy environment, called GalaxyTrakr (https://www.galaxytrakr.org), for use by laboratory scientists performing food-safety regulatory research. The goal was to enable laboratories outside of the FDA internal network to (1) perform quality assessments of sequence data, (2) identify links between clinical isolates and positive food/environmental samples, including those at the National Center for Biotechnology Information sequence read archive (https://www.ncbi.nlm.nih.gov/sra/), and (3) explore new methodologies such as metagenomics. GalaxyTrakr hosts a variety of free and adaptable tools and provides the data storage and computing power to run the tools. These tools support coordinated analytic methods and consistent interpretation of results across laboratories. Users can create and share tools for their specific needs and use sequence data generated locally and elsewhere. Results In its first full year (2018), GalaxyTrakr processed over 85,000 jobs and went from 25 to 250 users, representing 53 different public and state health laboratories, academic institutions, international health laboratories, and federal organizations. 
By mid-2020, it had grown to 600 registered users and had processed over 450,000 analytical jobs. To illustrate how laboratories are making use of this resource, we describe how six institutions use GalaxyTrakr to quickly analyze and review their data. Instructions for participating in GalaxyTrakr are provided. Conclusions GalaxyTrakr advances food safety by providing reliable and harmonized WGS analyses for public health laboratories and promoting collaboration across laboratories with differing resources. Anticipated enhancements to this resource will include workflows for additional foodborne pathogens, viruses, and parasites, as well as new tools and services.


Antibiotic resistance prediction for Mycobacterium tuberculosis from genome sequence data with Mykrobe

Wellcome Open Research ◽

10.12688/wellcomeopenres.15603.1 ◽

2019 ◽

Vol 4 ◽

pp. 191 ◽

Cited By ~ 7

Author(s):

Martin Hunt ◽

Phelim Bradley ◽

Simon Grandjean Lapierre ◽

Simon Heys ◽

Mark Thomsit ◽

...

Keyword(s):

Mycobacterium Tuberculosis ◽

Sequence Data ◽

Universal Access ◽

Independent Set ◽

Drug Susceptibility ◽

Drug Susceptibility Testing ◽

Error Rates ◽

World Health ◽

Sequencing Data ◽

Resistance Prediction

Two billion people are infected with Mycobacterium tuberculosis, leading to 10 million new cases of active tuberculosis and 1.5 million deaths annually. Universal access to drug susceptibility testing (DST) has become a World Health Organization priority. We previously developed a software tool, Mykrobe predictor, which provided offline species identification and drug resistance predictions for M. tuberculosis from whole genome sequencing (WGS) data. Performance was insufficient to support the use of WGS as an alternative to conventional phenotype-based DST, due to mutation catalogue limitations. Here we present a new tool, Mykrobe, which provides the same functionality based on a new software implementation. Improvements include i) an updated mutation catalogue giving greater sensitivity to detect pyrazinamide resistance, ii) support for user-defined resistance catalogues, iii) improved identification of non-tuberculous mycobacterial species, and iv) an updated statistical model for Oxford Nanopore Technologies sequencing data. Mykrobe is released under MIT license at https://github.com/mykrobe-tools/mykrobe. We incorporate mutation catalogues from the CRyPTIC consortium et al. (2018) and from Walker et al. (2015), and make improvements based on performance on an initial set of 3206 and an independent set of 5845 M. tuberculosis Illumina sequences. To give estimates of error rates, we use a prospectively collected dataset of 4362 M. tuberculosis isolates. Using culture based DST as the reference, we estimate Mykrobe to be 100%, 95%, 82%, 99% sensitive and 99%, 100%, 99%, 99% specific for rifampicin, isoniazid, pyrazinamide and ethambutol resistance prediction respectively. We benchmark against four other tools on 10207 (=5845+4362) samples, and also show that Mykrobe gives concordant results with nanopore data. 
We measure the ability of Mykrobe-based DST to guide personalized therapeutic regimen design in the context of complex drug susceptibility profiles, showing 94% concordance of implied regimen with that driven by phenotypic DST, higher than all other benchmarked tools.


Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel

Microbial Genomics ◽

10.1099/mgen.0.000651 ◽

2021 ◽

Vol 7 (9) ◽

Author(s):

Geneviève Labbé ◽

Peter Kruczkiewicz ◽

James Robertson ◽

Philip Mabon ◽

Justin Schonfeld ◽

...

Keyword(s):

Public Health ◽

Genetic Diversity ◽

Quality Assurance ◽

Phylogenetic Trees ◽

Sequence Data ◽

Bacterial Pathogens ◽

Snp Genotyping ◽

Content Type ◽

Link Type ◽

Low Genetic Diversity

Hierarchical genotyping approaches can provide insights into the source, geography and temporal distribution of bacterial pathogens. Multiple hierarchical SNP genotyping schemes have previously been developed so that new isolates can rapidly be placed within pre-computed population structures, without the need to rebuild phylogenetic trees for the entire dataset. This classification approach has, however, seen limited uptake in routine public health settings due to analytical complexity and the lack of standardized tools that provide clear and easy ways to interpret results. The BioHansel tool was developed to provide an organism-agnostic tool for hierarchical SNP-based genotyping. The tool identifies split k-mers that distinguish predefined lineages in whole genome sequencing (WGS) data using SNP-based genotyping schemes. BioHansel uses the Aho-Corasick algorithm to type isolates from assembled genomes or raw read sequence data in a matter of seconds, with limited computational resources. This makes BioHansel ideal for use by public health agencies that rely on WGS methods for surveillance of bacterial pathogens. Genotyping results are evaluated using a quality assurance module which identifies problematic samples, such as low-quality or contaminated datasets. Using existing hierarchical SNP schemes for Mycobacterium tuberculosis and Salmonella Typhi, we compare the genotyping results obtained with the k-mer-based tools BioHansel and SKA, with those of the organism-specific tools TBProfiler and genotyphi, which use gold-standard reference-mapping approaches. We show that the genotyping results are fully concordant across these different methods, and that the k-mer-based tools are significantly faster. We also test the ability of the BioHansel quality assurance module to detect intra-lineage contamination and demonstrate that it is effective, even in populations with low genetic diversity. We demonstrate the scalability of the tool using a dataset of ~8100 S. 
Typhi public genomes and provide the aggregated results of geographical distributions as part of the tool’s output. BioHansel is an open source Python 3 application available on PyPI and Conda repositories and as a Galaxy tool from the public Galaxy Toolshed. In a public health context, BioHansel enables rapid and high-resolution classification of bacterial pathogens with low genetic diversity.
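
The abstract names the Aho-Corasick algorithm as the way marker k-mers are located in sequence data. The sketch below is a minimal, generic Aho-Corasick matcher in pure Python, included to show why many k-mers can be searched in a single pass over a sequence. It is not BioHansel's implementation, and the two 5-mers and the test sequence are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[List[str]]]


def build_automaton(patterns: Sequence[str]) -> Automaton:
    """Build the trie, failure links, and output lists for the patterns."""
    goto: List[Dict[str, int]] = [{}]  # goto[state][char] -> next state
    fail: List[int] = [0]              # failure link of each state
    out: List[List[str]] = [[]]        # patterns ending at each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first traversal so shallower failure targets are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text: str, automaton: Automaton) -> List[Tuple[int, str]]:
    """Return (start_index, pattern) for every occurrence, scanning text once."""
    goto, fail, out = automaton
    state = 0
    hits: List[Tuple[int, str]] = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


# Hypothetical marker k-mers and sequence, for illustration only.
markers = ["ACGTA", "CGTAC"]
sequence = "TTACGTACGG"
print(find_matches(sequence, build_automaton(markers)))
# [(2, 'ACGTA'), (3, 'CGTAC')]
```

The scan time grows with the length of the sequence plus the number of hits rather than with the number of k-mers, which is consistent with the speed the abstract reports for k-mer-based typing.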


Pedigree and Pedigree Import Wizard

HortScience ◽

10.21273/hortsci.33.3.552g ◽

1998 ◽

Vol 33 (3) ◽

pp. 552g-553

Author(s):

Shahrokh Khandizadeh

Keyword(s):

Additional Data ◽

File Format ◽

Fruit Crops ◽

Operating Environment ◽

Agronomic Characteristics ◽

Link Type ◽

Plant Characteristics ◽

User Friendly

Pedigree for Windows is a user-friendly program that allows the user to trace agronomic characteristics, draw pedigrees, and view images of several fruit crops, including more than 1400 apple, 800 strawberry, 800 almond, 100 blackberry, 80 blueberry, 790 pear, 200 raspberry examples. Pedigree Import Wizard®© for Windows is an add-on software for users who are interested in importing their research or breeding data records of fruit, flower, and plant characteristics and any related images into Pedigree for Windows. Pedigree for Windows and Pedigree Import Wizard have been designed so that a user familiar with the Windows operating environment should have little need to refer to the documentation provided with the program. Pedigree Import Wizard uses a comma-separated value (csv) file format under the MS Excel environment. This option allows the user to add or import additional data to the existing database that are already stored in other software such as Lotus, Excel, Access, QuattroPro, WordPerfect, and MS Word tables, etc., as long as they work under the Windows environment. A free demo version of Pedigree and Pedigree Import Wizard for Windows is available from http://www.pgris.com.


Challenges in evaluating the use of viral sequence data to identify HIV transmission networks for public health

Statistical Communications in Infectious Diseases ◽

10.1515/scid-2019-0019 ◽

2020 ◽

Vol 12 (s1) ◽

Author(s):

Rami Kantor ◽

John P. Fulton ◽

Jon Steingrimsson ◽

Vladimir Novitsky ◽

Mark Howison ◽

...

Keyword(s):

Public Health ◽

United States ◽

Hiv Transmission ◽

Sequence Data ◽

The United States ◽

Viral Sequence ◽

Transmission Networks ◽

New Methods ◽

Hiv Epidemic ◽

The World

Abstract Great efforts are devoted to ending the HIV epidemic, as it continues to have profound public health consequences in the United States and throughout the world, and new interventions and strategies are continuously needed. The use of HIV sequence data to infer transmission networks holds much promise to direct public health interventions where they are most needed. As these new methods are being implemented, evaluating their benefits is essential. In this paper, we recognize challenges associated with such evaluation, and make the case that overcoming these challenges is key to the use of HIV sequence data in routine public health actions to disrupt HIV transmission networks.
