MOST: A modified MLST typing tool based on short read sequencing

Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper compares the reliability of MLST results derived from WGS data, using mapping- and assembly-based approaches, against conventional methods on 323 bacterial genomes of diverse species. The sensitivity of the two WGS-based methods was further investigated with 26 mixed and 29 low-coverage genomic data sets from Salmonella enteritidis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly- and mapping-based approaches, respectively. Among samples typed by both the conventional method (92.9%) and the WGS methods, concordance was 100%. From the 55 mixed and low-coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping- and assembly-based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS-based methods, the mapping-based approach was the most sensitive. In addition, the mapping-based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS assembly-based approaches.
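The percentages in the abstract above follow directly from the reported counts. As a quick check, the following minimal Python sketch recomputes them; the function and variable names are illustrative assumptions and do not come from the paper or from the MOST tool.

```python
# Counts of full MLST profiles as reported in the abstract above.
TOTAL_SAMPLES = 323
FULL_PROFILES = {"conventional": 300, "assembly": 315, "mapping": 322}

# Mixed and low-coverage data sets (26 + 29 = 55 genomes).
TOTAL_DIFFICULT = 55
FULL_PROFILES_DIFFICULT = {"mapping": 49, "assembly": 37}


def percent_typed(n_full, n_total):
    """Percentage of samples yielding a full MLST profile, to one decimal."""
    return round(100.0 * n_full / n_total, 1)


rates = {m: percent_typed(n, TOTAL_SAMPLES) for m, n in FULL_PROFILES.items()}
difficult_rates = {
    m: percent_typed(n, TOTAL_DIFFICULT) for m, n in FULL_PROFILES_DIFFICULT.items()
}

for method, rate in rates.items():
    print(f"{method}: {rate}% of {TOTAL_SAMPLES} samples")
for method, rate in difficult_rates.items():
    print(f"{method}: {rate}% of {TOTAL_DIFFICULT} mixed/low-coverage genomes")
```

The recomputed values agree with the abstract, and the gap between mapping and assembly widens on the mixed and low-coverage data sets.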


MOST: A modified MLST typing tool based on short read sequencing

10.7287/peerj.preprints.1971v1 ◽

2016 ◽

Author(s):

Rediat Tewolde ◽

Timothy Dallman ◽

Ulf Schaefer ◽

Carmen L Sheppard ◽

Philip Ashton ◽

...

Keyword(s):

Conventional Method ◽

Sequence Data ◽

Pcr Amplification ◽

Housekeeping Genes ◽

Data Sets ◽

Bacterial Genomes ◽

Bacterial Populations ◽

Short Read ◽

Short Read Sequence ◽

Low Coverage

Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper compares the reliability of MLST results derived from WGS data, using mapping- and assembly-based approaches, against conventional methods on 325 bacterial genomes of diverse species. The sensitivity of the two WGS-based methods was further investigated with 26 mixed and 29 low-coverage genomic data sets from Salmonella enteritidis and Streptococcus pneumoniae. Of the 325 samples, 92.9% (n = 302), 97.2% (n = 316) and 99.7% (n = 324) full MLST profiles were derived by the conventional method, assembly- and mapping-based approaches, respectively. Among samples typed by both the conventional method (92.9%) and the WGS methods, concordance was 100%. From the 55 mixed and low-coverage genomes, 90.9% (n = 50) and 67.3% (n = 37) full MLST profiles were derived from the mapping- and assembly-based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS-based methods, the mapping-based approach was the most sensitive. In addition, the mapping-based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS assembly-based approaches.


ISMapper: Identifying insertion sequences in bacterial genomes from short read sequence data

10.1101/016345 ◽

2015 ◽

Author(s):

Jane Hawkey ◽

Mohammad Hamidian ◽

Ryan R Wick ◽

David J Edwards ◽

Helen Billman-Jacobe ◽

...

Keyword(s):

Sequence Data ◽

Insertion Sequences ◽

Bacterial Genomes ◽

Short Read ◽

Phenotypic Resistance ◽

Genome Wide ◽

Wide Range ◽

Multiple Copies ◽

Short Read Sequence ◽

Insertion Sites

Background: Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However, IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, direct from paired-end short read data. Results: ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 96% of known IS insertions in the analysis of simulated reads, and 98% in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was reliable for average genome-wide read depths >20x. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots. Conclusions: ISMapper provides a rapid and robust method for identifying IS insertion sites direct from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.
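The depth threshold mentioned in the abstract can be related to a read set with simple arithmetic. The sketch below estimates average genome-wide read depth and the fraction of reads to keep when subsampling to a target depth. It is an illustration only: the function names and the example read count, read length and genome size are hypothetical, and this is not how ISMapper itself computes or subsamples depth.

```python
def mean_depth(n_reads, read_length, genome_size):
    """Approximate average read depth: total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def subsample_fraction(current_depth, target_depth):
    """Fraction of reads to retain to reach target_depth (capped at 1.0)."""
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    return min(1.0, target_depth / current_depth)


# Hypothetical example: 1,000,000 reads of 100 bp on a 4 Mbp genome.
depth = mean_depth(1_000_000, 100, 4_000_000)
fraction_for_20x = subsample_fraction(depth, 20.0)
print(f"mean depth ~{depth:.1f}x; keep {fraction_for_20x:.0%} of reads for 20x")
```

Under these assumed numbers the data set sits at roughly 25x, so keeping about 80% of reads would bring it down to the 20x level.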


Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2

Genes ◽

10.3390/genes11020141 ◽

2020 ◽

Vol 11 (2) ◽

pp. 141 ◽

Cited By ~ 5

Author(s):

Feichen Shen ◽

Jeffrey M. Kidd

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Sequence Data ◽

Data Sets ◽

Short Read ◽

Major Mechanism ◽

Rapid Construction ◽

A Genome ◽

Number Variation ◽

Short Read Sequence

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy-number variation affecting multiple genes at the APOBEC3 locus.
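The idea of tabulating unique k-mers, as described in the abstract above, can be illustrated with a toy Python sketch. This is not QuicK-mer2's implementation: the function names, the diploid scaling and the tiny example sequences are assumptions for illustration. The sketch also ignores reverse-complement strands, sequencing errors and any depth normalization a real tool would need.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unique_kmers(reference, k):
    """k-mers occurring exactly once in the reference (paralog-specific markers)."""
    counts = Counter(kmers(reference, k))
    return {km for km, n in counts.items() if n == 1}


def mean_kmer_depth(reads, markers, k):
    """Average number of times each marker k-mer is seen across the reads."""
    if not markers:
        return 0.0
    seen = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in markers:
                seen[km] += 1
    return sum(seen[km] for km in markers) / len(markers)


def estimate_copy_number(region_depth, control_depth, ploidy=2):
    """Scale region depth by the depth of a control region assumed to be at `ploidy`."""
    if control_depth <= 0:
        raise ValueError("control_depth must be positive")
    return ploidy * region_depth / control_depth


# Toy data: marker k-mers are counted directly in reads, with no alignment step.
markers = unique_kmers("AAAAACGT", 4)
depth = mean_kmer_depth(["ACGTA", "ACGTA", "TTGCA"], {"ACGT", "TTGC"}, 4)
```

Because depth is read off k-mers that occur only once in the reference, each paralog can be assessed separately without mapping reads, which is the property the abstract emphasizes.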


De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae

Genome Research ◽

10.1101/gr.083311.108 ◽

2008 ◽

Vol 19 (2) ◽

pp. 294-305 ◽

Cited By ~ 103

Author(s):

J. A. Reinhardt ◽

D. A. Baltrus ◽

M. T. Nishimura ◽

W. R. Jeck ◽

C. D. Jones ◽

...

Keyword(s):

Pseudomonas Syringae ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Short Read ◽

Short Read Sequence ◽

Low Coverage


Easy and Accurate Reconstruction of Whole HIV Genomes from Short-Read Sequence Data

10.1101/092916 ◽

2016 ◽

Cited By ~ 4

Author(s):

Chris Wymant ◽

François Blanquart ◽

Astrid Gall ◽

Margreet Bakker ◽

Daniela Bezemer ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Consensus Sequence ◽

Heterogeneous Data ◽

Reference Sequence ◽

Data Sets ◽

Illumina Platform ◽

Short Read ◽

Variant Information ◽

Short Read Sequence

Abstract: Next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of rapid between- and within-host evolution may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by effectively aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to preprocess reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We use shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read data produced with the Illumina platform, for 65 existing publicly available samples and 50 new samples. We show the systematic superiority of mapping to shiver's constructed reference over mapping the same reads to the standard reference HXB2: an average of 29 bases per sample are called differently, of which 98.5% are supported by higher coverage. We also provide a practical guide to working with imperfect contigs.
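To make "consensus sequence and minority variant information" concrete, the following Python sketch calls a consensus base and minority variants at a single position from per-base read counts. It is a simplified illustration, not shiver's algorithm: the function name, the minimum depth of 10 and the 5% minority threshold are hypothetical choices made for this example.

```python
def call_position(base_counts, min_depth=10, minority_threshold=0.05):
    """Return (consensus_base, {minority_base: frequency}) for one position.

    base_counts maps each base to the number of mapped reads supporting it.
    Positions with fewer than min_depth reads are reported as 'N'.
    """
    depth = sum(base_counts.values())
    if depth < min_depth:
        return "N", {}
    # Sorting first makes ties resolve alphabetically, so results are deterministic.
    consensus = max(sorted(base_counts), key=lambda base: base_counts[base])
    minority = {
        base: count / depth
        for base, count in base_counts.items()
        if base != consensus and count / depth >= minority_threshold
    }
    return consensus, minority


well_covered = call_position({"A": 90, "G": 10})
poorly_covered = call_position({"A": 3})
print(well_covered, poorly_covered)
```

The low-coverage case shows why higher coverage matters for base calls: with too few reads the position is left uncalled rather than guessed.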
