Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses

Abstract: Next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of rapid between- and within-host evolution may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by effectively aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to preprocess reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We use shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read data produced with the Illumina platform, for 65 existing publicly available samples and 50 new samples. We show the systematic superiority of mapping to shiver’s constructed reference over mapping the same reads to the standard reference HXB2: an average of 29 bases per sample are called differently, of which 98.5% are supported by higher coverage. We also provide a practical guide to working with imperfect contigs.
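
The comparison of base calls between two mappings can be illustrated with a minimal sketch. It counts positions at which two consensus sequences, assumed to be already aligned to the same length, carry different calls. The function name `count_differing_calls`, the treatment of `N`, `-` and `?` as "no call", and the toy sequences are assumptions made for illustration; none of this is taken from shiver itself.

```python
def count_differing_calls(consensus_a: str, consensus_b: str, ignore: str = "N-?") -> int:
    """Count aligned positions where both sequences have a base call and the calls differ."""
    if len(consensus_a) != len(consensus_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    for base_a, base_b in zip(consensus_a.upper(), consensus_b.upper()):
        # Skip positions where either consensus has no confident call.
        if base_a in ignore or base_b in ignore:
            continue
        if base_a != base_b:
            differences += 1
    return differences


# Toy aligned consensus sequences (hypothetical data, not from the study).
from_tailored_reference = "ACGTTGCA-ACGT"
from_standard_reference = "ACGATGCANACCT"
print(count_differing_calls(from_tailored_reference, from_standard_reference))
```

In practice each differing position would also need the read coverage supporting each call, which this sketch leaves out.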

ISMapper: Identifying insertion sequences in bacterial genomes from short read sequence data

10.1101/016345 ◽ 2015

Author(s): Jane Hawkey ◽ Mohammad Hamidian ◽ Ryan R Wick ◽ David J Edwards ◽ Helen Billman-Jacobe ◽ ...

Keyword(s): Sequence Data ◽ Insertion Sequences ◽ Bacterial Genomes ◽ Short Read ◽ Phenotypic Resistance ◽ Genome Wide ◽ Wide Range ◽ Multiple Copies ◽ Short Read Sequence ◽ Insertion Sites

Background: Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However, IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, direct from paired-end short read data. Results: ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 96% of known IS insertions in the analysis of simulated reads, and 98% in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was reliable for average genome-wide read depths >20x. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots. Conclusions: ISMapper provides a rapid and robust method for identifying IS insertion sites direct from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.

Mutational sequencing for accurate count and long-range assembly

10.1101/149740 ◽ 2017

Author(s): Vijay Kumar ◽ Julie Rosenbaum ◽ Zihua Wang ◽ Talitha Forcier ◽ Michael Ronemus ◽ ...

Keyword(s): Long Range ◽ Copy Number ◽ Sequence Data ◽ Template Molecule ◽ Short Read ◽ Unique Pattern ◽ Short Read Sequence

Abstract: We introduce a new protocol, mutational sequencing or muSeq, which randomly deaminates unmethylated cytosines at a fixed and tunable rate. The muSeq protocol marks each initial template molecule with a unique mutation signature that is present in every copy of the template, and in every fragmented copy of a copy. In the sequenced read data, this signature is observed as a unique pattern of C-to-T or G-to-A nucleotide conversions. Clustering reads with the same conversion pattern enables accurate count and long-range assembly of initial template molecules from short-read sequence data. We explore count and low-error sequencing by profiling a 135,000 fragment PstI representation, demonstrating that muSeq improves copy number inference and significantly reduces sporadic sequencer error. We explore long-range assembly in the context of cDNA, generating contiguous transcript clusters greater than 3,000 bp in length. The muSeq assemblies reveal transcriptional diversity not observable from short-read data alone.
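
The idea of grouping reads by conversion pattern can be sketched in a few lines. The example below is a deliberately simplified illustration, not the muSeq pipeline: it assumes every read spans the same reference window, considers only C-to-T conversions on one strand, and ignores sequencing error. The function names and the toy reference and reads are hypothetical.

```python
from collections import defaultdict


def conversion_signature(reference: str, read: str) -> tuple:
    """Return the positions where a reference C is observed as T in the read."""
    return tuple(
        i for i, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "T"
    )


def cluster_by_signature(reference: str, reads: list) -> dict:
    """Group reads that share exactly the same C-to-T conversion pattern."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[conversion_signature(reference, read)].append(read)
    return dict(clusters)


# Toy data: two hypothetical template molecules, copied 2 and 3 times.
reference = "ACCGTCAC"
reads = ["ATCGTCAT", "ATCGTCAT", "ACTGTTAC", "ACTGTTAC", "ACTGTTAC"]

clusters = cluster_by_signature(reference, reads)
for signature, members in clusters.items():
    print(signature, len(members))
```

Under these assumptions the number of clusters estimates the number of initial templates, while the cluster sizes reflect how often each template was copied and sequenced.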

MOST: a modified MLST typing tool based on short read sequencing

PeerJ ◽ 10.7717/peerj.2308 ◽ 2016 ◽ Vol 4 ◽ pp. e2308 ◽ Cited By ~ 63

Author(s): Rediat Tewolde ◽ Timothy Dallman ◽ Ulf Schaefer ◽ Carmen L. Sheppard ◽ Philip Ashton ◽ ...

Keyword(s): Conventional Method ◽ Sequence Data ◽ Pcr Amplification ◽ Housekeeping Genes ◽ Data Sets ◽ Bacterial Genomes ◽ Bacterial Populations ◽ Short Read ◽ Short Read Sequence ◽ Low Coverage

Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper compares the reliability of MLST results derived from WGS data by mapping- and assembly-based approaches with that of the conventional method, using 323 bacterial genomes of diverse species. The sensitivity of the two WGS-based methods was further investigated with 26 mixed and 29 low coverage genomic data sets from Salmonella enteridis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly- and mapping-based approaches, respectively. The concordance between samples that were typed by the conventional method (92.9%) and both WGS methods was 100%. From the 55 mixed and low coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping- and assembly-based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS-based methods, the mapping-based approach was the most sensitive. In addition, the mapping-based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS-assembly-based approaches.
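
The percentages quoted above follow directly from the reported counts. The short sketch below recomputes them; the helper name `percent` and the dictionary layout are illustrative choices, while the counts themselves are taken from the abstract.

```python
def percent(count: int, total: int) -> float:
    """Share of samples with a full MLST profile, in percent, to one decimal place."""
    return round(100 * count / total, 1)


# Full MLST profiles out of 323 genomes, per method (counts from the abstract).
full_profiles = {"conventional": 300, "assembly-based": 315, "mapping-based": 322}
for method, n in full_profiles.items():
    print(f"{method}: {percent(n, 323)}% (n = {n})")

# Full MLST profiles out of the 55 mixed and low coverage data sets.
difficult_sets = {"mapping-based": 49, "assembly-based": 37}
for method, n in difficult_sets.items():
    print(f"{method}: {percent(n, 55)}% (n = {n})")
```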

Using Short Read Sequencing to Characterise Balanced Reciprocal Translocations in Pigs

10.21203/rs.3.rs-28830/v1 ◽ 2020

Author(s): Aniek Cornelia Bouwman ◽ Martijn F.L. Derks ◽ Marleen L.W.J. Broekhuijse ◽ Barbara Harlizius ◽ Roel F. Veerkamp

Keyword(s): Sequence Data ◽ Variant Calling ◽ Reciprocal Translocations ◽ Short Read ◽ Short Read Sequencing ◽ Long Read ◽ Short Read Sequence ◽ Staining Techniques ◽ Chromosome Staining ◽ Paired End Sequencing

Abstract. Background: A balanced constitutional reciprocal translocation (RT) is a mutual exchange of terminal segments of two non-homologous chromosomes without any loss or gain of DNA in germline cells. Carriers of balanced RTs are viable individuals with no apparent phenotypical consequences. These animals, however, produce unbalanced gametes and therefore show reduced fertility and offspring with congenital abnormalities. This cytogenetic abnormality is usually detected using chromosome staining techniques. The aim of this study was to test the possibilities of using paired-end short read sequencing for detection of balanced RTs in boars and to investigate their breakpoints and junctions. Results: Balanced RTs were recovered in a blinded analysis, using the structural variant calling software DELLY, in 6 of the 7 carriers with 30-fold short read paired-end sequencing. In 15 non-carriers we did not detect any RTs. Reducing the coverage to 20-fold, 15-fold and 10-fold showed that at least 20-fold coverage is required to obtain good results. One RT was not detected in the blind screening; however, a highly likely RT was discovered after unblinding. This RT was located in a repetitive region, showing the limitations of short read sequence data. The detailed analysis of the breakpoints and junctions suggested three junctions showing microhomology, three junctions with blunt-end ligation, and three micro-insertions at the breakpoint junctions. The detected RTs were also shown to disrupt genes. Conclusions: We conclude that paired-end short read sequence data can be used to detect and characterize balanced reciprocal translocations if sequencing depth is at least 20-fold coverage. However, translocations in repetitive areas may require large fragments or even long read sequence data.
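
The abstract does not say how coverage was reduced. One common way, sketched here purely as an assumption, is to keep each read pair at random with probability equal to the ratio of target depth to current depth. The function names, the fixed seed and the placeholder read identifiers are hypothetical.

```python
import random


def keep_fraction(current_depth: float, target_depth: float) -> float:
    """Fraction of read pairs to keep so that mean depth drops to the target."""
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    return min(1.0, target_depth / current_depth)


def subsample_pairs(read_pairs: list, fraction: float, seed: int = 0) -> list:
    """Keep each read pair independently with the given probability."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [pair for pair in read_pairs if rng.random() < fraction]


# Placeholder read pairs standing in for a 30-fold data set.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(3000)]
for target in (20, 15, 10):
    fraction = keep_fraction(30, target)
    kept = subsample_pairs(pairs, fraction)
    print(f"{target}-fold: keep {fraction:.2f} of pairs, {len(kept)} retained")
```

Both mates of a pair are kept or dropped together, which matters for paired-end structural variant callers that rely on discordant pairs.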

Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2

Genes ◽ 10.3390/genes11020141 ◽ 2020 ◽ Vol 11 (2) ◽ pp. 141 ◽ Cited By ~ 5

Author(s): Feichen Shen ◽ Jeffrey M. Kidd

Keyword(s): Copy Number Variation ◽ Copy Number ◽ Sequence Data ◽ Data Sets ◽ Short Read ◽ Major Mechanism ◽ Rapid Construction ◽ A Genome ◽ Number Variation ◽ Short Read Sequence

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two and 92 genes where the majority of samples have a copy number other than two, and describe rare copy-number variation affecting multiple genes at the APOBEC3 locus.
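
Tabulating k-mers from reads, the general idea this approach builds on, can be sketched as follows. This is a toy illustration and not QuicK-mer2's implementation: it ignores reverse complements, GC correction and normalization to copy number, and the function names, `k` value and reads are assumptions.

```python
from collections import Counter


def count_kmers(reads: list, k: int) -> Counter:
    """Count every k-mer occurring in the reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def mean_depth(counts: Counter, paralog_kmers: list) -> float:
    """Mean observed count over k-mers assumed to be unique to one paralog."""
    return sum(counts[kmer] for kmer in paralog_kmers) / len(paralog_kmers)


# Toy reads (hypothetical data).
reads = ["ACGTAC", "CGTACG", "GTACGT"]
counts = count_kmers(reads, k=3)
print(counts["ACG"], mean_depth(counts, ["ACG", "TTT"]))
```

Because only k-mers unique to one paralog contribute to its depth, reads never have to be placed on a reference, which is what makes a mapping-free estimate possible.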

Evaluating short-read sequence data from the highly redundant, novel transcriptome of Polarella glacialis

Genome Biology ◽ 10.1186/gb-2011-12-s1-p5 ◽ 2011 ◽ Vol 12 (Suppl 1) ◽ pp. P5

Author(s): Theodore R Gibbons ◽ Gregory T Concepcion ◽ Tsvetan R Bachvaroff ◽ Charles F Delwiche

Keyword(s): Sequence Data ◽ Short Read ◽ Short Read Sequence

De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae

Genome Research ◽ 10.1101/gr.083311.108 ◽ 2008 ◽ Vol 19 (2) ◽ pp. 294-305 ◽ Cited By ~ 103

Author(s): J. A. Reinhardt ◽ D. A. Baltrus ◽ M. T. Nishimura ◽ W. R. Jeck ◽ C. D. Jones ◽ ...

Keyword(s): Pseudomonas Syringae ◽ De Novo Assembly ◽ De Novo ◽ Sequence Data ◽ Short Read ◽ Short Read Sequence ◽ Low Coverage

Detection of extended-spectrum beta-lactamase (ESBL) genes and plasmid replicons in Enterobacteriaceae using PlasmidSPAdes assembly of short-read sequence data

Microbial Genomics ◽ 10.1099/mgen.0.000400 ◽ 2020 ◽ Vol 6 (7)

Author(s): Joep J.J.M. Stohr ◽ Marjolein F.Q. Kluytmans-van den Bergh ◽ Ronald Wedema ◽ Alexander W. Friedrich ◽ Jan A.J.W. Kluytmans ◽ ...

Keyword(s): Sequence Data ◽ Beta Lactamase ◽ Short Read ◽ Content Type ◽ Extended Spectrum Beta Lactamase ◽ Link Type ◽ Extended Spectrum ◽ Plasmid Replicon ◽ Short Read Sequence ◽ Plasmid Replicons

Knowledge of the epidemiology of plasmids is essential for understanding the evolution and spread of antimicrobial resistance. PlasmidSPAdes attempts to reconstruct plasmids using short-read sequence data. Accurate detection of extended-spectrum beta-lactamase (ESBL) genes and plasmid replicon genes is a prerequisite for the use of plasmid assembly tools to investigate the role of plasmids in the spread and evolution of ESBL production in Enterobacteriaceae. This study evaluated the performance of PlasmidSPAdes plasmid assembly for Enterobacteriaceae in terms of detection of ESBL-encoding genes, plasmid replicons and chromosomal wgMLST genes, and assessed the effect of k-mer size. Short-read sequence data for 59 ESBL-producing Enterobacteriaceae were assembled with PlasmidSPAdes using different k-mer sizes (21, 33, 55, 77, 99 and 127). For every k-mer size, the presence of ESBL genes, plasmid replicons and a selection of chromosomal wgMLST genes in the plasmid assembly was determined. Out of 241 plasmid replicons and 66 ESBL genes detected by whole-genome assembly, 213 plasmid replicons [88 %; 95 % confidence interval (CI): 83.9–91.9] and 43 ESBL genes (65 %; 95 % CI: 53.1–75.6) were detected in the plasmid assemblies obtained by PlasmidSPAdes. For most ESBL genes (83.3 %) and plasmid replicons (72.0 %), detection results did not differ between the k-mer sizes used in the plasmid assembly. No optimal k-mer size could be defined for the number of ESBL genes and plasmid replicons detected. For most isolates, the number of chromosomal wgMLST genes detected in the plasmid assemblies decreased with increasing k-mer size. Based on our findings, PlasmidSPAdes is not a suitable plasmid assembly tool for short-read sequence data for ESBL-encoding plasmids of Enterobacteriaceae.
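
The detection proportions and their confidence intervals can be approximated from the counts given above. The abstract does not state which interval method the authors used, so the Wilson score interval in this sketch is an assumption and its bounds may differ slightly from the reported ones. The function name and the fixed `z = 1.96` are illustrative choices.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the abstract: detected in plasmid assembly / detected by whole-genome assembly.
for label, detected, reference_count in [("plasmid replicons", 213, 241), ("ESBL genes", 43, 66)]:
    low, high = wilson_interval(detected, reference_count)
    share = 100 * detected / reference_count
    print(f"{label}: {share:.0f} % (approx. 95 % CI: {100 * low:.1f} to {100 * high:.1f})")
```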
