PlasmidTron: assembling the cause of phenotypes from NGS data

Abstract: When defining bacterial populations through whole genome sequencing (WGS), the samples often have detailed associated metadata that relate to disease severity, antimicrobial resistance, or even rare biochemical traits. When comparing these bacterial populations, it is apparent that some of these phenotypes do not follow the phylogeny of the host, i.e. they are genetically unlinked to the evolutionary history of the host bacterium. One possible explanation for this phenomenon is that the genes are moving independently between hosts and are likely associated with mobile genetic elements (MGEs). However, identifying the element that is associated with these traits can be complex if the starting point is short read WGS data. With the increased use of next generation WGS in routine diagnostics, surveillance and epidemiology, a vast amount of short read data is available and these types of associations are relatively unexplored. One way to address this would be to perform de novo assembly of the whole genome read data, including its MGEs. However, MGEs are often full of repeats, which can lead to fragmented consensus sequences. Deciding which sequence is part of the chromosome and which is part of an MGE can be ambiguous. We present PlasmidTron, which utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype. Given a set of reads, categorised into cases (showing the phenotype) and controls (phylogenetically related but phenotypically negative), PlasmidTron can be used to assemble de novo the reads from each sample that are linked to the phenotype. A k-mer based analysis is performed to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs.
By utilising k-mers and only assembling a fraction of the raw reads, the method is fast and scalable to large datasets. This approach has been tested on plasmids, because of their contribution to important pathogen associated traits, such as antimicrobial resistance (AMR), hence the name, but there is no reason why this approach cannot be utilised for any MGE that can move independently through a bacterial population. PlasmidTron is written in Python 3 and available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron.
DATA SUMMARY
Source code for PlasmidTron is available from GitHub under the open source licence GNU GPL 3; (url - https://goo.gl/ot6rT5)
Simulated raw reads files have been deposited in Figshare; (url - https://doi.org/10.6084/m9.figshare.5406355.vl)
Salmonella enterica serovar Weltevreden strain VNS10259 is available from GenBank; accession number GCA_001409135.
Salmonella enterica serovar Typhi strain BL60006 is available from GenBank; accession number GCA_900185485.
Accession numbers for all of the Illumina datasets used in this paper are listed in the supplementary tables.
I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
IMPACT STATEMENT
PlasmidTron utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype.
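The case/control k-mer idea described in the abstract can be illustrated with a minimal Python sketch. This is not PlasmidTron's implementation: the function names, the `min_case_fraction` and `min_hit_fraction` thresholds, and the in-memory sets are all assumptions made for illustration, and reverse complements are ignored for brevity. It keeps k-mers seen in the case samples but in no control, then selects the reads dominated by those k-mers; the selected reads would then be passed to a de novo assembler.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def trait_kmers(case_samples, control_samples, k, min_case_fraction=1.0):
    """k-mers found in enough case samples and in no control sample.

    Each sample is a list of read strings. min_case_fraction is a
    hypothetical threshold, not a documented PlasmidTron parameter.
    """
    case_counts = Counter()
    for reads in case_samples:
        sample_kmers = set()
        for read in reads:
            sample_kmers |= kmers(read, k)
        case_counts.update(sample_kmers)  # count each k-mer once per sample

    control_kmers = set()
    for reads in control_samples:
        for read in reads:
            control_kmers |= kmers(read, k)

    needed = min_case_fraction * len(case_samples)
    return {km for km, n in case_counts.items()
            if n >= needed and km not in control_kmers}


def select_reads(reads, wanted, k, min_hit_fraction=0.5):
    """Keep reads in which enough of the k-mers are trait-associated."""
    selected = []
    for read in reads:
        read_kmers = kmers(read, k)
        if read_kmers and len(read_kmers & wanted) / len(read_kmers) >= min_hit_fraction:
            selected.append(read)
    return selected
```

With toy data in which every sample shares a "chromosomal" read and only the cases carry a second read, `select_reads` returns only the case-specific read, so only a small fraction of the input would reach the assembly step.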

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract: Motivation: Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable solution available is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but that are present in PacBio data.
Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions.
Availability: Software is freely available at https://github.com/1dayac/[email protected]
Supplementary information: Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary

ReadFilter - Filtering reads of interest for quicker downstream analysis

10.1101/266080 ◽

2018 ◽

Author(s):

Kim Lee Ng ◽

Thor Bech Johannesen ◽

Mark Østerlund ◽

Kristoffer Kiil ◽

Paal Skytt Andersen ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Whole Genome ◽

Assembly Time ◽

Link Type ◽

Redundant Data ◽

Run Time ◽

Downstream Analysis

Abstract: Whole-genome sequencing is becoming the method of choice but provides redundant data for many tasks. ReadFilter (https://github.com/ssi-dk/serum_readfilter) is offered as a way to improve run time of these tasks by rapidly filtering reads against user-specified sequences in order to work with a small fraction of original reads while maintaining accuracy. This can noticeably reduce mapping time and substantially reduce de novo assembly time.
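The abstract does not say how ReadFilter matches reads to the user-specified sequences. One simple way to picture "filtering reads against target sequences" is a shared k-mer test, sketched below in Python. The function names, the (name, sequence) read tuples, and the single-shared-k-mer rule are assumptions for illustration only, not ReadFilter's actual method.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(targets, k):
    """Set of all k-mers from the targets and their reverse complements."""
    index = set()
    for target in targets:
        for strand in (target, reverse_complement(target)):
            for i in range(len(strand) - k + 1):
                index.add(strand[i:i + k])
    return index


def filter_reads(reads, index, k):
    """Yield (name, sequence) reads sharing at least one k-mer with the index."""
    for name, seq in reads:
        if any(seq[i:i + k] in index for i in range(len(seq) - k + 1)):
            yield name, seq
```

Only the reads that pass the filter would be handed to the mapper or assembler, which is where the run-time saving described in the abstract would come from.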

Harmonization of whole-genome sequencing for outbreak surveillance of Enterobacteriaceae and Enterococci

Microbial Genomics ◽

10.1099/mgen.0.000567 ◽

2021 ◽

Vol 7 (7) ◽

Author(s):

Casper Jamin ◽

Sien De Koster ◽

Stefanie van Koeveringe ◽

Dieter De Coninck ◽

Klaas Mensaert ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Type Species ◽

De Novo ◽

Whole Genome ◽

Data Generation ◽

Sequencing Data ◽

Content Type ◽

Link Type ◽

Antimicrobial Resistance Genes

Whole-genome sequencing (WGS) is becoming the de facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. However, interoperability for WGS of bacterial outbreaks is poorly understood. We hypothesized that harmonization of WGS for outbreak surveillance is achievable through the use of identical protocols for both data generation and data analysis. A set of 30 bacterial isolates, comprising various species belonging to the Enterobacteriaceae family and the Enterococcus genus, were selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data were analysed by one centre using BioNumerics (6.7.3) for (i) genotyping origins of replication and antimicrobial resistance genes, (ii) core-genome multi-locus sequence typing (cgMLST) for Escherichia coli and Klebsiella pneumoniae and (iii) whole-genome multi-locus sequence typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% were achieved for genotyping. Based on cgMLST, a discrepant allele was called only in 2/27 and 3/15 comparisons between two genomes, for E. coli and K. pneumoniae, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0 to 11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonizing data processing for the surveillance of bacterial outbreaks. In summary, multi-centre WGS for bacterial surveillance is achievable, but only if protocols are harmonized.
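The abstract mentions a split k-mer analysis without describing it. As a rough illustration of the general idea, a split k-mer is a pair of flanking k-mers around one middle base; two samples that share the flanks but differ in the middle base have a candidate SNP at that position. The Python sketch below is a simplification under stated assumptions: the function names are hypothetical, `k` is the length of each flank, only one strand is considered, and flanks seen with more than one middle base in a sample are marked ambiguous and skipped.

```python
def split_kmers(seq, k):
    """Map (left flank, right flank) -> middle base for every window of 2k+1."""
    table = {}
    for i in range(len(seq) - 2 * k):
        left = seq[i:i + k]
        middle = seq[i + k]
        right = seq[i + k + 1:i + 2 * k + 1]
        key = (left, right)
        if key in table and table[key] != middle:
            table[key] = "N"  # same flanks, different middle base: ambiguous
        else:
            table[key] = middle
    return table


def snp_count(seq_a, seq_b, k):
    """Count shared flank pairs whose unambiguous middle bases differ."""
    a, b = split_kmers(seq_a, k), split_kmers(seq_b, k)
    return sum(
        1
        for key in a.keys() & b.keys()
        if "N" not in (a[key], b[key]) and a[key] != b[key]
    )
```

Because the comparison works directly on sequence content, no reference genome or alignment is needed, which is one reason such counts are convenient for comparing results between centres.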

Family Trio-Based Whole Genome Optical Mapping Identifies Candidate Structural Variations Predisposing Children to Acute Lymphoblastic Leukemia

Blood ◽

10.1182/blood-2019-130014 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 5201-5201

Author(s):

Ute Fischer ◽

Layal Yasin ◽

Julia Täubner ◽

Triantafyllia Brozou ◽

Arndt Borkhardt

Keyword(s):

Acute Lymphoblastic Leukemia ◽

Family History ◽

De Novo ◽

Lymphoblastic Leukemia ◽

Germline Mutations ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Short Read Sequencing ◽

Family Trio

Germline mutations account for a substantial proportion of childhood cancer and may critically affect disease characteristics, therapy efficacy, severity of treatment side effects and patient outcome. To date, only 8-10% of childhood cancer cases can be explained by germline mutations identified in known cancer predisposing genes. This is in part due to the technical limitation of next generation short read sequencing, which detects single nucleotide variants, small deletions/insertions or simple copy number variations, but is not a reliable tool to identify larger structural variations (SVs, >500 bp) which are frequent in the human genome and may impact on disease predisposition. Using whole genome optical mapping (WGOM) we aimed at identification of de novo and inherited germline SVs in a cohort of patients with clinically suspected cancer predisposition but without informative findings in short read sequencing analyses. After informed consent we performed family trio based short read (2x 100 bp) whole exome sequencing (WES) on a HiSeq2500 (Illumina) and collected clinical and demographic data for a cohort of >100 families with children affected by cancer who were treated in our hospital. About 25% of the patients either (1) had a family history indicative of cancer susceptibility, or (2) had accompanying clinical findings (e.g. developmental delay, congenital anomalies) or (3) experienced excessive toxicity during chemotherapy. From this subgroup we selected four patients with acute lymphoblastic leukemia whose sequencing data and routine genetic workup were not informative of a known cancer predisposing syndrome and employed family trio-based next generation WGOM on a Saphyr instrument equipped with Access software (Bionano Genomics) to identify genomic SVs. To this end, we extracted and labeled high molecular weight DNA molecules at specific hexamer sequence motifs (average distance: 5 kb) using a DNA methyltransferase-based direct labeling reaction. 
Imaging was carried out at the single-molecule level and each sample genome was de novo assembled from molecule data. Consensus genome maps were clustered into two alleles and diploid assemblies created. Genomes of patients were compared to parental genomes and the GRCh38 reference genome. SVs were inferred from de novo assemblies and genome comparisons with respect to quality scores, overall molecule coverage, fraction of molecules displaying the SV event, and chimeric DNA fragment mapping. Specific SV calls were compared to a set of >160 human control samples (provided by Bionano Genomics) to filter against common SVs and potential artifacts. Filtered SVs were annotated using structural variant and gene databases. Employing WGOM we analyzed DNA molecules 300,000 bp long on average and achieved genomic coverage ranging from 90-132x, corresponding to 330-480 Gbp. For instance, for one patient, we obtained 1751 insertions, 624 deletions, 77 inversions, 21 duplications, 1 intra- and 2 inter-chromosomal translocations before filtering. The majority of these events (78%) were inherited from both parents, 20% were inherited from either father or mother, and 2% were generated de novo. As the family history of this patient was inconspicuous for tumor diseases, we removed all inherited events and filtered against common variants. This resulted in only two candidate de novo lesions: a heterozygous 129,495 bp deletion framed by inversions (chr9: 66,156,733-66,622,623) in a gene-less region and a heterozygous inverted 352,667 bp duplication (chr22: 15,522,454-15,875,120) that spanned the genes OR11H, POTEH, POTEH-AS1, LINC01297, DUXAP8, and BMS1P22. Of these genes, DUXAP8 is an oncogenic non-coding RNA of the homeobox gene family that has been associated with increased tumor growth and poorer prognosis in a wide variety of somatic cancers.
It functions as a regulator of transcription by binding to key components of the developmental regulator epigenetic polycomb repressive complex 2 and may thus account for additional presentations of the child (dwarfism, accelerated skeletal age, linguistic developmental delay, morphological traits). Our results indicate that WGOM is a useful technology to identify candidate SVs in children predisposed to cancer and developmental syndromes. Several candidates are currently being tested and the results will be presented. Disclosures: No relevant conflicts of interest to declare.
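The trio comparison described in the abstract (events inherited from both parents, from one parent, or arising de novo) can be pictured with a short Python sketch. It is only an illustration: the function name and the representation of an SV call as a hashable tuple are assumptions, and calls are matched by exact identity, whereas real SV comparison would need tolerance for breakpoint uncertainty.

```python
def classify_trio_calls(child, father, mother):
    """Count the child's SV calls by apparent origin.

    Each call is any hashable description of an SV, for example a
    hypothetical (type, chromosome, start) tuple.
    """
    father, mother = set(father), set(mother)
    counts = {"both_parents": 0, "one_parent": 0, "de_novo": 0}
    for call in child:
        # 0, 1 or 2 parents carry the same call
        carriers = (call in father) + (call in mother)
        label = ("de_novo", "one_parent", "both_parents")[carriers]
        counts[label] += 1
    return counts
```

Calls left in the `de_novo` group would still need filtering against control samples before being treated as candidates, as the abstract describes.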

The complete genome sequence of Stevia rebaudiana, the Sweetleaf

F1000Research ◽

10.12688/f1000research.24396.1 ◽

2020 ◽

Vol 9 ◽

pp. 751

Author(s):

Kathleen O'Neill ◽

Stacy Pirro

Keyword(s):

Genome Sequence ◽

Complete Genome Sequence ◽

Related Species ◽

Complete Genome ◽

De Novo ◽

Stevia Rebaudiana ◽

Whole Genome Sequence ◽

Whole Genome ◽

Link Type ◽

Sequence Read Archive

The Sweetleaf (Stevia rebaudiana: Asteraceae) is widely grown for use as a sweetener. We present the whole genome sequence and annotation of this species. A total of 146,838,888 paired-end reads consisting of 22.2 G bases were obtained by sequencing one leaf from a commercially grown seedling. The reads were assembled by a de novo method followed by alignment to related species. Annotation was performed via GenMark-ES. The raw and assembled data are publicly available via GenBank: Sequence Read Archive (SRR6792730) and Assembly (GCA_009936405).

A benchmarking of human mitochondrial DNA haplogroup classifiers from whole-genome and whole-exome sequence data

10.1101/2021.02.11.430775 ◽

2021 ◽

Author(s):

Víctor García-Olivares ◽

Adrián Muñoz-Barrera ◽

José Miguel Lorenzo-Salazar ◽

Carlos Zaragoza-Trello ◽

Luis A. Rubio-Rodríguez ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Data ◽

Qualitative Assessment ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Short Read ◽

Bioinformatic Tools ◽

Whole Exome

Abstract: The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, because of its relevance, we also assess the best performing tool in third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (based on either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct the mtDNA analyses from HTS data.

Resolving the Full Spectrum of Human Genome Variation using Linked-Reads

10.1101/230946 ◽

2017 ◽

Cited By ~ 8

Author(s):

Patrick Marks ◽

Sarah Garcia ◽

Alvaro Martinez Barrio ◽

Kamila Belhocine ◽

Jorge Bernate ◽

...

Keyword(s):

Human Genome ◽

Large Scale ◽

De Novo ◽

Simultaneous Detection ◽

Whole Genome ◽

Structural Variations ◽

Full Spectrum ◽

Short Read ◽

Short Reads ◽

A Genome

Abstract: Large-scale population based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long range information while harnessing the advantages of short reads. Starting from only ∼1 ng of DNA, we produce barcoded short read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin, producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short read approaches for reference based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.

The complete genome sequence of Toxicodendron radicans, Eastern Poison Ivy

F1000Research ◽

10.12688/f1000research.25556.1 ◽

2020 ◽

Vol 9 ◽

pp. 1015

Author(s):

Toby Pirro ◽

Stacy Pirro

Keyword(s):

Contact Dermatitis ◽

North America ◽

Genome Sequence ◽

De Novo ◽

Whole Genome Sequence ◽

Eastern North America ◽

Whole Genome ◽

Link Type ◽

Sequence Read Archive ◽

Poison Ivy

Eastern Poison Ivy (Toxicodendron radicans, Anacardiaceae) is well known in Eastern North America for causing contact dermatitis, an itchy and painful rash in most people who come in contact with it. We present the whole genome sequence and annotation of this species. A total of 96,255,779 paired-end reads consisting of 28.9 G bases were obtained by sequencing one leaf from a wild-collected plant. The reads were assembled by a de novo method followed by alignment to related species. Annotation was performed via GenMark-ES. The raw and assembled data are publicly available via GenBank: Sequence Read Archive (SRR10325927) and Assembly (GCA_009867345).

Whole genome analysis of an extended pedigree with Prader–Willi Syndrome, hereditary hemochromatosis, and dysautonomia-like symptoms

10.1101/019182 ◽

2015 ◽

Author(s):

Han Fang ◽

Yiyang Wu ◽

Margaret Yoon ◽

Laura T. Jiménez-Barrón ◽

Jason A. O'Rawe ◽

...

Keyword(s):

De Novo ◽

Chromosome Region ◽

Hereditary Hemochromatosis ◽

Copy Number Variations ◽

Prader Willi Syndrome ◽

Whole Genome ◽

Whole Genome Analysis ◽

Phenotypic Data ◽

Human Phenotype ◽

Congenital Insensitivity

This report includes the discovery and analysis of a pedigree with Prader–Willi Syndrome (PWS), hereditary hemochromatosis (HH), and dysautonomia-like symptoms. Nine members of the family participated in whole genome sequencing (WGS), which enabled a wide scope of variant calling from single-nucleotide polymorphisms to copy number variations. First, a 5.5 Mb de novo deletion is identified in the chromosome region 15q11.2 to 15q13.1 in the boy with PWS. Second, a female individual with HH is homozygous for the p.C282Y variant in HFE, a mutation known to be associated with HH. Her brother is homozygous for the same variant, although he has yet to be clinically diagnosed with HH. Third, none of the people with dysautonomia-like symptoms carry any reported or novel rare variants in IKBKAP that are implicated in familial dysautonomia (FD - HSAN III). Although two people with dysautonomia-like symptoms carry two heterozygous variants in NTRK1, a gene that has been shown to contribute to HSAN IV (congenital insensitivity to pain with anhidrosis, a disease that closely resembles FD), this variant is not present in the third proband. Fourth, WGS revealed pharmacogenetic variants influencing the metabolism of warfarin and simvastatin, which are being routinely prescribed to the proband. Finally, reports of the phenotypes were standardized with the Human Phenotype Ontology annotation, which may facilitate the search for other families with similar phenotypes. Due to the extreme heterogeneity and insufficient knowledge of human diseases, it is of crucial importance that both phenotypic data and genomic data are standardized and shared.

New synthetic-diploid benchmark for accurate variant calling evaluation

10.1101/223297 ◽

2017 ◽

Cited By ~ 9

Author(s):

Heng Li ◽

Jonathan M Bloom ◽

Yossi Farjoun ◽

Mark Fleharty ◽

Laura Gauthier ◽

...

Keyword(s):

Cell Lines ◽

Human Cell ◽

Error Rate ◽

De Novo ◽

Variant Calling ◽

Benchmark Dataset ◽

Whole Genome ◽

Human Cell Lines ◽

Short Read ◽

Benchmark Datasets

Constructed from the consensus of multiple variant callers based on short-read data, existing benchmark datasets for evaluating variant calling accuracy are biased toward easy regions accessible by known algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two human cell lines that are homozygous across the whole genome. This benchmark provides a more accurate and less biased estimate of the error rate of small variant calls in a realistic context.
