Combined use of Oxford Nanopore and Illumina sequencing yields insights into soybean structural variation biology

2021
Author(s):
Marc-André Lemay
Jonas A. Sibbesen
Davoud Torkamaneh
Jérémie Hamel
Roger C. Levesque
...

Background: Structural variant (SV) discovery based on short reads is challenging because SVs have complex signatures and tend to occur in repeated regions. The increasing availability of long-read technologies has greatly facilitated SV discovery; however, these technologies remain too costly to apply routinely to population-level studies. Here, we combined short-read and long-read sequencing technologies to provide a comprehensive population-scale assessment of structural variation in a panel of Canadian soybean cultivars. Results: We used Oxford Nanopore sequencing data (~12X mean coverage) for 17 samples both to benchmark SV calls made from the Illumina data and to predict SVs that were subsequently genotyped in a population of 102 samples using Illumina data. Benchmarking results show that variants discovered using Oxford Nanopore can be accurately genotyped from the Illumina data. We first use the genotyped SVs for population structure analysis and show that results are comparable to those based on single-nucleotide variants. We observe that the population frequency of SVs and their distribution within the genome are constrained by the location of genes. Gene Ontology and PFAM domain enrichment analyses also confirm previous reports that genes harboring high-frequency SVs are enriched for functions in defense response. Finally, we discover polymorphic transposable elements from the SVs and report evidence of the recent activity of a Stowaway MITE. Conclusions: Our results demonstrate that long-read and short-read sequencing technologies can be efficiently combined to enhance SV analysis in large populations, providing a reusable framework for their study in a wider range of samples and non-model species.

2019
Author(s):
Lolita Lecompte
Pierre Peterlongo
Dominique Lavenier
Claire Lemaitre

Abstract Motivation Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of reference sequences that represent the two alleles of each structural variant. Long reads are aligned to these reference sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify the presence of each SV allele and estimate the allele frequencies. We provide an implementation of the method, SVJedi, to genotype insertions and deletions with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. Availability https://github.com/llecompte/[email protected]


2021
Vol 12 (1)
Author(s):
Zhixing Feng
Jose C. Clemente
Brandon Wong
Eric E. Schadt

Abstract Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.


2020
Author(s):
Zhixing Feng
Jose Clemente
Brandon Wong
Eric E. Schadt

Abstract Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants (that is, determining whether multiple variants come from the same haplotype) play an instrumental role in deciphering cellular genetic heterogeneity, but remain difficult because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an unprecedented opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥ 0.011%) from long-read metagenomic data. Our approach, therefore, presents a significant advance towards the complete deciphering of cellular genetic heterogeneity.


2020
Vol 36 (17)
pp. 4568-4575
Author(s):
Lolita Lecompte
Pierre Peterlongo
Dominique Lavenier
Claire Lemaitre

Abstract Motivation Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences, one for each of the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify the presence of each SV allele and estimate the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. Availability and implementation https://github.com/llecompte/SVJedi.git Supplementary information Supplementary data are available at Bioinformatics online.
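The final step described in this abstract, turning per-allele read support into a genotype, can be illustrated with a minimal sketch. This is not SVJedi's implementation: the binomial model, the function names `genotype_likelihoods` and `call_genotype`, and the 5% `error` rate are all assumptions chosen for illustration. The inputs are simply the number of informative reads aligned to the reference-allele sequence and to the alternative-allele sequence of one SV.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error=0.05):
    """Binomial likelihood of the read counts under each diploid genotype.

    Hypothetical model: `error` is an assumed probability that a read
    supports the wrong allele; it is not a value taken from the paper.
    """
    n = ref_reads + alt_reads
    # Expected fraction of reads supporting the alternative allele.
    alt_fraction = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    return {
        gt: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for gt, p in alt_fraction.items()
    }


def call_genotype(ref_reads, alt_reads, error=0.05):
    """Return the most likely genotype, or './.' when no read is informative."""
    if ref_reads + alt_reads == 0:
        return "./."
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Illustrative counts only: (reads on reference allele, reads on SV allele).
    for counts in [(10, 0), (5, 5), (0, 10), (0, 0)]:
        print(counts, call_genotype(*counts))
```

Under this assumed model, a site with balanced support for both allele sequences is called heterozygous, while support concentrated on one allele sequence yields the corresponding homozygous call.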


2021
Vol 12 (1)
Author(s):
Chong Chu
Rebeca Borges-Monroy
Vinayak V. Viswanadham
Soohyun Lee
Heng Li
...

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


Cells
2020
Vol 9 (8)
pp. 1776
Author(s):
Mourdas Mohamed
Nguyet Thi-Minh Dang
Yuki Ogyama
Nelly Burlet
Bruno Mugat
...

Transposable elements (TEs) are the main components of genomes. However, due to their repetitive nature, they are very difficult to study using data obtained with short-read sequencing technologies. Here, we describe an efficient pipeline to accurately recover TE insertion (TEI) sites and sequences from long reads obtained by Oxford Nanopore Technology (ONT) sequencing. With this pipeline, we could precisely describe the landscapes of the most recent TEIs in wild-type strains of Drosophila melanogaster and Drosophila simulans. Their comparison suggests that this subset of TE sequences is more similar than previously thought in these two species. The chromosome assemblies obtained using this pipeline also allowed recovering piRNA cluster sequences, which was impossible using short-read sequencing. Finally, we used our pipeline to analyze ONT sequencing data from a D. melanogaster unstable line in which LTR transposition was derepressed for 73 successive generations. We could rely on single reads to identify new insertions with intact target site duplications. Moreover, the detailed analysis of TEIs in the wild-type strains and the unstable line did not support the trap model claiming that piRNA clusters are hotspots of TE insertions.


2019
Vol 8 (34)
Author(s):
Natsuki Tomariguchi
Kentaro Miyazaki

Rubrobacter xylanophilus strain AA3-22, belonging to the phylum Actinobacteria, was isolated from nonvolcanic Arima Onsen (hot spring) in Japan. Here, we report the complete genome sequence of this organism, which was obtained by combining Oxford Nanopore long-read and Illumina short-read sequencing data.


2021
Vol 12
Author(s):
Davide Bolognini
Alberto Magi

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.


2019
Author(s):
Mark T. W. Ebbert
Tanner D. Jensen
Karen Jansen-West
Jonathon P. Sens
Joseph S. Reddy
...

Abstract Background The human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples), as a proof of principle. Results Based on standard whole-genome Illumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2% respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls. Conclusions While we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample size), we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.

