High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing

2019 ◽  
Author(s):  
Devika Ganesamoorthy ◽  
Mengjia Yan ◽  
Valentine Murigneux ◽  
Chenxi Zhou ◽  
Minh Duc Cao ◽  
...  

Abstract: Tandem repeats (TRs) are highly prone to variation in copy numbers due to their repetitive and unstable nature, which makes them a major source of genomic variation between individuals. However, population variation of TRs has not been widely explored due to the limitations of existing tools, which are either low-throughput or restricted to a small subset of TRs. Here, we used a SureSelect targeted sequencing approach combined with Nanopore sequencing to overcome these limitations. We achieved an average of 3062-fold target enrichment on a panel of 142 TR loci, generating an average of 97X sequence coverage on 7 samples utilizing 2 MinION flow cells with 200 ng of input DNA per sample. We identified a subset of 110 TR loci with length less than 2 kb and GC content greater than 25%, for which we achieved an average genotyping rate of 75%, rising to 91% for the highest-coverage sample. Alleles estimated from targeted long-read sequencing were concordant with gold standard PCR sizing analysis and highly correlated with alleles estimated from whole genome long-read sequencing. We demonstrate a targeted long-read sequencing approach that enables simultaneous analysis of hundreds of TRs with accuracy comparable to PCR sizing analysis. Our approach can feasibly be scaled to more targets and more samples, facilitating large-scale analysis of TRs.

F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 1084
Author(s):  
Devika Ganesamoorthy ◽  
Mengjia Yan ◽  
Valentine Murigneux ◽  
Chenxi Zhou ◽  
Minh Duc Cao ◽  
...  

Background: Tandem repeats (TRs) are highly prone to variation in copy numbers due to their repetitive and unstable nature, which makes them a major source of genomic variation between individuals. However, population variation of TRs has not been widely explored due to the limitations of existing approaches, which are either low-throughput or restricted to a small subset of TRs. Here, we demonstrate a targeted sequencing approach combined with Nanopore sequencing to overcome these limitations. Methods: We selected 142 TR targets and enriched these regions using the Agilent SureSelect target enrichment approach with only 200 ng of input DNA. We barcoded the enriched products and sequenced them on an Oxford Nanopore MinION sequencer. We used VNTRTyper and Tandem-genotypes to genotype TRs from long-read sequencing data. Gold standard PCR sizing analysis was used to validate genotyping results from targeted sequencing data. Results: We achieved an average of 3062-fold target enrichment on a panel of 142 TR loci, generating an average of 97X coverage per sample with 200 ng of input DNA per sample. For targets with length less than 2 kb and GC content greater than 25%, we successfully genotyped an average of 75% of targets, and the genotyping rate increased to 91% for the highest-coverage sample. Alleles estimated from targeted long-read sequencing were concordant with gold standard PCR sizing analysis and highly correlated with alleles estimated from whole genome long-read sequencing. Conclusions: We demonstrate a targeted long-read sequencing approach that enables simultaneous analysis of hundreds of TRs with accuracy comparable to PCR sizing analysis. Our approach can feasibly be scaled to more targets and more samples, facilitating large-scale analysis of TRs.
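The abstract reports fold enrichment without stating how it was computed. A minimal sketch of one common definition follows, assuming fold enrichment is the observed on-target fraction of sequenced bases divided by the fraction of the genome covered by the targets. The function name and every input value are hypothetical and are not taken from the study.

```python
def fold_enrichment(on_target_bases, total_bases, target_size_bp, genome_size_bp):
    """Observed on-target fraction divided by the fraction expected without enrichment.

    Assumed definition for illustration only; the study may have used another one.
    """
    if min(total_bases, target_size_bp, genome_size_bp) <= 0:
        raise ValueError("total_bases, target_size_bp and genome_size_bp must be positive")
    observed_fraction = on_target_bases / total_bases
    expected_fraction = target_size_bp / genome_size_bp
    return observed_fraction / expected_fraction


# Hypothetical numbers: 5% of sequenced bases fall on a 150 kb panel in a 3 Gb genome.
fe = fold_enrichment(
    on_target_bases=5_000_000,
    total_bases=100_000_000,
    target_size_bp=150_000,
    genome_size_bp=3_000_000_000,
)
print(f"fold enrichment: {fe:.0f}")
```

Under this definition, a small on-target fraction can still correspond to a very large fold enrichment when the panel covers only a tiny share of the genome.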


2018 ◽  
Author(s):  
Luisa Berná ◽  
Matías Rodríguez ◽  
María Laura Chiribao ◽  
Adriana Parodi-Talice ◽  
Sebastián Pita ◽  
...  

Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits diverse types of analyses that require a high degree of precision. Long reads generated by third-generation sequencing technologies are particularly suitable to address the challenges associated with T. cruzi's genome since they permit directly determining the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones: the hybrid TCC (DTU TcVI) and the non-hybrid Dm28c (DTU TcI), determined by PacBio SMRT technology. The improved assemblies obtained here permitted us to accurately estimate gene copy numbers and the abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a "core compartment" and a "disruptive compartment" which exhibit opposite gene and GC content composition. New tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of the surface multigene families MUC and trans-sialidases now allows a better overview of these complex groups of genes.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Arne De Roeck ◽  
Wouter De Coster ◽  
Liene Bossaerts ◽  
Rita Cacace ◽  
Tim De Pooter ◽  
...  

Abstract: Technological limitations have hindered the large-scale genetic investigation of tandem repeats in disease. We show that long-read sequencing with a single Oxford Nanopore Technologies PromethION flow cell per individual achieves 30× human genome coverage and enables accurate assessment of tandem repeats including the 10,000-bp Alzheimer’s disease-associated ABCA7 VNTR. The Guppy “flip-flop” base caller and tandem-genotypes tandem repeat caller are efficient for large-scale tandem repeat assessment, but base calling and alignment challenges persist. We present NanoSatellite, which analyzes tandem repeats directly on electric current data and improves calling of GC-rich tandem repeats, expanded alleles, and motif interruptions.


2016 ◽  
Author(s):  
Yuan O Zhu ◽  
Gavin J Sherlock ◽  
Dmitri A Petrov

Budding yeast has undergone several independent transitions from commercial to clinical lifestyles. The frequency of such transitions suggests that clinical yeast strains are derived from environmentally available yeast populations, including commercial sources. However, despite their important role in adaptive evolution, the prevalence of polyploidy and aneuploidy has not been extensively analyzed in clinical strains. In this study, we have looked for patterns governing the transition to clinical invasion in the largest screen of clinical yeast isolates to date. In particular, we have focused on the hypothesis that ploidy changes have influenced adaptive processes. We sequenced 145 yeast strains, 132 of which are clinical isolates. We found pervasive large-scale genomic variation in both overall ploidy (34% of strains identified as 3n/4n) and individual chromosomal copy numbers (36% of strains identified as aneuploid). We also found evidence for the highly dynamic nature of yeast genomes, with 35 strains showing partial chromosomal copy number changes and 8 strains showing multiple independent chromosomal events. Intriguingly, a lineage identified to be baker/commercial derived with a unique damaging mutation in NDC80 was particularly prone to polyploidy, with 83% of its members being triploid or tetraploid. Polyploidy was in turn associated with a >2x increase in aneuploidy rates as compared to other lineages. This dataset provides a rich source of information on the genomics of clinical yeast strains and highlights the potential importance of large-scale genomic copy variation in yeast adaptation.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Paula Moolhuijzen ◽  
Pao Theen See ◽  
Caroline S. Moffat

Abstract. Objectives: The assembly of fungal genomes using short reads is challenged by long repetitive and low-GC regions. However, long-read sequencing technologies, such as PacBio and Oxford Nanopore, are able to overcome many problematic regions, thereby providing an opportunity to improve fragmented genome assemblies derived from short reads only. Here, a necrotrophic fungal pathogen Pyrenophora tritici-repentis (Ptr) isolate 134 (Ptr134), which causes tan spot disease on wheat, was sequenced on a MinION using Oxford Nanopore Technologies (ONT), to improve on a previous Illumina short-read genome assembly and provide a more complete genome resource for pan-genomic analyses of Ptr. Results: The genome of Ptr134 sequenced on a MinION using ONT was assembled into 28 contiguous sequences with a total length of 40.79 Mb and GC content of 50.81%. The long-read assembly provided 6.79 Mb of new sequence and 2846 extra annotated protein coding genes as compared to the previous short-read assembly. This improved genome sequence represents near-complete chromosomes, an important resource for large-scale and pan-genomic comparative analyses.


2019 ◽  
Vol 12 (1) ◽  
Author(s):  
Paula Moolhuijzen ◽  
Pao Theen See ◽  
Caroline S. Moffat

Abstract. Objectives: The necrotrophic fungal pathogen Pyrenophora tritici-repentis (Ptr) is the causal agent of tan spot, a major disease of wheat. We have generated a new genome resource for an Australian Ptr race 1 isolate V1 to support comparative ‘omics analyses. In particular, the V1 Pacific Biosciences (PacBio) long-read sequence assembly was generated to confirm the stability of large-scale genome rearrangements of the Australian race 1 isolate M4 when compared to the North American race 1 isolate Pt-1C-BFP. Results: Over 1.3 million reads were sequenced on a PacBio Sequel single-molecule real-time (SMRT) sequencing cell to yield 11.4 Gb for the genome assembly of V1 (285X coverage), with median and maximum read lengths of 8959 bp and 72,292 bp respectively. The V1 genome was assembled into 33 contiguous sequences with a total length of 40.4 Mb and GC content of 50.44%. A total of 14,050 protein coding genes were predicted and annotated for V1. Of these, 11,519 genes were orthologous to both Pt-1C-BFP and M4. Whole genome alignment of the Australian long-read assemblies (V1 to M4) confirmed previously identified large-scale genome rearrangements between M4 and Pt-1C-BFP and presented small-scale variations, which included a sequence break within a race-specific region for ToxA, a well-known necrotrophic effector gene.


2016 ◽  
Vol 2 ◽  
pp. e94 ◽  
Author(s):  
Gaëtan Benoit ◽  
Pierre Peterlongo ◽  
Mahendra Mariadassou ◽  
Erwan Drezen ◽  
Sophie Schbath ◽  
...  

Background: Large-scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods: These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today’s metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Results: Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques that rely on an all-versus-all sequence alignment strategy or that are based on taxonomic profiling.
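The idea of computing ecological distances from k-mer counts instead of species counts can be illustrated with a toy example. The sketch below is not Simka's implementation and makes no attempt at its parallel counting strategy; it simply counts k-mers in memory and evaluates one quantitative distance (Bray-Curtis) and one qualitative distance (Jaccard on presence/absence). The function names, the reads, and the choice of k = 3 are all illustrative assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Quantitative distance: 1 - 2 * sum(min counts) / (total counts in a and b)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard_distance(a, b):
    """Qualitative distance: based only on presence or absence of each k-mer."""
    union = a.keys() | b.keys()
    if not union:
        raise ValueError("both samples are empty")
    return 1.0 - len(a.keys() & b.keys()) / len(union)


# Toy "samples" with a single short read each (hypothetical data).
sample_a = kmer_counts(["ACGTAC"], k=3)
sample_b = kmer_counts(["ACGTTT"], k=3)
print(bray_curtis(sample_a, sample_b))
print(jaccard_distance(sample_a, sample_b))
```

Here the k-mers play the role that species play in a classical abundance table, so any distance defined on such a table can be reused unchanged.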


2016 ◽  
Author(s):  
Jason D. Merker ◽  
Aaron M. Wenger ◽  
Tam Sneddon ◽  
Megan Grove ◽  
Daryl Waggott ◽  
...  

Abstract: Current clinical genomics assays primarily utilize short-read sequencing (SRS), which offers high throughput, high base accuracy, and low cost per base. SRS has, however, limited ability to evaluate tandem repeats, regions with high [GC] or [AT] content, highly polymorphic regions, highly paralogous regions, and large-scale structural variants. Long-read sequencing (LRS) has complementary strengths and offers a means to discover overlooked genetic variation in patients undiagnosed by SRS. To evaluate LRS, we selected a patient who presented with multiple neoplasia and cardiac myxomata suggestive of Carney complex for whom targeted clinical gene testing and whole genome SRS were negative. Low coverage whole genome LRS was performed on the PacBio Sequel system and structural variants were called, yielding 6,971 deletions and 6,821 insertions > 50 bp. Filtering for variants that are absent in an unrelated control and that overlap a coding exon of a disease gene identified three deletions and three insertions. One of these, a heterozygous 2,184 bp deletion, overlaps the first coding exon of PRKAR1A, which is implicated in autosomal dominant Carney complex. This variant was confirmed by Sanger sequencing and was classified as pathogenic using standard criteria for the interpretation of sequence variants. This first successful application of whole genome LRS to identify a pathogenic variant suggests that LRS has significant potential to identify disease-causing structural variation. We recommend larger studies to evaluate the diagnostic yield of LRS, and the development of a comprehensive catalog of common human structural variation to support future studies.
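The filtering step described in the abstract (keep variants absent from an unrelated control that overlap a coding exon of a disease gene) can be sketched as a simple interval filter. This is an illustration only: it handles deletions but not insertions, it assumes that a variant counts as "seen in the control" when the two calls share at least 50% reciprocal overlap, and all names and coordinates are made up rather than taken from the study.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlap_bp(a, b):
    """Number of bases shared by two intervals (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def seen_in_control(variant, control_calls, min_reciprocal=0.5):
    """Assumed matching rule: reciprocal overlap of at least min_reciprocal."""
    for call in control_calls:
        shared = overlap_bp(variant, call)
        if (shared > 0
                and shared >= min_reciprocal * (variant.end - variant.start)
                and shared >= min_reciprocal * (call.end - call.start)):
            return True
    return False


def candidate_deletions(patient_calls, control_calls, coding_exons):
    """Keep deletions absent from the control that overlap at least one coding exon."""
    return [
        v for v in patient_calls
        if not seen_in_control(v, control_calls)
        and any(overlap_bp(v, exon) > 0 for exon in coding_exons)
    ]


# Hypothetical coordinates, for illustration only.
patient = [Interval("chr1", 1000, 3184), Interval("chr2", 5000, 5200), Interval("chr3", 100, 400)]
control = [Interval("chr2", 5010, 5190)]
exons = [Interval("chr1", 3000, 3100), Interval("chr2", 5050, 5100)]

hits = candidate_deletions(patient, control, exons)
print(hits)
```

In this toy input the chr2 deletion is dropped because the control carries a matching call, and the chr3 deletion is dropped because it touches no exon, leaving only the chr1 deletion.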


2019 ◽  
Author(s):  
Ryther Anderson ◽  
Achay Biong ◽  
Diego Gómez-Gualdrón

Tailoring the structure and chemistry of metal-organic frameworks (MOFs) enables the manipulation of their adsorption properties to suit specific energy and environmental applications. As there are millions of possible MOFs (with tens of thousands already synthesized), molecular simulation, such as grand canonical Monte Carlo (GCMC), has frequently been used to rapidly evaluate the adsorption performance of a large set of MOFs. This allows subsequent experiments to focus only on a small subset of the most promising MOFs. In many instances, however, even molecular simulation becomes prohibitively time-consuming, underscoring the need for alternative screening methods, such as machine learning, to precede molecular simulation efforts. In this study, as a proof of concept, we trained a neural network as the first example of a machine learning model capable of predicting full adsorption isotherms of different molecules not included in the training of the model. To achieve this, we trained our neural network only on alchemical species, represented only by their geometry and force field parameters, and used this neural network to predict the loadings of real adsorbates. We focused on predicting room temperature adsorption of small (one- and two-atom) molecules relevant to chemical separations, namely argon, krypton, xenon, methane, ethane, and nitrogen. However, we also observed surprisingly promising predictions for more complex molecules, whose properties are outside the range spanned by the alchemical adsorbates. Prediction accuracies suitable for large-scale screening were achieved using simple MOF descriptors (e.g. geometric properties and chemical moieties) and adsorbate descriptors (e.g. force field parameters and geometry). Our results illustrate a new philosophy of training that opens the path towards development of machine learning models that can predict the adsorption loading of any new adsorbate at any new operating conditions in any new MOF.


BMC Biology ◽  
2019 ◽  
Vol 17 (1) ◽  
Author(s):  
Amrita Srivathsan ◽  
Emily Hartop ◽  
Jayanthi Puniamoorthy ◽  
Wan Ting Lee ◽  
Sujatha Narayanan Kutty ◽  
...  

Abstract. Background: More than 80% of all animal species remain unknown to science. Most of these species live in the tropics and belong to animal taxa that combine small body size with high specimen abundance and large species richness. For such clades, using morphology for species discovery is slow because large numbers of specimens must be sorted based on detailed microscopic investigations. Fortunately, species discovery could be greatly accelerated if DNA sequences could be used for sorting specimens to species. Morphological verification of such “molecular operational taxonomic units” (mOTUs) could then be based on dissection of a small subset of specimens. However, this approach requires cost-effective and low-tech DNA barcoding techniques because well-equipped, well-funded molecular laboratories are not readily available in many biodiverse countries. Results: We here document how MinION sequencing can be used for large-scale species discovery in a specimen- and species-rich taxon like the hyperdiverse fly family Phoridae (Diptera). We sequenced 7059 specimens collected in a single Malaise trap in Kibale National Park, Uganda, over the short period of 8 weeks. We discovered > 650 species, which exceeds the number of phorid species currently described for the entire Afrotropical region. The barcodes were obtained using an improved low-cost MinION pipeline that increased the barcoding capacity sevenfold from 500 to 3500 barcodes per flowcell. This was achieved by adopting 1D sequencing, resequencing weak amplicons on a used flowcell, and improving demultiplexing. Comparison with Illumina data revealed that the MinION barcodes were very accurate (99.99% accuracy, 0.46% Ns) and thus yielded very similar species units (match ratio 0.991). Morphological examination of 100 mOTUs also confirmed good congruence with morphology (93% of mOTUs; > 99% of specimens) and revealed that 90% of the putative species belong to the neglected, megadiverse genus Megaselia. We demonstrate for one Megaselia species how the molecular data can guide the description of a new species (Megaselia sepsioides sp. nov.). Conclusions: We document that one field site in Africa can be home to an estimated 1000 species of phorids and speculate that the Afrotropical diversity could exceed 200,000 species. We furthermore conclude that low-cost MinION sequencers are very suitable for reliable, rapid, and large-scale species discovery in hyperdiverse taxa. MinION sequencing could quickly reveal the extent of the unknown diversity and is especially suitable for biodiverse countries with limited access to capital-intensive sequencing facilities.

