A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-medoids Clustering

2018 ◽  
Author(s):  
Matthew G. Johnson ◽  
Lisa Pokorny ◽  
Steven Dodsworth ◽  
Laura R. Botigue ◽  
Robyn S. Cowan ◽  
...  

Abstract Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost associated with developing targeted sequencing approaches is preliminary data needed for identifying orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm lineage. To maximize the phylogenetic potential of the probes while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, five to 15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order lineages of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set. Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order lineages, including all angiosperms.
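
The k-medoids selection of representative sequences described in this abstract can be illustrated with a minimal sketch. It assumes a precomputed pairwise distance matrix (for example, p-distances from one gene alignment) in which distinct sequences have positive distance, and it uses a simple alternating heuristic rather than the authors' actual implementation. The function name, the toy matrix, and the value of k are hypothetical.

```python
import random

def k_medoids(dist, k, max_iter=100, seed=0):
    """Pick k medoid indices from a symmetric distance matrix (list of lists).

    Alternating heuristic: assign every item to its nearest medoid, then move
    each medoid to the cluster member with the smallest total distance to the
    other members. Illustrative only; not the code used in the paper.
    """
    n = len(dist)
    rng = random.Random(seed)
    medoids = sorted(rng.sample(range(n), k))
    for _ in range(max_iter):
        # Assignment step: nearest medoid for every item.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: most central member of each cluster
        # (an empty cluster keeps its current medoid).
        new_medoids = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

# Toy example: six "sequences" placed on a line, forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
representatives = k_medoids(toy_dist, k=2)  # indices of the chosen representatives
```

In a probe-design setting, the sequences at the returned indices would be the ones kept to represent the locus; how k is chosen per locus is not specified here and would need to follow the paper.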

Chromosoma ◽  
2021 ◽  
Vol 130 (1) ◽  
pp. 15-25
Author(s):  
Phuong T. N. Hoang ◽  
Jean-Marie Rouillard ◽  
Jiří Macas ◽  
Ivona Kubalová ◽  
Veit Schubert ◽  
...  

Abstract Duckweeds represent a small, free-floating aquatic family (Lemnaceae) of the monocot order Alismatales with the fastest growth rate among flowering plants. They comprise five genera (Spirodela, Landoltia, Lemna, Wolffiella, and Wolffia) varying in genome size and chromosome number. Spirodela polyrhiza had the first sequenced duckweed genome. Cytogenetic maps are available for both species of the genus Spirodela (S. polyrhiza and S. intermedia). However, elucidation of chromosome homeology and evolutionary chromosome rearrangements by cross-FISH using Spirodela BAC probes to species of other duckweed genera has not been successful so far. We investigated the potential of chromosome-specific oligo-FISH probes to address these topics. We designed oligo-FISH probes specific for one S. intermedia and one S. polyrhiza chromosome (Fig. 1a). Our results show that these oligo-probes cross-hybridize with the homeologous regions of the other congeneric species, but are not suitable to uncover chromosomal homeology across duckweed genera. This is most likely due to insufficient sequence similarity between the investigated genera and/or insufficient probe density on the target genomes. Finally, we suggest genus-specific design of oligo-probes to elucidate chromosome evolution across duckweed genera.


2018 ◽  
Vol 68 (4) ◽  
pp. 594-606 ◽  
Author(s):  
Matthew G Johnson ◽  
Lisa Pokorny ◽  
Steven Dodsworth ◽  
Laura R Botigué ◽  
Robyn S Cowan ◽  
...  

1999 ◽  
Vol 43 (6) ◽  
pp. 1500-1502 ◽  
Author(s):  
Sunwen Chou ◽  
Nell S. Lurain ◽  
Adriana Weinberg ◽  
Guang-Yung Cai ◽  
Prem L. Sharma ◽  
...  

ABSTRACT The polymerase (pol) coding sequence was determined for 40 independent clinical cytomegalovirus isolates sensitive to ganciclovir and foscarnet. Sequence alignments showed >98% interstrain homology and amino acid variation in only 4% of the 1,237 codons. Almost all variation occurred outside of conserved functional domains where resistance mutations have been identified.


2004 ◽  
Vol 17 (2) ◽  
pp. 145 ◽  
Author(s):  
Randall L. Small ◽  
Richard C. Cronn ◽  
Jonathan F. Wendel

Molecular data have had a profound impact on the field of plant systematics, and the application of DNA-sequence data to phylogenetic problems is now routine. The majority of data used in plant molecular phylogenetic studies derives from chloroplast DNA and nuclear rDNA, while the use of low-copy nuclear genes has not been widely adopted. This is due, at least in part, to the greater difficulty of isolating and characterising low-copy nuclear genes relative to chloroplast and rDNA sequences that are readily amplified with universal primers. The higher level of sequence variation characteristic of low-copy nuclear genes, however, often compensates for the experimental effort required to obtain them. In this review, we briefly discuss the strengths and limitations of chloroplast and rDNA sequences, and then focus our attention on the use of low-copy nuclear sequences. Advantages of low-copy nuclear sequences include a higher rate of evolution than for organellar sequences, the potential to accumulate datasets from multiple unlinked loci, and bi-parental inheritance. Challenges intrinsic to the use of low-copy nuclear sequences include distinguishing orthologous loci from divergent paralogous loci in the same gene family, being mindful of the complications arising from concerted evolution or recombination among paralogous sequences, and the presence of intraspecific, intrapopulational and intraindividual polymorphism. Finally, we provide a detailed protocol for the isolation, characterisation and use of low-copy nuclear sequences for phylogenetic studies.


2017 ◽  
Author(s):  
Ryan K Schott ◽  
Bhawandeep Panesar ◽  
Daren C Card ◽  
Matthew Preston ◽  
Todd A Castoe ◽  
...  

Abstract Despite continued advances in sequencing technologies, there is a need for methods that can efficiently sequence large numbers of genes from diverse species. One approach to accomplish this is targeted capture (hybrid enrichment). While these methods are well established for genome resequencing projects, cross-species capture strategies are still being developed and generally focus on the capture of conserved regions, rather than complete coding regions from specific genes of interest. The resulting data is thus useful for phylogenetic studies, but the wealth of comparative data that could be used for evolutionary and functional studies is lost. Here we design and implement a targeted capture method that enables recovery of complete coding regions across broad taxonomic scales. Capture probes were designed from multiple reference species and extensively tiled in order to facilitate cross-species capture. Using novel bioinformatics pipelines we were able to recover nearly all of the targeted genes with high completeness from species that were up to 200 myr divergent. Increased probe diversity and tiling for a subset of genes had a large positive effect on both recovery and completeness. The resulting data produced an accurate species tree, but importantly this same data can also be applied to studies of molecular evolution and function that will allow researchers to ask larger questions in broader phylogenetic contexts. Our method demonstrates the utility of cross-species approaches for the capture of full length coding sequences, and will substantially improve the ability for researchers to conduct large-scale comparative studies of molecular evolution and function.
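
The idea of tiling probes across coding sequences from several reference species can be sketched as follows. The probe length (120 nt), the step size, and both function names are illustrative assumptions; the abstract does not report these parameters.

```python
def tile_probes(seq, probe_len=120, step=60):
    """Return overlapping probes covering seq.

    The final probe is anchored to the 3' end so that no bases are left
    uncovered. Sequences no longer than probe_len yield a single probe.
    """
    if len(seq) <= probe_len:
        return [seq]
    last = len(seq) - probe_len
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)
    return [seq[s:s + probe_len] for s in starts]

def probe_set(references, probe_len=120, step=60):
    """Tile every reference sequence and drop exact duplicate probes.

    references maps a (hypothetical) species name to the coding sequence
    of one target gene in that species.
    """
    unique = {}
    for species, seq in references.items():
        for probe in tile_probes(seq.upper(), probe_len, step):
            unique.setdefault(probe, species)  # remember first source species
    return list(unique)
```

A smaller step gives denser tiling (more probes per base), which corresponds to the "increased tiling" the abstract associates with better recovery, at the price of a larger probe set.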


2019 ◽  
Author(s):  
Alex Dornburg ◽  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder

Abstract Background: Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools that help phylogeneticists efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results: We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions: TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.
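
Reassembling alignments under a user-defined missing-data threshold can be illustrated with a short sketch. It is written in Python for illustration and does not use or reproduce TOAST's R interface. The data layout (locus name mapped to a taxon-to-sequence dictionary), the function names, and the default threshold are all hypothetical.

```python
MISSING_CHARS = set("-?Nn")

def is_missing(seq):
    """Treat a taxon as missing if it has no sequence or only gap/ambiguity characters."""
    return seq is None or all(ch in MISSING_CHARS for ch in seq)

def filter_loci(loci, max_missing=0.5):
    """Keep loci whose fraction of missing taxa does not exceed max_missing.

    loci: dict mapping locus name -> dict mapping taxon name -> sequence (or None).
    """
    kept = {}
    for locus, alignment in loci.items():
        if not alignment:
            continue  # skip loci with no taxa at all
        n_missing = sum(is_missing(seq) for seq in alignment.values())
        if n_missing / len(alignment) <= max_missing:
            kept[locus] = alignment
    return kept
```

Running the filter at several thresholds and comparing the resulting trees is one simple way to examine how sensitive an analysis is to missing data.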


2017 ◽  
Author(s):  
Minh Duc Cao ◽  
Devika Ganesamoorthy ◽  
Lachlan J.M. Coin

Abstract Motivation: Targeted sequencing using capture probes has become increasingly popular in clinical applications due to its scalability and cost-effectiveness. The approach also allows for higher sequencing coverage of the targeted regions, resulting in greater statistical power for analysis. However, because of the dynamics of the hybridisation process, it is difficult to evaluate the efficiency of the probe design prior to the experiments, which are time consuming and costly. Results: We developed CapSim, a software package for simulation of targeted sequencing. Given a genome sequence and a set of probes, CapSim simulates the fragmentation, the dynamics of probe hybridisation, and the sequencing of the captured fragments on Illumina and PacBio sequencing platforms. The simulated data can be used for evaluating the performance of the analysis pipeline, as well as the efficiency of the probe design. Parameters of the various stages in the sequencing process can also be evaluated in order to optimise the efficacy of the experiments. Availability: CapSim is publicly available under BSD license at https://github.com/mdcao/capsim.


Blood ◽  
2014 ◽  
Vol 124 (21) ◽  
pp. 5042-5042
Author(s):  
Patricia Severino ◽  
Liliane Santana Oliveira ◽  
Natalia Torres ◽  
Joao Carlos Guerra ◽  
Nelson Hamerschlak ◽  
...  

Abstract Hemophilia A, B, and von Willebrand disease account for more than 90% of all inherited bleeding disorders associated with coagulation factor deficiencies. Symptoms between these deficiencies may vary greatly and yet are often phenotypically similar. Bleeding episodes can range from mild to severe, at times with life threatening hemorrhages. Currently, biochemical assays are performed to assess the function of each coagulation factor, but diagnosis remains cumbersome and prone to multiple sources of variability between laboratories. Genetic evaluation allows for the examination of multiple coagulation factor genes simultaneously and may quickly identify possible causes of the disease. Additionally, genetic testing should be more reproducible and readily comparable between clinical laboratories. In this work we evaluate the potential use of targeted sequencing of three coagulation factor genes (F8, F9 and VWF) for the concurrent diagnosis and characterization of hemophilia A, B, and von Willebrand disease samples. For targeted DNA sequencing we selected specific DNA probes using genomic coordinates spanning the complete intronic and exonic regions of the three genes, as well as flanking gene sequences. Eleven hemophilia A samples and four hemophilia B samples, clinically characterized and subjected to Sanger sequencing of the F8 and F9 coding regions, respectively, were included in this study. Our results indicate that even though DNA quality may be ideal for traditional DNA sequencing, enrichment techniques require more intact fragments, as reflected by variations in sequencing coverage between samples: quadruplicate results per sample showed 100X coverage varying from 80% of sequenced regions to less than 20%. Point substitutions found in the F9 gene by Sanger sequencing were confirmed by targeted sequencing, but results for the F8 gene were less satisfactory, in agreement with probe design limitations at this point. Of interest for hemophilia A patients, four samples possessed, in addition to the alterations in F8, point mutations in VWF. Probe design and sequencing parameters did not allow for the identification of F8 intron 1 and intron 22 inversions, frequent alterations in hemophilia A, but optimization procedures are currently underway. We conclude that a targeted sequencing approach may be a viable and more complete solution for the diagnosis and management of hemophilia A, B and von Willebrand disease. Disclosures: No relevant conflicts of interest to declare.


2018 ◽  
Author(s):  
Chang Xu ◽  
Xiujing Gu ◽  
Raghavendra Padmanabhan ◽  
Zhong Wu ◽  
Quan Peng ◽  
...  

Abstract Motivation: Low-frequency DNA mutations are often confounded with technical artifacts from sample preparation and sequencing. With unique molecular identifiers (UMIs), most of the sequencing errors can be corrected. However, errors before UMI tagging, such as DNA polymerase errors during end-repair and the first PCR cycle, cannot be corrected with single-strand UMIs and impose fundamental limits on UMI-based variant calling. Results: We developed smCounter2, a UMI-based variant caller for targeted sequencing data and an upgrade from the current version of smCounter. Compared to smCounter, smCounter2 features a lower detection limit of 0.5%, better overall accuracy (particularly in non-coding regions), a consistent threshold that can be applied to both deep and shallow sequencing runs, and easier use via a Docker image and code for read pre-processing. We benchmarked smCounter2 against several state-of-the-art UMI-based variant calling methods using multiple datasets and demonstrated smCounter2’s superior performance in detecting somatic variants. At the core of smCounter2 is a statistical test to determine whether the allele frequency of the putative variant is significantly above the background error rate, which was carefully modeled using an independent dataset. The improved accuracy in non-coding regions was mainly achieved using novel repetitive region filters that were specifically designed for UMI data. Availability: The entire pipeline is available at https://github.com/qiaseq/qiaseq-dna under MIT license.
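
The core idea of testing whether a putative variant's allele frequency is significantly above the background error rate can be illustrated with a one-sided binomial test on UMI counts. This is a simplified stand-in, not smCounter2's actual statistical model. The fixed background error rate, the significance threshold, and the function names are hypothetical.

```python
from math import exp, lgamma, log, log1p

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1.

    Each term is computed in log space so that large n does not overflow.
    """
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log1p(-p))
    return min(1.0, sum(exp(log_pmf(i)) for i in range(max(k, 0), n + 1)))

def is_variant(alt_umis, total_umis, error_rate=1e-3, alpha=1e-6):
    """Call a variant if seeing alt_umis or more supporting UMIs is very
    unlikely under the assumed background error rate alone."""
    return binom_tail(alt_umis, total_umis, error_rate) < alpha
```

With these assumed settings, 20 supporting UMIs out of 2,000 would be called, whereas 2 out of 2,000 would not, since about two erroneous UMIs are expected from background alone. A real caller would estimate the error rate per site or per substitution type rather than use one constant.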

