Mapping-free variant calling using haplotype reconstruction from k-mer frequencies

2017 ◽  
Author(s):  
Peter Audano ◽  
Shashidhar Ravishankar ◽  
Fredrik Vannberg

Abstract
Motivation: The standard protocol for detecting variation in DNA is to map millions of short sequence reads to a known reference and find loci that differ. While this approach works well, it cannot be applied where the sample contains dense variants or is too distant from known references. De novo assembly or hybrid methods can recover genomic variation, but the cost of computation is often much higher. We developed a novel k-mer algorithm and software implementation, Kestrel, capable of characterizing densely packed SNPs and large indels without mapping, assembly, or de Bruijn graphs.
Results: When applied to mosaic penicillin binding protein (PBP) genes in Streptococcus pneumoniae, we found near-perfect concordance with assembled contigs at a fraction of the CPU time. Multilocus sequence typing (MLST) with this approach was able to bypass de novo assemblies. Kestrel has a very low false-positive rate when calling variants over the whole genome, but limitations of a purely k-mer-based approach affect sensitivity.
Availability: Source code and documentation for a Java implementation of Kestrel can be found at https://github.com/paudano/kestrel. All test code for this publication is located at https://github.com/paudano/
Contact: [email protected], [email protected]
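As a rough illustration of the k-mer frequency signal that a mapping-free approach relies on, the sketch below (hypothetical, not Kestrel's actual Java implementation) counts k-mers in the reads and flags reference positions whose k-mers are poorly supported, the kind of dip that marks a candidate variant region.

```python
# Minimal sketch of the k-mer frequency idea behind mapping-free variant
# detection (illustrative only; Kestrel's actual algorithm reconstructs
# haplotypes over active regions and is considerably more involved).
from collections import Counter

def count_kmers(reads, k=31):
    """Count k-mers across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def low_frequency_reference_kmers(reference, counts, k=31, min_depth=3):
    """Return reference positions whose k-mers are poorly supported by the
    sample k-mer counts; such dips mark candidate variant (active) regions."""
    return [i for i in range(len(reference) - k + 1)
            if counts[reference[i:i + k]] < min_depth]

reads = ["ACGTACGTTTGACGTACG", "CGTACGTTTGACGTACGA"]
ref = "ACGTACGTCTGACGTACGA"          # differs from the reads by one SNP
print(low_frequency_reference_kmers(ref, count_kmers(reads, k=8), k=8, min_depth=1))
```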


2018 ◽  
Vol 609 ◽  
pp. A36 ◽  
Author(s):  
Jonathan J. Stott

Aims. My goal is to develop a quantitative algorithm for assessing open cluster membership probabilities. The algorithm is designed to work with single-epoch observations. In its simplest form, only one set of program images and one set of reference images are required. Methods. The algorithm is based on a two-stage joint astrometric and photometric assessment of cluster membership probabilities. The probabilities were computed within a Bayesian framework using any available prior information. Where possible, the algorithm emphasizes simplicity over mathematical sophistication. Results. The algorithm was implemented and tested against three observational fields using published survey data. M 67 and NGC 654 were selected as cluster examples while a third, cluster-free, field was used for the final test data set. The algorithm shows good quantitative agreement with the existing surveys and has a false-positive rate significantly lower than the astrometric or photometric methods used individually.
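A minimal sketch of how astrometric and photometric evidence can be combined with a prior within a Bayesian framework is shown below. The function and numbers are illustrative and assume the two measurements are conditionally independent given membership, which is a simplification of the paper's two-stage model.

```python
# Illustrative Bayesian combination of astrometric and photometric
# membership evidence (a simplified sketch, not the paper's exact model).
def membership_probability(prior_member, like_ast_member, like_ast_field,
                           like_phot_member, like_phot_field):
    """Posterior P(member | astrometry, photometry), assuming the two
    measurements are conditionally independent given membership status."""
    num = prior_member * like_ast_member * like_phot_member
    den = num + (1.0 - prior_member) * like_ast_field * like_phot_field
    return num / den

# Example: a star whose proper motion fits the cluster well but whose
# photometry is only marginally consistent with the cluster sequence.
print(membership_probability(prior_member=0.3,
                             like_ast_member=2.5, like_ast_field=0.4,
                             like_phot_member=0.9, like_phot_field=0.7))
```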



2018 ◽  
Author(s):  
Jack M. Fu ◽  
Elizabeth J. Leslie ◽  
Alan F. Scott ◽  
Jeffrey C. Murray ◽  
Mary L. Marazita ◽  
...  

Abstract
De novo copy number deletions have been implicated in many diseases, but to date there is no formal method that identifies de novo deletions in parent-offspring trios from capture-based sequencing platforms. We developed Minimum Distance for Targeted Sequencing (MDTS) to fill this void. MDTS has similar sensitivity (recall) but a much lower false-positive rate compared to less specific CNV callers, resulting in a much higher positive predictive value (precision). MDTS also exhibits much better scalability and is available as open-source software at github.com/JMF47/MDTS.
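The sketch below illustrates the general idea of a minimum-distance signal for trios, assuming per-bin log2 coverage ratios as input; MDTS itself adds bin selection, normalization, and changepoint segmentation, so treat this only as a conceptual outline.

```python
# Sketch of a "minimum distance" signal for trio-based de novo deletion
# detection (conceptual only; the published method adds bin selection,
# normalization, and segmentation of this signal).
import numpy as np

def minimum_distance(offspring_lrr, father_lrr, mother_lrr):
    """Per-bin signed difference between offspring and the closer parent.
    Strongly negative runs suggest a deletion present in the child but
    absent from both parents (i.e., de novo)."""
    d_f = offspring_lrr - father_lrr
    d_m = offspring_lrr - mother_lrr
    return np.where(np.abs(d_f) < np.abs(d_m), d_f, d_m)

# Toy log2 coverage ratios over 8 bins; the child loses a copy in bins 3-5.
child  = np.array([0.0, 0.1, -0.1, -1.0, -1.1, -0.9, 0.0, 0.1])
father = np.array([0.0, 0.0,  0.0,  0.0,  0.1,  0.0, 0.0, 0.0])
mother = np.array([0.1, 0.0, -0.1,  0.0,  0.0,  0.1, 0.0, 0.0])
print(minimum_distance(child, father, mother))
```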



2019 ◽  
Author(s):  
Jullien M. Flynn ◽  
Robert Hubley ◽  
Clément Goubert ◽  
Jeb Rosen ◽  
Andrew G. Clark ◽  
...  

Abstract
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).
Significance
Genome sequences are being produced for more and more eukaryotic species. The bulk of these genomes is composed of parasitic, self-mobilizing transposable elements (TEs) that play important roles in organismal evolution. Thus there is a pressing need for developing software that can accurately identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package for the curation of reference TE libraries which can be applied to any eukaryotic species. Through several major improvements over the previous version, RepeatModeler2 is able to produce libraries that recapitulate the known composition of three model species with some of the most complex TE landscapes. Thus RepeatModeler2 will greatly enhance the discovery and annotation of TEs in genome sequences.
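A toy version of the benchmark tally implied by the >95% identity and coverage criterion is sketched below; the hit tuples are hypothetical, and the real evaluation derives them from alignments of the de novo consensi against the curated libraries.

```python
# Illustrative benchmark tally in the spirit of the paper's evaluation:
# a de novo consensus "recovers" a curated family if some alignment covers
# >95% of the curated sequence at >95% identity (thresholds from the text;
# the actual pipeline computes these from dedicated alignment tools).
def recovered_families(hits, min_identity=0.95, min_coverage=0.95):
    """hits: iterable of (curated_family, identity, coverage_of_curated)."""
    return {family for family, ident, cov in hits
            if ident > min_identity and cov > min_coverage}

hits = [("Gypsy-12", 0.98, 0.99),   # fully recovered
        ("Copia-3",  0.97, 0.60),   # fragmented consensus: not counted
        ("DNA-hAT1", 0.91, 0.99)]   # too diverged: not counted
print(recovered_families(hits))
```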



1992 ◽  
Vol 4 (3) ◽  
pp. 238-244 ◽  
Author(s):  
Ronald M. Weigel ◽  
William F. Hall ◽  
Gail Scherba ◽  
Arthur M. Siegel ◽  
Edwin C. Hahn ◽  
...  

The diagnostic performance of 2 enzyme-linked immunosorbent assays (gX-T, gX-H) for antibodies to pseudorabies virus (PRV) glycoprotein X (gX) was evaluated using 311 serum samples from a nonvaccinated quarantined herd. When the standardized virus neutralization (VN) test, which uses the Shope strain (VN Shope), was used as the comparative diagnostic standard, the gX-T test had a 7% false-negative rate and a 52% false-positive rate, and the gX-H test had a 19% false-negative rate and a 19% false-positive rate. When the VN test with a Bartha recombinant strain (VN Bartha gIIIKa) was used as the diagnostic standard, the gX-T test had a 9% false-negative rate and a 26% false-positive rate, and the gX-H test had a 24% false-negative rate and an 11% false-positive rate. Thus, the gX-T test was more sensitive and the gX-H test was more specific. Additional diagnostic tests on 79 serum samples from a noninfected herd did not produce false positives for the gX-H test, but there was an 8% false-positive rate for the gX-T test. Previous studies from our laboratory have demonstrated that VN Bartha gIIIKa has higher sensitivity than VN Shope, without losing specificity, and thus is a better comparative diagnostic standard. When a suspect range was added to the gX-T test, using the same criteria as the suspect range for the gX-H test, the false-positive rate of the gX-T test was reduced to 5% when evaluated against VN Bartha gIIIKa in the infected herd and to 1% for the PRV-negative herd. However, 18% of the positive samples were classified as suspect (vs. 8% for the gX-H test). In PRV eradication programs, the cost of false negatives is greater than the cost of false positives; thus, the gX-T diagnostic used in this study is of greater diagnostic value.
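For reference, the false-negative and false-positive rates reported here can be computed as in the following sketch, which compares ELISA calls against a VN comparative standard on hypothetical data.

```python
# Sketch of the false-negative/false-positive rate calculation used to
# compare an ELISA against a virus-neutralization (VN) reference standard.
def error_rates(elisa_results, vn_results):
    """Each input is a list of booleans (True = antibody-positive).
    Returns (false_negative_rate, false_positive_rate) relative to VN."""
    fn = sum(1 for e, v in zip(elisa_results, vn_results) if v and not e)
    fp = sum(1 for e, v in zip(elisa_results, vn_results) if e and not v)
    positives = sum(vn_results)
    negatives = len(vn_results) - positives
    return fn / positives, fp / negatives

elisa = [True, True, False, True, False, False]
vn    = [True, True, True,  False, False, False]
print(error_rates(elisa, vn))   # -> (0.333..., 0.333...)
```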



2021 ◽  
Author(s):  
Tao Jiang ◽  
Martin Buchkovich ◽  
Alison Motsinger-Reif

Abstract
Background: Same-species contamination detection is an important quality control step in genetic data analysis. Due to a scarcity of methods to detect and correct for this quality control issue, same-species contamination is more difficult to detect than cross-species contamination. We introduce a novel machine learning algorithm to detect same-species contamination in next-generation sequencing (NGS) data using a support vector machine (SVM) model. Our approach uniquely detects contamination using variant-calling information stored in variant call format (VCF) files for DNA or RNA. Importantly, it can differentiate between same-species contamination and mixtures of tumor and normal cells. In the first stage, a change-point detection method is used to identify copy number variations (CNVs) and copy number aberrations (CNAs) for filtering. Next, single nucleotide polymorphism (SNP) data are used to test for same-species contamination using an SVM model. Based on the assumption that alternative allele frequencies in NGS follow the beta-binomial distribution, the deviation parameter ρ is estimated by the maximum likelihood method. All features of a radial basis function (RBF) kernel SVM are generated using publicly available or private training data.
Results: We demonstrate our approach in simulation experiments. The datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generate VCF files using variants identified in these data and then evaluate the power and false-positive rate of our approach. Our approach can detect contamination levels as low as 5% with a reasonable false-positive rate. Results on real data show sensitivity above 99.99% and specificity of 90.24%, even in the presence of degraded samples with features similar to those of contaminated samples. We provide an R software implementation of our approach.
Conclusions: Our approach addresses the gap in methods to test for same-species contamination in NGS. Due to its high sensitivity for degraded samples and tumor-normal samples, it represents an important tool that can be applied within the quality control process. Additionally, the user-friendly software has the unique ability to conduct quality control using the VCF format.
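The beta-binomial overdispersion estimate mentioned in the abstract can be sketched as below, assuming a fixed mean allele fraction (for example, 0.5 at heterozygous sites); parameter names and data are illustrative, and the published method's exact parameterization may differ.

```python
# Hedged sketch of fitting a beta-binomial deviation parameter rho by
# maximum likelihood: alternative-allele count k out of depth n at a
# heterozygous site is modeled with mean p and overdispersion rho.
import numpy as np
from scipy.stats import betabinom
from scipy.optimize import minimize_scalar

def fit_rho(alt_counts, depths, p=0.5):
    """Maximum-likelihood overdispersion rho for fixed mean allele fraction p."""
    def neg_log_lik(rho):
        a = p * (1.0 - rho) / rho          # beta-binomial shape parameters
        b = (1.0 - p) * (1.0 - rho) / rho
        return -np.sum(betabinom.logpmf(alt_counts, depths, a, b))
    res = minimize_scalar(neg_log_lik, bounds=(1e-4, 0.5), method="bounded")
    return res.x

alt = np.array([12, 30, 18, 45, 22])
dep = np.array([40, 60, 40, 90, 41])
print(fit_rho(alt, dep))   # larger rho indicates more overdispersion
```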



Author(s):  
David Anderson

Abstract Screening for prohibited items at airports is an example of a multi-layered screening process. Multiple layers of screening – often comprising different technologies with complementary strengths and weaknesses – are combined to create a single screening process. The detection performance of the overall system depends on multiple factors, including the performance of individual layers, the complementarity of different layers, and the decision rule(s) for determining how outputs from individual layers are combined. The aim of this work is to understand and optimise the overall system performance of a multi-layered screening process. Novel aspects include the use of realistic profiles of alarm distributions based on experimental observations and a focus on the influence of correlation/orthogonality amongst the layers of screening. The results show that a cumulative screening architecture can outperform a cascading one, yielding a significant increase in system-level true positive rate for only a modest increase in false positive rate. A cumulative screening process is also more resilient to weaknesses in the individual layers. The performance of a multi-layered screening process using a cascading approach is maximised when the false positives are orthogonal across the different layers and the true positives are correlated. The system-level performance of a cumulative screening process, on the other hand, is maximised when both false positives and true positives are as orthogonal as possible. The cost of ignoring orthogonality between screening layers is explored with some numerical examples. The underlying software model is provided in a Jupyter Notebook as supplementary material.
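A small Monte Carlo sketch of the two combination rules is given below; it treats "cascading" as requiring an alarm at every layer and "cumulative" as alarming if any layer alarms, and induces correlation between layers through a shared latent score. These are modeling assumptions for illustration, not the paper's exact formulation.

```python
# Monte Carlo sketch contrasting two ways of combining two screening layers.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=100_000, threat_rate=0.01, correlation=0.5, threshold=1.5):
    threat = rng.random(n) < threat_rate
    shared = rng.normal(size=n)                       # common component -> correlated layers
    def layer():
        own = rng.normal(size=n)
        score = np.sqrt(correlation) * shared + np.sqrt(1 - correlation) * own
        return score + 2.0 * threat                   # threats shift the score upward
    a1, a2 = layer() > threshold, layer() > threshold
    cascading  = a1 & a2                              # alarm only if every layer alarms
    cumulative = a1 | a2                              # alarm if any layer alarms
    def rates(alarm):
        return alarm[threat].mean(), alarm[~threat].mean()   # (TPR, FPR)
    return rates(cascading), rates(cumulative)

print(simulate())
```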



2019 ◽  
Author(s):  
Harriet Dashnow ◽  
Katrina M. Bell ◽  
Zornitza Stark ◽  
Tiong Y. Tan ◽  
Susan M. White ◽  
...  

Abstract
In the clinical setting, exome sequencing has become standard of care in diagnosing rare genetic disorders; however, many cases remain unsolved. Trio sequencing has been demonstrated to produce a higher diagnostic yield than singleton (proband-only) sequencing. Parental sequencing is especially useful when a disease is suspected to be caused by a de novo variant in the proband, because parental data provide a strong filter for the majority of variants that are shared by the proband and their parents. However, the additional cost of sequencing the parents makes the trio strategy uneconomical for many clinical situations. Two thirds of the sequencing budget is spent on the parents, funds that could instead be used to sequence more probands. For this reason, many clinics are reluctant to sequence parents.
Here we propose a pooled-parent strategy for exome sequencing of individuals with likely de novo disease. In this strategy, DNA from all the parents of a cohort of unrelated probands is pooled together into a single exome capture and sequencing run. Variants called in the proband can then be filtered out if they are also found in the parent pool, resulting in a shorter list of prioritised variants. To evaluate the pooled-parent strategy, we performed a series of simulations by combining reads from individual exomes to imitate sample pooling. We assessed the recall and false-positive rate and investigated the trade-off between pool size and recall rate. We compared the performance of GATK HaplotypeCaller individual and joint calling, and FreeBayes, for genotyping pooled samples. Finally, we applied a pooled-parent strategy to a set of real unsolved cases and showed that the parent pool is a powerful filter that is complementary to other commonly used variant filters such as population variant frequencies.
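The core pooled-parent filter can be sketched in a few lines, as below; real pipelines operate on VCF records and combine this filter with population-frequency and quality filters, so the tuple representation here is purely illustrative.

```python
# Minimal sketch of the pooled-parent filter: a proband variant is kept only
# if it is absent from the variant calls made on the pooled parental sample.
def filter_against_parent_pool(proband_variants, parent_pool_variants):
    """Variants are (chrom, pos, ref, alt) tuples; returns prioritised calls."""
    pool = set(parent_pool_variants)
    return [v for v in proband_variants if v not in pool]

proband = [("chr1", 1_000_123, "A", "G"),   # also in the pool: likely inherited
           ("chr7", 5_432_100, "C", "T")]   # not in the pool: candidate de novo
pool    = [("chr1", 1_000_123, "A", "G")]
print(filter_against_parent_pool(proband, pool))
```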



2018 ◽  
Author(s):  
Joseph D Valencia ◽  
Hani Z Girgis

Abstract
Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and genomic evolution. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None are concurrent or have features to support manual review of new elements. To overcome these limitations, we developed LtrDetector, which uses signal-processing techniques. LtrDetector is easy to install and use. It is not species specific. It utilizes multi-core processors available in personal computers. It is more sensitive than other tools by 14.4%–50.8% while maintaining a low false positive rate on six plant genomes.



2017 ◽  
Author(s):  
Jacob M. Luber ◽  
Braden T. Tierney ◽  
Evan M. Cofer ◽  
Chirag J. Patel ◽  
Aleksandar D. Kostic

Abstract
Across biology we are seeing rapid developments in the scale of data production without a corresponding increase in data analysis capabilities. Here, we present Aether (http://aether.kosticlab.org), an intuitive, easy-to-use, cost-effective, and scalable framework that uses linear programming (LP) to optimally bid on and deploy combinations of underutilized cloud computing resources. Our approach simultaneously minimizes the cost of data analysis while maximizing its efficiency and speed. As a test, we used Aether to de novo assemble 1572 metagenomic samples, a task it completed in only 13 hours with cost savings of approximately 80% relative to comparable methods.
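The flavor of the linear-programming step can be illustrated with a toy allocation problem, as below; instance types, prices, and constraints are made up, and Aether's actual formulation (bidding on spot markets) is more involved and requires integer instance counts.

```python
# Toy linear program in the spirit of bidding on heterogeneous cloud
# instances: minimize total hourly cost subject to aggregate CPU and memory
# requirements (illustrative only; all numbers are hypothetical).
from scipy.optimize import linprog

cost   = [0.12, 0.25, 0.48]        # $/hour for three hypothetical instance types
cpus   = [4,    8,    16]
mem_gb = [16,   32,   128]
need_cpu, need_mem = 256, 1024

res = linprog(c=cost,
              A_ub=[[-c for c in cpus], [-m for m in mem_gb]],   # >= constraints as <=
              b_ub=[-need_cpu, -need_mem],
              bounds=[(0, None)] * 3,
              method="highs")
print(res.x, res.fun)              # fractional instance counts and hourly cost
```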



2018 ◽  
Author(s):  
Alfredo Velasco ◽  
Benjamin T. James ◽  
Vincent D. Wells ◽  
Hani Z. Girgis

Abstract
Simple tandem repeats, microsatellites in particular, have regulatory functions, links to several diseases, and applications in biotechnology. Sequences of thousands of species will be available soon, and there is an immediate need for an accurate tool for detecting microsatellites in these new genomes. Currently available tools have limitations. As a remedy, we propose Look4TRs, the first application of self-supervised hidden Markov models to discovering microsatellites. It adapts itself to the input genomes, balancing high sensitivity and a low false-positive rate. It auto-calibrates itself, freeing the user from adjusting parameters manually and leading to consistent results across different studies. We evaluated Look4TRs on eight genomes. Based on the F-measure, which combines sensitivity and false-positive rate, Look4TRs outperformed TRF and MISA, the most widely used tools, by 106% and 82%, respectively. Look4TRs outperformed the second-best tool, MsDetector or Tantan, by 11%. Look4TRs represents a technical advance in the annotation of microsatellites.
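For context, the sketch below computes the standard F-measure as the harmonic mean of sensitivity and precision on hypothetical counts; the abstract describes its F-measure as combining sensitivity and the false-positive rate, so the exact formula used by Look4TRs may differ.

```python
# Generic sketch of an F-measure style summary of detection performance
# (standard harmonic-mean form; counts are hypothetical).
def f_measure(true_pos, false_pos, false_neg):
    sensitivity = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return 2 * sensitivity * precision / (sensitivity + precision)

print(f_measure(true_pos=900, false_pos=50, false_neg=100))
```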


