SMaSH: Sample matching using SNPs in humans

BMC Genomics ◽  
2019 ◽  
Vol 20 (S12) ◽  
Author(s):  
Maximillian Westphal ◽  
David Frankhouser ◽  
Carmine Sonzone ◽  
Peter G. Shields ◽  
Pearlly Yan ◽  
...  

Abstract Background: Inadvertent sample swaps are a real threat to data quality in any medium- to large-scale omics study. While matches between samples from the same individual can in principle be identified from a few well-characterized single nucleotide polymorphisms (SNPs), omics data types often provide only low to moderate coverage, thus requiring integration of evidence from a large number of SNPs to determine whether two samples derive from the same individual. Methods: We select about six thousand SNPs in the human genome and develop a Bayesian framework that robustly identifies sample matches between next generation sequencing data sets. Results: We validate our approach on a variety of data sets. Most importantly, we show that it can establish identity between different omics data types such as Exome, RNA-Seq, and MethylCap-Seq. We demonstrate how identity detection degrades with sample quality and read coverage, but show that twenty million reads of a fairly low-quality RNA-Seq sample are still sufficient for reliable sample identification. Conclusion: Our tool, SMaSH, is able to identify sample mismatches in next generation sequencing data sets between different sequencing modalities and for low-quality sequencing data.
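To illustrate how evidence from many low-coverage SNPs can be combined, the following is a minimal sketch of a generic Bayesian comparison of two samples. It is not the SMaSH implementation: the binomial read model, the fixed error rate, the Hardy-Weinberg genotype prior, and the function names `genotype_likelihoods` and `log_bayes_factor` are all assumptions made for this example.

```python
from math import comb, log


def genotype_likelihoods(ref, alt, err=0.01):
    """P(read counts | genotype) for 0, 1 and 2 copies of the alt allele.

    Assumption: reads are binomial draws with alt-allele probability
    err, 0.5 or 1 - err depending on the genotype.
    """
    n = ref + alt
    return [comb(n, alt) * q**alt * (1 - q) ** ref for q in (err, 0.5, 1 - err)]


def log_bayes_factor(sample_a, sample_b, alt_freqs, err=0.01):
    """Sum over SNPs of log P(data | same person) / P(data | different people).

    sample_a, sample_b: lists of (ref_count, alt_count), one pair per SNP.
    alt_freqs: population alt-allele frequency per SNP (hypothetical input).
    """
    total = 0.0
    for (ra, aa), (rb, ab), p in zip(sample_a, sample_b, alt_freqs):
        # Hardy-Weinberg prior over the three genotypes (an assumption).
        prior = [(1 - p) ** 2, 2 * p * (1 - p), p**2]
        la = genotype_likelihoods(ra, aa, err)
        lb = genotype_likelihoods(rb, ab, err)
        # Same individual: both samples share one unknown genotype.
        same = sum(w * x * y for w, x, y in zip(prior, la, lb))
        # Different individuals: genotypes are drawn independently.
        diff = sum(w * x for w, x in zip(prior, la)) * sum(
            w * y for w, y in zip(prior, lb)
        )
        total += log(same / diff)
    return total
```

A SNP with no reads in either sample contributes essentially nothing to the sum, so a score of this kind only becomes decisive once enough covered SNPs accumulate, which is the motivation the abstract gives for using thousands of sites.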

Author(s):  
Zeynep Baskurt ◽  
Scott Mastromatteo ◽  
Jiafen Gong ◽  
Richard F Wintle ◽  
Stephen W Scherer ◽  
...  

Abstract Integration of next generation sequencing (NGS) data across different research studies can improve the power of genetic association testing by increasing sample size and can obviate the need for sequencing controls. If differential genotype uncertainty across studies is not accounted for, combining data sets can produce spurious association results. We developed the Variant Integration Kit for NGS (VikNGS), a fast cross-platform software package, to enable aggregation of several data sets for rare and common variant genetic association analysis of quantitative and binary traits with covariate adjustment. VikNGS also includes a graphical user interface, power simulation functionality and data visualization tools. Availability: The VikNGS package can be downloaded at http://www.tcag.ca/tools/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Phillip A. Richmond ◽  
Alice M. Kaye ◽  
Godfrain Jacques Kounkou ◽  
Tamar V. Av-Shalom ◽  
Wyeth W. Wasserman

Abstract Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long-term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach.
FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.
Author Summary: In the past 15 years, next generation sequencing technology has revolutionized our capacity to process and analyze DNA sequencing data. From agriculture to medicine, this technology is enabling a deeper understanding of the blueprint of life. Next generation sequencing data is composed of short sequences of DNA, referred to as “reads”, which are often shorter than 200 base pairs, making them many orders of magnitude smaller than the entirety of a human genome. Gaining insights from this data has typically leveraged a reference-guided mapping approach, where the reads are aligned to a reference genome and then post-processed to gain actionable information such as presence or absence of genomic sequence, or variation between the reference genome and the sequenced sample. Many experts in the field of genomics have concluded that selecting a single, linear reference genome for mapping reads against is limiting, and several current research endeavors are focused on exploring options for improved analysis methods to unlock the full utility of sequencing data. Among these improvements are the usage of sex-matched genomes, population-specific reference genomes, and emergent graph-based reference pan-genomes. However, advanced methods that use raw DNA sequencing data to inform the choice of reference genome and guide the alignment of reads to enriched reference genomes are needed. Here we develop a method termed FlexTyper, which creates a searchable index of the short read data and enables flexible, user-guided queries to provide valuable insights without the need for reference-guided mapping.
We demonstrate the utility of our method by identifying sample ancestry and sex in human whole genome sequencing data, detecting viral pathogen reads in RNA-seq data, African-enriched genome regions absent from the global reference, and HLA alleles that are complex to discern using standard read mapping. We anticipate early adoption of FlexTyper within analysis pipelines as a pre-mapping component, and further envision the bioinformatics and genomics community will leverage the tool for creative uses of sequence queries from unmapped data.
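The "reverse mapping" idea described above, querying indexed reads with k-mers instead of aligning reads to a reference, can be sketched in a few lines. This is a toy illustration under stated assumptions, not FlexTyper's code: a plain in-memory k-mer counter stands in for the FM-index, and the names `index_reads`, `count_query` and `genotype_snp`, the flank-based query layout, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_reads(reads, k):
    """Count every k-mer in the unmapped reads (stand-in for an FM-index)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i : i + k]] += 1
    return counts


def count_query(index, kmer):
    """Occurrences of a query k-mer on either strand."""
    rc = revcomp(kmer)
    # A palindromic k-mer equals its reverse complement; count it once.
    return index[kmer] if rc == kmer else index[kmer] + index[rc]


def genotype_snp(index, left, right, ref, alt, min_count=1):
    """Call a SNP from k-mer counts; len(left) + 1 + len(right) must equal k."""
    ref_n = count_query(index, left + ref + right)
    alt_n = count_query(index, left + alt + right)
    has_ref, has_alt = ref_n >= min_count, alt_n >= min_count
    if has_ref and has_alt:
        call = "het"
    elif has_ref:
        call = "hom_ref"
    elif has_alt:
        call = "hom_alt"
    else:
        call = "no_call"
    return call, ref_n, alt_n
```

The point of the design is that the reads are indexed once and can then be queried with any set of target sequences (SNP sites, pathogen k-mers, population-specific contigs) without a fresh alignment. A real tool would also need to handle sequencing errors and k-mers that are not unique in the genome, which this sketch ignores.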


2016 ◽  
Vol 79 (4) ◽  
pp. 574-581 ◽  
Author(s):  
TRENNA BLAGDEN ◽  
WILLIAM SCHNEIDER ◽  
ULRICH MELCHER ◽  
JON DANIELS ◽  
JACQUELINE FLETCHER

ABSTRACT The Centers for Disease Control and Prevention recently emphasized the need for enhanced technologies to use in investigations of outbreaks of foodborne illnesses. To address this need, e-probe diagnostic nucleic acid analysis (EDNA) was adapted and validated as a tool for the rapid, effective identification and characterization of multiple pathogens in a food matrix. In EDNA, unassembled next generation sequencing data sets from food sample metagenomes are queried using pathogen-specific sequences known as electronic probes (e-probes). In this study, the query of mock sequence databases demonstrated the potential of EDNA for the detection of foodborne pathogens. The method was then validated using next generation sequencing data sets created by sequencing the metagenome of alfalfa sprouts inoculated with Escherichia coli O157:H7. Nonspecific hits in the negative control sample indicated the need for additional filtration of the e-probes to enhance specificity. There was no significant difference in the ability of an e-probe to detect the target pathogen based upon the length of the probe set oligonucleotides. The results from the queries of the sample database using E. coli e-probe sets were significantly different from those obtained using random decoy probe sets and exhibited 100% precision. The results support the use of EDNA as a rapid response methodology in foodborne outbreaks and investigations for establishing comprehensive microbial profiles of complex food samples.
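The comparison between pathogen-specific e-probes and random decoy probes can be illustrated with a small sketch. It assumes exact substring matching as a simplification of whatever sequence search the study actually used, and the function name `probe_hit_fraction` and the toy sequences are hypothetical.

```python
def probe_hit_fraction(probes, reads):
    """Fraction of probes found as an exact substring of at least one read.

    Assumption: exact matching only; a real query of unassembled reads
    would normally tolerate mismatches.
    """
    if not probes:
        return 0.0
    hits = sum(any(probe in read for read in reads) for probe in probes)
    return hits / len(probes)


# Toy data (hypothetical): two reads, two target probes, two decoy probes.
reads = ["AAACCCGGGTTT", "TTTGGGAAACCC"]
target_probes = ["CCCGGG", "GGGAAA"]
decoy_probes = ["ACACAC", "GTGTGT"]

target_rate = probe_hit_fraction(target_probes, reads)
decoy_rate = probe_hit_fraction(decoy_probes, reads)
```

A target hit rate well above the decoy hit rate is the kind of signal the abstract describes, while hits among decoys or in a negative control would point to probes that need further filtering.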


2011 ◽  
Vol 40 (D1) ◽  
pp. D720-D728 ◽  
Author(s):  
J. Martin ◽  
S. Abubucker ◽  
E. Heizer ◽  
C. M. Taylor ◽  
M. Mitreva
