SpeedSeq: Ultra-fast personal genome analysis and interpretation

2014 ◽  
Author(s):  
Colby Chiang ◽  
Ryan M Layer ◽  
Gregory G Faust ◽  
Michael R Lindberg ◽  
David B Rose ◽  
...  

Comprehensive interpretation of human genome sequencing data is a challenging bioinformatic problem that typically requires weeks of analysis, with extensive hands-on expert involvement. This informatics bottleneck inflates genome sequencing costs, poses a computational burden for large-scale projects, and impedes the adoption of time-critical clinical applications such as personalized cancer profiling and newborn disease diagnosis, where the actionable timeframe may be a matter of hours or days. We developed SpeedSeq, an open-source genome analysis platform that vastly reduces computing time. SpeedSeq accomplishes read alignment, duplicate removal, variant detection, and functional annotation of a 50X human genome in <24 hours, even on a single low-cost server. SpeedSeq offers competitive or superior performance to current methods for detecting germline and somatic single nucleotide variants (SNVs), indels, and structural variants (SVs), and includes novel functionality for SV genotyping, SV annotation, fusion gene detection, and rapid identification of actionable mutations. SpeedSeq will help bring timely genome analysis into the clinical realm.
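
For orientation, the pipeline stages named in the abstract map onto a chain of standard tools (SpeedSeq wraps BWA-MEM for alignment, SAMBLASTER for duplicate marking, and FreeBayes for SNV/indel calling). The Python sketch below strings those stages together; file names, read-group fields, and thread counts are illustrative assumptions, and this is a caricature of such a pipeline, not SpeedSeq itself.

```python
# Illustrative sketch of the stages SpeedSeq chains together: BWA-MEM
# alignment, SAMBLASTER duplicate marking, sorting, then FreeBayes variant
# calling. File names, read-group fields and thread counts are hypothetical;
# this caricatures the pipeline, it is not SpeedSeq itself.
import subprocess

REF = "GRCh38.fa"                      # assumed indexed with `bwa index`
FQ1, FQ2 = "sample_R1.fq.gz", "sample_R2.fq.gz"
THREADS = 16

def run(cmd: str) -> None:
    """Run one shell pipeline stage, failing loudly on error."""
    print(f"[pipeline] {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# 1. Align, mark duplicates in-stream, and coordinate-sort to BAM.
run(
    f"bwa mem -t {THREADS} -R '@RG\\tID:s1\\tSM:sample1' {REF} {FQ1} {FQ2}"
    " | samblaster"
    f" | samtools sort -@ {THREADS} -o sample1.bam -"
    " && samtools index sample1.bam"
)

# 2. Call SNVs and indels.
run(f"freebayes -f {REF} sample1.bam > sample1.vcf")
```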

2020 ◽  
Vol 66 (1) ◽  
pp. 39-52
Author(s):  
Tomoya Tanjo ◽  
Yosuke Kawai ◽  
Katsushi Tokunaga ◽  
Osamu Ogasawara ◽  
Masao Nagasaki

Abstract Studies in human genetics deal with a plethora of human genome sequencing data, generated from specimens as well as available in public repositories. With the proliferation of bioinformatics applications, maintaining research productivity, managing human genome data, and performing downstream analyses are all essential. This review aims to guide researchers in processing and analyzing these large-scale genomic data to extract information relevant to improved downstream analyses. We first discuss worldwide human genome projects whose data can be integrated into one's own analyses. Because obtaining, storing, and processing human whole-genome sequencing data is costly, we then focus on data formats and software developed for manipulating whole-genome sequencing data. Once sequencing is complete and formats and processing tools are selected, a computational platform is required; here we describe a multi-cloud strategy that balances cost, performance, and customizability. Good published research relies on reproducibility to ensure quality results, reusability for application to other datasets, and scalability for future growth of datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines that govern human genomic data analysis, which differ from those for model organisms. Finally, we summarize a perspective on the ideal future of data processing and analysis.
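
Among the key technologies the review covers, workflow engines lend themselves to a miniature demonstration: their core guarantee is that a step is re-executed only when its inputs change. The toy Python sketch below illustrates that idea with content-addressed caching; it is a conceptual illustration under assumed names, not any real engine's implementation.

```python
# Toy illustration of what a workflow engine provides: each step re-runs
# only when its inputs change (content-addressed caching), which underpins
# the reproducibility/reusability argument. Purely illustrative.
import hashlib
import json
import pathlib

CACHE = pathlib.Path(".wf_cache")
CACHE.mkdir(exist_ok=True)

def step(name: str, inputs: dict, fn):
    """Run `fn(inputs)` unless an identical invocation is already cached."""
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode()
    ).hexdigest()
    out = CACHE / key
    if out.exists():                      # cache hit: skip recomputation
        return json.loads(out.read_text())
    result = fn(inputs)
    out.write_text(json.dumps(result))
    return result

# Two dependent steps: "align" feeds "call". Changing the sample re-runs
# both; re-running with identical inputs recomputes nothing.
aln = step("align", {"sample": "NA12878", "ref": "GRCh38"},
           lambda i: {"bam": f"{i['sample']}.bam"})
calls = step("call", {"bam": aln["bam"]},
             lambda i: {"vcf": i["bam"].replace(".bam", ".vcf")})
print(calls)
```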


2019 ◽  
Vol 3 (4) ◽  
pp. 399-409 ◽  
Author(s):  
Brandon Jew ◽  
Jae Hoon Sul

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
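
The QC stage described above is often implemented as hard filters over a VCF. A minimal sketch follows, assuming a hypothetical input file sample1.vcf and illustrative thresholds (QUAL >= 30, INFO/DP >= 10) that are not recommendations from this review.

```python
# Minimal hard-filtering pass over a VCF: keep sites with QUAL >= 30 and
# INFO/DP >= 10, writing survivors to stdout. Thresholds and the input file
# name are illustrative; real pipelines use bcftools, GATK VQSR, etc.
import sys

MIN_QUAL, MIN_DP = 30.0, 10

def depth(info: str) -> int:
    """Extract INFO/DP, treating a missing depth field as failing."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return 0

with open("sample1.vcf") as vcf:                 # hypothetical input
    for line in vcf:
        if line.startswith("#"):                 # pass headers through
            sys.stdout.write(line)
            continue
        cols = line.rstrip("\n").split("\t")
        qual = float(cols[5]) if cols[5] != "." else 0.0
        if qual >= MIN_QUAL and depth(cols[7]) >= MIN_DP:
            sys.stdout.write(line)
```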


2021 ◽  
Author(s):  
Einar Gabbasov ◽  
Miguel Moreno-Molina ◽  
Iñaki Comas ◽  
Maxwell Libbrecht ◽  
Leonid Chindelevitch

Abstract The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. difficile within a single human host, referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detecting mixed infections from whole-genome sequencing (WGS) data, and especially for determining the proportions and identities of the underlying strains, have been limited.

In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded in a rigorous statistical model, SplitStrains not only outperforms existing methods at proportion estimation on both simulated and real M. tuberculosis data, but also successfully determines the identities of the underlying strains.

We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for bacterial pathogen data, and holds the promise of enabling previously inaccessible conclusions to be drawn in public health microbiology.

Author summary: When multiple strains of a pathogenic organism are present in a patient, it may be necessary not only to detect this, but also to identify the individual strains. However, this problem has not yet been solved for bacterial pathogens processed via whole-genome sequencing. In this paper, we propose the SplitStrains algorithm for detecting multiple strains in a sample, identifying their proportions, and inferring their sequences, in the case of Mycobacterium tuberculosis. We test it on both simulated and real data, with encouraging results. We believe that our work opens new horizons in public health microbiology by allowing more precise detection, identification and quantification of multiple infecting strains within a sample.
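
SplitStrains' own statistical model is not reproduced here, but the underlying intuition can be sketched: at strain-discriminating SNPs in a two-strain mixture, alternate-allele read fractions cluster near the minor strain's proportion p or near 1 - p, so a simple binomial-mixture EM recovers p. The Python sketch below is a generic illustration of that idea on simulated counts, not the authors' algorithm; depth, site count, and the true proportion are arbitrary.

```python
# Illustrative two-strain mixture estimation; this is a generic binomial EM,
# NOT the SplitStrains model. At strain-discriminating SNPs the alternate
# allele sits on either the minor strain (alt-read fraction ~ p) or the
# major strain (~ 1 - p); EM over that latent assignment recovers p.
import math
import random

random.seed(0)
TRUE_P = 0.3                      # simulated minor-strain fraction (arbitrary)

# Simulate alt-read counts at 200 discriminating sites, depth 50 each.
sites = []
for _ in range(200):
    freq = TRUE_P if random.random() < 0.5 else 1.0 - TRUE_P
    k = sum(random.random() < freq for _ in range(50))
    sites.append((k, 50))

def binom_logpmf(k: int, n: int, p: float) -> float:
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

p = 0.1                           # initial guess below 0.5 breaks the symmetry
for _ in range(100):
    num = den = 0.0
    for k, n in sites:
        # E-step: responsibility that the alt allele is on the minor strain.
        a = math.exp(binom_logpmf(k, n, p))
        b = math.exp(binom_logpmf(k, n, 1.0 - p))
        r = a / (a + b)
        # M-step sums: minor-strain reads are alt reads (r) or ref reads (1-r).
        num += r * k + (1.0 - r) * (n - k)
        den += n
    p = num / den

print(f"estimated minor-strain proportion: {p:.3f} (simulated truth: {TRUE_P})")
```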


2021 ◽  
Author(s):  
Matthew G Durrant ◽  
Alison Fanton ◽  
Josh Tycko ◽  
Michaela Hinks ◽  
Sita Chandrasekaran ◽  
...  

Recent microbial genome sequencing efforts have revealed a vast reservoir of mobile genetic elements containing integrases that could be useful genome engineering tools. Large serine recombinases (LSRs), such as Bxb1 and PhiC31, are bacteriophage-encoded integrases that can facilitate the insertion of phage DNA into bacterial genomes. However, only a few LSRs have been previously characterized and they have limited efficiency in human cells. Here, we developed a systematic computational discovery workflow that searches across the bacterial tree of life to expand the diversity of known LSRs and their cognate DNA attachment sites by >100-fold. We validated this approach via experimental characterization of LSRs, leading to three classes of LSRs distinguished from one another by their efficiency and specificity. We identify landing pad LSRs that efficiently integrate into native attachment sites in a human cell context, human genome-targeting LSRs with computationally predictable pseudosites, and multi-targeting LSRs that can unidirectionally integrate cargos with similar efficiency and superior specificity to commonly used transposases. LSRs from each category were functionally characterized in human cells, overall achieving up to 7-fold higher plasmid recombination than Bxb1 and genome insertion efficiencies of 40-70% with cargo sizes over 7 kb. Overall, we establish a paradigm for the large-scale discovery of microbial recombinases directly from sequencing data and the reconstruction of their target sites. This strategy provided a rich resource of over 60 experimentally characterized LSRs that can function in human cells and thousands of additional candidates for large-payload genome editing without double-stranded DNA breaks.
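
As a rough illustration of what the first stage of such a discovery workflow involves, the sketch below filters predicted proteins by length and a serine-recombinase-like motif. The cutoff, regex, and input file name are invented for this example; systematic workflows like the one described rely on profile-HMM searches rather than a regex.

```python
# Crude stand-in for the first filtering stage of an LSR discovery workflow:
# keep predicted proteins long enough to be *large* serine recombinases that
# also contain a serine-recombinase-like motif. The length cutoff, motif
# regex, and input file name are invented for illustration.
import re

MIN_LEN = 400                     # assumed minimum length for "large" LSRs
MOTIF = re.compile(r"[LIVM].{2}R[LIVM]SR")   # toy motif, illustration only

def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            else:
                chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

candidates = [
    name
    for name, seq in read_fasta("predicted_proteins.faa")  # hypothetical input
    if len(seq) >= MIN_LEN and MOTIF.search(seq)
]
print(f"{len(candidates)} LSR candidates retained for attachment-site inference")
```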


2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
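
A typical use of such reference materials is benchmarking a caller's output against the truth set. The sketch below does this by exact (chrom, pos, ref, alt) matching and reports precision and recall; file names are placeholders, and production benchmarking uses haplotype-aware tools such as hap.py.

```python
# Minimal benchmarking sketch: compare a query call set against a truth set
# by exact (chrom, pos, ref, alt) matching and report precision/recall.
# File names are placeholders; production benchmarking is haplotype-aware
# (e.g., hap.py), so treat this as the idea only.

def load_calls(path):
    calls = set()
    with open(path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            for allele in alt.split(","):        # split multi-allelic sites
                calls.add((chrom, pos, ref, allele))
    return calls

truth = load_calls("giab_truth.vcf")             # hypothetical paths
query = load_calls("my_calls.vcf")

tp = len(truth & query)
precision = tp / len(query) if query else 0.0
recall = tp / len(truth) if truth else 0.0
print(f"TP={tp}  precision={precision:.4f}  recall={recall:.4f}")
```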


Genes ◽  
2020 ◽  
Vol 11 (12) ◽  
pp. 1444
Author(s):  
Nazeefa Fatima ◽  
Anna Petri ◽  
Ulf Gyllensten ◽  
Lars Feuk ◽  
Adam Ameur

Long-read single-molecule sequencing is increasingly used in human genomics research, as it allows large-scale DNA rearrangements such as structural variants (SVs) to be detected accurately and at high resolution. However, few studies have evaluated the performance of different single-molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, the majority of which overlap with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find higher concordance between ONT and PacBio SVs detected in the same individual than between SVs detected by the same technology in different individuals. Downsampling the PacBio reads to obtain similar coverage levels for all datasets resulted in 17k SVs per individual and improved the overlap with the ONT SVs. Our results suggest that ONT and PacBio have similar performance for SV detection in human whole-genome sequencing data, and that both technologies are feasible for population-scale studies.
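
Concordance between two SV call sets is commonly judged by breakpoint proximity and reciprocal overlap. The sketch below encodes one such matching rule; the 500 bp window and 50% reciprocal-overlap cutoff are common conventions assumed for illustration, not necessarily this study's exact parameters.

```python
# Sketch of an SV concordance check of the kind used to compare ONT and
# PacBio call sets: two calls match if they share type and chromosome,
# their start breakpoints lie within 500 bp, and they have >=50%
# reciprocal overlap. Thresholds are assumed conventions.

def reciprocal_overlap(a_start, a_end, b_start, b_end):
    inter = min(a_end, b_end) - max(a_start, b_start)
    if inter <= 0:
        return 0.0
    return min(inter / (a_end - a_start), inter / (b_end - b_start))

def matches(a, b, max_dist=500, min_ro=0.5):
    return (a["type"] == b["type"] and a["chrom"] == b["chrom"]
            and abs(a["start"] - b["start"]) <= max_dist
            and reciprocal_overlap(a["start"], a["end"],
                                   b["start"], b["end"]) >= min_ro)

# Toy call sets standing in for parsed VCF records.
ont = [{"chrom": "chr1", "start": 10_000, "end": 12_100, "type": "DEL"}]
pacbio = [{"chrom": "chr1", "start": 10_050, "end": 12_000, "type": "DEL"}]

shared = sum(any(matches(a, b) for b in pacbio) for a in ont)
print(f"{shared}/{len(ont)} ONT SVs supported by PacBio")
```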


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Giulio Caravagna ◽  
Guido Sanguinetti ◽  
Trevor A. Graham ◽  
Andrea Sottoriva

Abstract
Background: The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step in determining clonal expansions and their evolutionary trajectories.
Results: In recent work we developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This method integrates, for the first time, an explicit model of the neutral evolutionary forces that participate in clonal expansions; in that work we also showed that our method improves substantially over competing data-driven methods. In this Software paper we present mobster, an open-source R package built around our deconvolution approach, which provides functions to plot data, fit models, assess their confidence, and compute further evolutionary analyses related to subclonal deconvolution.
Conclusions: We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate machine learning and population genetics so as to explicitly model co-existing neutral evolution and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and give an overview of all package functionalities.
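
mobster itself is distributed as an R package; for readers who want the gist of the signal it models, the Python sketch below simulates a variant-allele-frequency spectrum mixing peaked subclonal clusters with a power-law neutral tail. This is an illustration of the mixture shape, not the package's fitting procedure, and the cluster positions and counts are invented.

```python
# Illustration of the signal mobster decomposes (NOT its fitting code): a
# VAF spectrum mixing peaked clusters (clonal ~0.5, a subclone at lower VAF)
# with a power-law neutral tail. Cluster positions and counts are invented.
import random

random.seed(1)
vafs = []
vafs += [random.gauss(0.50, 0.03) for _ in range(300)]  # clonal cluster
vafs += [random.gauss(0.20, 0.02) for _ in range(150)]  # one subclone
# Neutral tail with density ~ 1/f^2 on [0.02, 0.15], via inverse-CDF sampling.
f_min, f_max = 0.02, 0.15
for _ in range(400):
    u = random.random()
    vafs.append(1.0 / (1.0 / f_min - u * (1.0 / f_min - 1.0 / f_max)))

# Coarse text histogram of the spectrum.
bins = [0] * 10
for v in vafs:
    if 0.0 < v < 1.0:
        bins[int(v * 10)] += 1
for i, count in enumerate(bins):
    print(f"VAF {i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * (count // 10)}")
```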

