SpeedSeq: Ultra-fast personal genome analysis and interpretation

2014 ◽  
Author(s):  
Colby Chiang ◽  
Ryan M Layer ◽  
Gregory G Faust ◽  
Michael R Lindberg ◽  
David B Rose ◽  
...  

Comprehensive interpretation of human genome sequencing data is a challenging bioinformatic problem that typically requires weeks of analysis, with extensive hands-on expert involvement. This informatics bottleneck inflates genome sequencing costs, poses a computational burden for large-scale projects, and impedes the adoption of time-critical clinical applications such as personalized cancer profiling and newborn disease diagnosis, where the actionable timeframe may be a matter of hours or days. We developed SpeedSeq, an open-source genome analysis platform that vastly reduces computing time. SpeedSeq accomplishes read alignment, duplicate removal, variant detection, and functional annotation of a 50X human genome in <24 hours, even on a single low-cost server. SpeedSeq offers competitive or superior performance to current methods for detecting germline and somatic single nucleotide variants (SNVs), indels, and structural variants (SVs), and includes novel functionality for SV genotyping, SV annotation, fusion gene detection, and rapid identification of actionable mutations. SpeedSeq will help bring timely genome analysis into the clinical realm.
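
For orientation, the pipeline stages named in the abstract map onto a chain of standard tools (SpeedSeq wraps BWA-MEM for alignment, SAMBLASTER for duplicate marking, and FreeBayes for SNV/indel calling). The Python sketch below strings those stages together; file names, read-group fields, and thread counts are illustrative assumptions, and this is a caricature of such a pipeline, not SpeedSeq itself.

```python
# Illustrative sketch of the stages SpeedSeq chains together: BWA-MEM
# alignment, SAMBLASTER duplicate marking, sorting, then FreeBayes variant
# calling. File names, read-group fields and thread counts are hypothetical;
# this caricatures the pipeline, it is not SpeedSeq itself.
import subprocess

REF = "GRCh38.fa"                      # assumed indexed with `bwa index`
FQ1, FQ2 = "sample_R1.fq.gz", "sample_R2.fq.gz"
THREADS = 16

def run(cmd: str) -> None:
    """Run one shell pipeline stage, failing loudly on error."""
    print(f"[pipeline] {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# 1. Align, mark duplicates in-stream, and coordinate-sort to BAM.
run(
    f"bwa mem -t {THREADS} -R '@RG\\tID:s1\\tSM:sample1' {REF} {FQ1} {FQ2}"
    " | samblaster"
    f" | samtools sort -@ {THREADS} -o sample1.bam -"
    " && samtools index sample1.bam"
)

# 2. Call SNVs and indels.
run(f"freebayes -f {REF} sample1.bam > sample1.vcf")
```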

2020 ◽  
Vol 66 (1) ◽  
pp. 39-52
Author(s):  
Tomoya Tanjo ◽  
Yosuke Kawai ◽  
Katsushi Tokunaga ◽  
Osamu Ogasawara ◽  
Masao Nagasaki

Abstract Studies in human genetics deal with a plethora of human genome sequencing data, generated from specimens as well as available in public repositories. With the proliferation of bioinformatics applications, maintaining research productivity, managing human genome data, and performing downstream analyses are all essential. This review aims to guide researchers in processing and analyzing these large-scale genomic data to extract information relevant to improved downstream analyses. We first discuss worldwide human genome projects whose data can be integrated into one's own analyses. Because obtaining, storing, and processing human whole-genome sequencing data is costly, we then focus on data formats and software developed for manipulating whole-genome sequencing data. Once sequencing is complete and formats and processing tools are selected, a computational platform is required; here we describe a multi-cloud strategy that balances cost, performance, and customizability. Good published research relies on reproducibility to ensure quality results, reusability for application to other datasets, and scalability for future growth of datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines that govern human genomic data analysis, which differ from those for model organisms. Finally, we summarize a perspective on the ideal future of data processing and analysis.
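
Among the key technologies the review covers, workflow engines lend themselves to a miniature demonstration: their core guarantee is that a step is re-executed only when its inputs change. The toy Python sketch below illustrates that idea with content-addressed caching; it is a conceptual illustration under assumed names, not any real engine's implementation.

```python
# Toy illustration of what a workflow engine provides: each step re-runs
# only when its inputs change (content-addressed caching), which underpins
# the reproducibility/reusability argument. Purely illustrative.
import hashlib
import json
import pathlib

CACHE = pathlib.Path(".wf_cache")
CACHE.mkdir(exist_ok=True)

def step(name: str, inputs: dict, fn):
    """Run `fn(inputs)` unless an identical invocation is already cached."""
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode()
    ).hexdigest()
    out = CACHE / key
    if out.exists():                      # cache hit: skip recomputation
        return json.loads(out.read_text())
    result = fn(inputs)
    out.write_text(json.dumps(result))
    return result

# Two dependent steps: "align" feeds "call". Changing the sample re-runs
# both; re-running with identical inputs recomputes nothing.
aln = step("align", {"sample": "NA12878", "ref": "GRCh38"},
           lambda i: {"bam": f"{i['sample']}.bam"})
calls = step("call", {"bam": aln["bam"]},
             lambda i: {"vcf": i["bam"].replace(".bam", ".vcf")})
print(calls)
```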


2019 ◽  
Vol 3 (4) ◽  
pp. 399-409 ◽  
Author(s):  
Brandon Jew ◽  
Jae Hoon Sul

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
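
The QC stage described above is often implemented as hard filters over a VCF. A minimal sketch follows, assuming a hypothetical input file sample1.vcf and illustrative thresholds (QUAL >= 30, INFO/DP >= 10) that are not recommendations from this review.

```python
# Minimal hard-filtering pass over a VCF: keep sites with QUAL >= 30 and
# INFO/DP >= 10, writing survivors to stdout. Thresholds and the input file
# name are illustrative; real pipelines use bcftools, GATK VQSR, etc.
import sys

MIN_QUAL, MIN_DP = 30.0, 10

def depth(info: str) -> int:
    """Extract INFO/DP, treating a missing depth field as failing."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return 0

with open("sample1.vcf") as vcf:                 # hypothetical input
    for line in vcf:
        if line.startswith("#"):                 # pass headers through
            sys.stdout.write(line)
            continue
        cols = line.rstrip("\n").split("\t")
        qual = float(cols[5]) if cols[5] != "." else 0.0
        if qual >= MIN_QUAL and depth(cols[7]) >= MIN_DP:
            sys.stdout.write(line)
```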


2021 ◽  
Author(s):  
Einar Gabbasov ◽  
Miguel Moreno-Molina ◽  
Iñaki Comas ◽  
Maxwell Libbrecht ◽  
Leonid Chindelevitch

Abstract The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. difficile within a single human host, referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detecting mixed infections from whole-genome sequencing (WGS) data, and especially for determining the proportions and identities of the underlying strains, have been limited.

In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded in a rigorous statistical model, SplitStrains not only outperforms existing methods at proportion estimation on both simulated and real M. tuberculosis data, but also successfully determines the identities of the underlying strains.

We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for bacterial pathogen data, and holds the promise of enabling previously inaccessible conclusions to be drawn in public health microbiology.

Author summary: When multiple strains of a pathogenic organism are present in a patient, it may be necessary not only to detect this, but also to identify the individual strains. However, this problem has not yet been solved for bacterial pathogens processed via whole-genome sequencing. In this paper, we propose the SplitStrains algorithm for detecting multiple strains in a sample, identifying their proportions, and inferring their sequences, in the case of Mycobacterium tuberculosis. We test it on both simulated and real data, with encouraging results. We believe that our work opens new horizons in public health microbiology by allowing more precise detection, identification and quantification of multiple infecting strains within a sample.
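
SplitStrains' own statistical model is not reproduced here, but the underlying intuition can be sketched: at strain-discriminating SNPs in a two-strain mixture, alternate-allele read fractions cluster near the minor strain's proportion p or near 1 - p, so a simple binomial-mixture EM recovers p. The Python sketch below is a generic illustration of that idea on simulated counts, not the authors' algorithm; depth, site count, and the true proportion are arbitrary.

```python
# Illustrative two-strain mixture estimation; this is a generic binomial EM,
# NOT the SplitStrains model. At strain-discriminating SNPs the alternate
# allele sits on either the minor strain (alt-read fraction ~ p) or the
# major strain (~ 1 - p); EM over that latent assignment recovers p.
import math
import random

random.seed(0)
TRUE_P = 0.3                      # simulated minor-strain fraction (arbitrary)

# Simulate alt-read counts at 200 discriminating sites, depth 50 each.
sites = []
for _ in range(200):
    freq = TRUE_P if random.random() < 0.5 else 1.0 - TRUE_P
    k = sum(random.random() < freq for _ in range(50))
    sites.append((k, 50))

def binom_logpmf(k: int, n: int, p: float) -> float:
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

p = 0.1                           # initial guess below 0.5 breaks the symmetry
for _ in range(100):
    num = den = 0.0
    for k, n in sites:
        # E-step: responsibility that the alt allele is on the minor strain.
        a = math.exp(binom_logpmf(k, n, p))
        b = math.exp(binom_logpmf(k, n, 1.0 - p))
        r = a / (a + b)
        # M-step sums: minor-strain reads are alt reads (r) or ref reads (1-r).
        num += r * k + (1.0 - r) * (n - k)
        den += n
    p = num / den

print(f"estimated minor-strain proportion: {p:.3f} (simulated truth: {TRUE_P})")
```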


2021 ◽  
Author(s):  
Matthew G Durrant ◽  
Alison Fanton ◽  
Josh Tycko ◽  
Michaela Hinks ◽  
Sita Chandrasekaran ◽  
...  

Recent microbial genome sequencing efforts have revealed a vast reservoir of mobile genetic elements containing integrases that could be useful genome engineering tools. Large serine recombinases (LSRs), such as Bxb1 and PhiC31, are bacteriophage-encoded integrases that can facilitate the insertion of phage DNA into bacterial genomes. However, only a few LSRs have been previously characterized and they have limited efficiency in human cells. Here, we developed a systematic computational discovery workflow that searches across the bacterial tree of life to expand the diversity of known LSRs and their cognate DNA attachment sites by >100-fold. We validated this approach via experimental characterization of LSRs, leading to three classes of LSRs distinguished from one another by their efficiency and specificity. We identify landing pad LSRs that efficiently integrate into native attachment sites in a human cell context, human genome-targeting LSRs with computationally predictable pseudosites, and multi-targeting LSRs that can unidirectionally integrate cargos with similar efficiency and superior specificity to commonly used transposases. LSRs from each category were functionally characterized in human cells, overall achieving up to 7-fold higher plasmid recombination than Bxb1 and genome insertion efficiencies of 40-70% with cargo sizes over 7 kb. Overall, we establish a paradigm for the large-scale discovery of microbial recombinases directly from sequencing data and the reconstruction of their target sites. This strategy provided a rich resource of over 60 experimentally characterized LSRs that can function in human cells and thousands of additional candidates for large-payload genome editing without double-stranded DNA breaks.
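
As a rough illustration of what the first stage of such a discovery workflow involves, the sketch below filters predicted proteins by length and a serine-recombinase-like motif. The cutoff, regex, and input file name are invented for this example; systematic workflows like the one described rely on profile-HMM searches rather than a regex.

```python
# Crude stand-in for the first filtering stage of an LSR discovery workflow:
# keep predicted proteins long enough to be *large* serine recombinases that
# also contain a serine-recombinase-like motif. The length cutoff, motif
# regex, and input file name are invented for illustration.
import re

MIN_LEN = 400                     # assumed minimum length for "large" LSRs
MOTIF = re.compile(r"[LIVM].{2}R[LIVM]SR")   # toy motif, illustration only

def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            else:
                chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

candidates = [
    name
    for name, seq in read_fasta("predicted_proteins.faa")  # hypothetical input
    if len(seq) >= MIN_LEN and MOTIF.search(seq)
]
print(f"{len(candidates)} LSR candidates retained for attachment-site inference")
```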


2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
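
A typical use of such reference materials is benchmarking a caller's output against the truth set. The sketch below does this by exact (chrom, pos, ref, alt) matching and reports precision and recall; file names are placeholders, and production benchmarking uses haplotype-aware tools such as hap.py.

```python
# Minimal benchmarking sketch: compare a query call set against a truth set
# by exact (chrom, pos, ref, alt) matching and report precision/recall.
# File names are placeholders; production benchmarking is haplotype-aware
# (e.g., hap.py), so treat this as the idea only.

def load_calls(path):
    calls = set()
    with open(path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            for allele in alt.split(","):        # split multi-allelic sites
                calls.add((chrom, pos, ref, allele))
    return calls

truth = load_calls("giab_truth.vcf")             # hypothetical paths
query = load_calls("my_calls.vcf")

tp = len(truth & query)
precision = tp / len(query) if query else 0.0
recall = tp / len(truth) if truth else 0.0
print(f"TP={tp}  precision={precision:.4f}  recall={recall:.4f}")
```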


Genes ◽  
2020 ◽  
Vol 11 (12) ◽  
pp. 1444
Author(s):  
Nazeefa Fatima ◽  
Anna Petri ◽  
Ulf Gyllensten ◽  
Lars Feuk ◽  
Adam Ameur

Long-read single-molecule sequencing is increasingly used in human genomics research, as it allows large-scale DNA rearrangements such as structural variants (SVs) to be detected accurately and at high resolution. However, few studies have evaluated the performance of different single-molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, the majority of which overlap with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find higher concordance between ONT and PacBio SVs detected in the same individual than between SVs detected by the same technology in different individuals. Downsampling the PacBio reads to obtain similar coverage levels for all datasets resulted in 17k SVs per individual and improved the overlap with the ONT SVs. Our results suggest that ONT and PacBio have similar performance for SV detection in human whole-genome sequencing data, and that both technologies are feasible for population-scale studies.
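
Concordance between two SV call sets is commonly judged by breakpoint proximity and reciprocal overlap. The sketch below encodes one such matching rule; the 500 bp window and 50% reciprocal-overlap cutoff are common conventions assumed for illustration, not necessarily this study's exact parameters.

```python
# Sketch of an SV concordance check of the kind used to compare ONT and
# PacBio call sets: two calls match if they share type and chromosome,
# their start breakpoints lie within 500 bp, and they have >=50%
# reciprocal overlap. Thresholds are assumed conventions.

def reciprocal_overlap(a_start, a_end, b_start, b_end):
    inter = min(a_end, b_end) - max(a_start, b_start)
    if inter <= 0:
        return 0.0
    return min(inter / (a_end - a_start), inter / (b_end - b_start))

def matches(a, b, max_dist=500, min_ro=0.5):
    return (a["type"] == b["type"] and a["chrom"] == b["chrom"]
            and abs(a["start"] - b["start"]) <= max_dist
            and reciprocal_overlap(a["start"], a["end"],
                                   b["start"], b["end"]) >= min_ro)

# Toy call sets standing in for parsed VCF records.
ont = [{"chrom": "chr1", "start": 10_000, "end": 12_100, "type": "DEL"}]
pacbio = [{"chrom": "chr1", "start": 10_050, "end": 12_000, "type": "DEL"}]

shared = sum(any(matches(a, b) for b in pacbio) for a in ont)
print(f"{shared}/{len(ont)} ONT SVs supported by PacBio")
```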


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Giulio Caravagna ◽  
Guido Sanguinetti ◽  
Trevor A. Graham ◽  
Andrea Sottoriva

Abstract
Background: The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step in determining clonal expansions and their evolutionary trajectories.
Results: In recent work we developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This method integrates, for the first time, an explicit model of the neutral evolutionary forces that participate in clonal expansions; in that work we also showed that our method improves substantially over competing data-driven methods. In this Software paper we present mobster, an open-source R package built around our deconvolution approach, which provides functions to plot data, fit models, assess their confidence, and compute further evolutionary analyses related to subclonal deconvolution.
Conclusions: We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate machine learning and population genetics so as to explicitly model co-existing neutral evolution and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and give an overview of all package functionalities.
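
mobster itself is distributed as an R package; for readers who want the gist of the signal it models, the Python sketch below simulates a variant-allele-frequency spectrum mixing peaked subclonal clusters with a power-law neutral tail. This is an illustration of the mixture shape, not the package's fitting procedure, and the cluster positions and counts are invented.

```python
# Illustration of the signal mobster decomposes (NOT its fitting code): a
# VAF spectrum mixing peaked clusters (clonal ~0.5, a subclone at lower VAF)
# with a power-law neutral tail. Cluster positions and counts are invented.
import random

random.seed(1)
vafs = []
vafs += [random.gauss(0.50, 0.03) for _ in range(300)]  # clonal cluster
vafs += [random.gauss(0.20, 0.02) for _ in range(150)]  # one subclone
# Neutral tail with density ~ 1/f^2 on [0.02, 0.15], via inverse-CDF sampling.
f_min, f_max = 0.02, 0.15
for _ in range(400):
    u = random.random()
    vafs.append(1.0 / (1.0 / f_min - u * (1.0 / f_min - 1.0 / f_max)))

# Coarse text histogram of the spectrum.
bins = [0] * 10
for v in vafs:
    if 0.0 < v < 1.0:
        bins[int(v * 10)] += 1
for i, count in enumerate(bins):
    print(f"VAF {i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * (count // 10)}")
```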

