Cluster-Based SNP Calling on Large-Scale Genome Sequencing Data

Author(s):  
Mucahid Kutlu ◽  
Gagan Agrawal


2019 ◽  
Vol 3 (4) ◽  
pp. 399-409 ◽  
Author(s):  
Brandon Jew ◽  
Jae Hoon Sul

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
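A common first-pass QC statistic for variant calls is the transition/transversion (Ts/Tv) ratio, which for human whole-genome SNVs is expected to be roughly 2.0-2.1; a markedly lower value suggests an excess of false-positive calls. Below is a minimal Python sketch of this kind of check, assuming a plain uncompressed VCF and simplified hand-rolled parsing (production pipelines would use tools such as bcftools stats instead):

```python
# Minimal sketch: compute the transition/transversion (Ts/Tv) ratio
# from a VCF of biallelic SNVs as a basic variant-level QC metric.
# Assumes an uncompressed VCF; real pipelines use dedicated tools
# (e.g. bcftools stats) rather than hand-rolled parsing.

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(vcf_path: str) -> float:
    ts = tv = 0
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            ref, alt = fields[3], fields[4]
            if len(ref) != 1 or len(alt) != 1:
                continue  # keep only biallelic SNVs
            if (ref, alt) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
    return ts / tv if tv else float("nan")

# A genome-wide Ts/Tv far below ~2.0 often signals an excess of
# false positives and warrants stricter filtering.
print(f"Ts/Tv = {ts_tv_ratio('sample.vcf'):.2f}")
```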


Genes ◽  
2020 ◽  
Vol 11 (12) ◽  
pp. 1444
Author(s):  
Nazeefa Fatima ◽  
Anna Petri ◽  
Ulf Gyllensten ◽  
Lars Feuk ◽  
Adam Ameur

Long-read single-molecule sequencing is increasingly used in human genomics research, as it allows accurate, high-resolution detection of large-scale DNA rearrangements such as structural variants (SVs). However, few studies have evaluated the performance of different single-molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we found higher concordance between ONT and PacBio SVs detected in the same individual than between SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved the overlap with the ONT SVs. Our results suggest that ONT and PacBio perform similarly for SV detection in human whole-genome sequencing data, and that both technologies are feasible for population-scale studies.
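Cross-platform SV concordance is typically assessed by matching calls of the same type whose breakpoints fall within some distance tolerance; dedicated tools such as SURVIVOR or truvari implement more refined schemes. A minimal sketch of breakpoint-proximity matching, with the 500 bp tolerance and the example calls chosen purely for illustration:

```python
# Minimal sketch: concordance between two SV call sets, matching
# calls of the same type on the same chromosome whose start
# positions lie within a fixed breakpoint tolerance. Real tools
# (e.g. SURVIVOR, truvari) add reciprocal-overlap and size checks.

from typing import NamedTuple

class SV(NamedTuple):
    chrom: str
    pos: int      # start breakpoint
    svtype: str   # DEL, INS, DUP, INV, ...

def concordance(calls_a: list[SV], calls_b: list[SV], tol: int = 500) -> float:
    """Fraction of calls in calls_a with a same-type match in calls_b."""
    matched = 0
    for a in calls_a:
        if any(b.chrom == a.chrom and b.svtype == a.svtype
               and abs(b.pos - a.pos) <= tol for b in calls_b):
            matched += 1
    return matched / len(calls_a) if calls_a else 0.0

ont = [SV("chr1", 10_000, "DEL"), SV("chr2", 55_300, "INS")]
pacbio = [SV("chr1", 10_120, "DEL"), SV("chr3", 9_000, "INV")]
print(f"ONT calls recovered in PacBio: {concordance(ont, pacbio):.0%}")
```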


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Giulio Caravagna ◽  
Guido Sanguinetti ◽  
Trevor A. Graham ◽  
Andrea Sottoriva

Abstract
Background The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step in determining clonal expansions and their evolutionary trajectories.
Results In recent work we developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This method integrates, for the first time, an explicit model of the neutral evolutionary forces that participate in clonal expansions; in that work we also showed that our method improves substantially over competing data-driven methods. In this Software paper we present mobster, an open-source R package built around our deconvolution approach, which provides functions to plot data, fit models, assess their confidence, and compute further evolutionary analyses related to subclonal deconvolution.
Conclusions We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate machine learning and population genetics in a way that explicitly models co-existing neutral evolution and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and give an overview of all package functionalities.
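The neutral model underlying this style of deconvolution predicts that the cumulative number of subclonal mutations with allele frequency above f grows linearly in 1/f (a power-law tail), while positively selected subclones add extra clusters at characteristic frequencies. mobster itself is an R package; the Python fragment below is only a toy illustration of the neutral-tail prediction, not its API, and all parameter values are invented for the example:

```python
# Minimal sketch: under neutral evolution the cumulative count of
# subclonal mutations above frequency f is expected to satisfy
# M(f) = mu/beta * (1/f - 1/f_max), i.e. linear in 1/f. We simulate
# a neutral VAF tail (density ~ 1/f^2) and check how well a straight
# line in 1/f explains it. mobster fits a full mixture (power-law
# tail + Beta clusters); this fragment only shows the neutral part.

import numpy as np

rng = np.random.default_rng(0)

# Inverse-transform sampling from density ~ 1/f^2 on [f_min, f_max].
f_min, f_max = 0.05, 0.25
u = rng.uniform(size=5000)
vaf = 1.0 / (1.0 / f_min - u * (1.0 / f_min - 1.0 / f_max))

grid = np.linspace(f_min, f_max, 50)
cum = np.array([(vaf >= f).sum() for f in grid])  # M(f)

# Least-squares fit of M(f) against 1/f; R^2 near 1 supports neutrality.
slope, intercept = np.polyfit(1.0 / grid, cum, 1)
pred = slope / grid + intercept
r2 = 1 - ((cum - pred) ** 2).sum() / ((cum - cum.mean()) ** 2).sum()
print(f"R^2 of M(f) vs 1/f: {r2:.4f}")  # ~1.0 for a neutral simulation
```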


2013 ◽  
Vol 59 (1) ◽  
pp. 127-137 ◽  
Author(s):  
Nardin Samuel ◽  
Thomas J Hudson

BACKGROUND Sequencing of cancer genomes has become a pivotal method for uncovering and understanding the deregulated cellular processes that drive tumor initiation and progression. Whole-genome sequencing is becoming less costly and more feasible at large scale; consequently, thousands of tumors are being analyzed with these technologies. Interpreting these data in the context of tumor complexity poses a challenge for cancer genomics.
CONTENT The sequencing of large numbers of tumors has revealed novel insights into oncogenic mechanisms. In particular, we highlight the remarkable insight into the pathogenesis of breast cancers that has been gained through comprehensive and integrated sequencing analysis. The analysis and interpretation of sequencing data, however, must be considered in the context of heterogeneity within and among tumor samples. Only by adequately accounting for the underlying complexity of cancer genomes will the potential of genome sequencing be understood and subsequently translated into improved patient management.
SUMMARY The paradigm of personalized medicine holds promise if patient tumors are thoroughly studied as unique and heterogeneous entities and clinical decisions are made accordingly. The associated challenges will be eased by continued collaborative efforts among research centers to coordinate the sharing of mutation, intervention, and outcome data, supporting both the interpretation of genomic data and clinical decision-making.


BMC Genomics ◽  
2013 ◽  
Vol 14 (1) ◽  
pp. 425 ◽  
Author(s):  
Shanrong Zhao ◽  
Kurt Prenger ◽  
Lance Smith ◽  
Thomas Messina ◽  
Hongtao Fan ◽  
...  

2014 ◽  
Author(s):  
Han Fang ◽  
Yiyang Wu ◽  
Giuseppe Narzisi ◽  
Jason A. O'Rawe ◽  
Laura T. Jimenez Barrón ◽  
...  

Background INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, INDEL variant calling remains error-prone, with errors driven by library preparation, sequencing biases, and algorithmic artifacts.
Methods We characterized whole-genome sequencing (WGS), whole-exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on coverage and composition to rank high- and low-quality INDEL calls. We performed a large-scale validation experiment on 600 loci and found that high-quality INDELs have a substantially lower error rate than low-quality INDELs (7% vs. 51%).
Results Simulation and experimental data show that assembly-based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (52%), and WGS data uniquely identify 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (85% vs. 54%), and WES misses many large INDELs. In addition, the concordance of INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identify 6.3-fold more low-quality INDELs. Furthermore, accurate detection of heterozygous INDELs with Scalpel requires 1.2-fold higher coverage than for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data.
Conclusions Overall, we show that the accuracy of INDEL detection with WGS is much greater than with WES, even within the targeted regions. We calculated that 60× WGS coverage on the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (e.g. capture deficiency, PCR amplification, homopolymers) with various data, which will serve as a guideline for effectively reducing INDEL errors in genome sequencing.
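Homopolymer A/T runs are a classic source of spurious INDEL calls, so a simple composition-based filter flags calls whose flanking sequence contains a long single-base run or whose coverage is thin. A minimal sketch of such a check follows; the run-length (≥5 bp) and depth (<20×) thresholds are illustrative assumptions, not the classification scheme from the paper:

```python
# Minimal sketch: flag INDEL calls as lower-confidence when they sit
# next to a homopolymer A/T run or have thin read coverage. The
# thresholds here are illustrative only; the paper's actual scheme
# ranks calls by coverage and composition in more detail.

import re

def longest_at_run(flank: str) -> int:
    """Length of the longest A-only or T-only run in a flanking sequence."""
    runs = re.findall(r"A+|T+", flank.upper())
    return max((len(r) for r in runs), default=0)

def indel_confidence(flank_seq: str, depth: int,
                     min_depth: int = 20, max_run: int = 5) -> str:
    if depth < min_depth:
        return "low (thin coverage)"
    if longest_at_run(flank_seq) >= max_run:
        return "low (homopolymer A/T context)"
    return "high"

print(indel_confidence("GCGAAAAAAGC", depth=45))  # low (homopolymer A/T context)
print(indel_confidence("GCGATCGTAGC", depth=45))  # high
```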


2017 ◽  
Author(s):  
Mark J.P. Chaisson ◽  
Ashley D. Sanders ◽  
Xuefang Zhao ◽  
Ankit Malhotra ◽  
David Porubsky ◽  
...  

ABSTRACT The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome—most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.
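The 50 bp size threshold separating indels from SVs used here is the conventional cutoff in most SV studies. A minimal sketch of size-based classification from VCF-style REF/ALT alleles (the handling of symbolic SV alleles and breakends is deliberately simplified for illustration):

```python
# Minimal sketch: classify a variant as SNV/MNV, indel (<50 bp), or
# SV (>= 50 bp) from the lengths of its REF and ALT alleles,
# following the conventional size cutoff. Symbolic alleles (e.g.
# "<INV>") and breakend notation get only a crude catch-all here.

SV_THRESHOLD = 50  # bp; indels are < 50 bp, SVs are >= 50 bp

def classify_variant(ref: str, alt: str) -> str:
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV (symbolic/breakend allele)"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "SNV/MNV"
    return "SV" if size >= SV_THRESHOLD else "indel"

print(classify_variant("A", "T"))             # SNV/MNV
print(classify_variant("ACGT", "A"))          # indel (3 bp deletion)
print(classify_variant("A", "A" + "T" * 60))  # SV (60 bp insertion)
```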


2020 ◽  
Vol 66 (1) ◽  
pp. 39-52
Author(s):  
Tomoya Tanjo ◽  
Yosuke Kawai ◽  
Katsushi Tokunaga ◽  
Osamu Ogasawara ◽  
Masao Nagasaki

Abstract Studies in human genetics deal with a plethora of human genome sequencing data, generated both from new specimens and from data available in the public domain. With the development of various bioinformatics applications, managing human genome data and analyzing downstream data efficiently are essential to maintaining research productivity. This review aims to guide researchers through processing and analyzing these large-scale genomic data to extract information relevant for improved downstream analyses. Here, we discuss worldwide human genome projects whose data can be integrated into any analysis. Because obtaining, storing, and processing human whole-genome sequencing data are costly, we focus on data formats and software developed to manipulate whole-genome sequencing data efficiently. Once a dataset and its format and processing tools are selected, a computational platform is required; for the platform, we describe a multi-cloud strategy that balances cost, performance, and customizability. Good published research relies on data reproducibility to ensure the quality of results, reusability for application to other datasets, and scalability for future increases in dataset size. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines, which differ from those for model organisms, that are indispensable for human genomic data analysis. Finally, we summarize an ideal future perspective on data processing and analysis.
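Workflow engines make multi-step genome analyses reproducible by capturing each step's tools, inputs, and parameters declaratively, and by reusing cached results only when nothing has changed. The Python fragment below is a minimal sketch of that core idea; real engines (CWL runners, Nextflow, Snakemake) implement it far more robustly, and the alignment step, file names, and command line here are hypothetical:

```python
# Minimal sketch of what a workflow engine automates for
# reproducibility: each step is a declarative description (command
# template + input files), and a content hash of that description
# decides whether a cached result is still valid. Real engines add
# scheduling, container support, and cloud execution on top.

import hashlib
import pathlib

def file_digest(path: str) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def step_key(command_template: str, inputs: dict[str, str]) -> str:
    """Cache key covering the command line and every input file."""
    h = hashlib.sha256(command_template.encode())
    for name in sorted(inputs):
        h.update(name.encode())
        h.update(file_digest(inputs[name]).encode())
    return h.hexdigest()

# Toy input files so the sketch runs end to end; the hypothetical
# alignment step reruns only if reads, reference, or command change.
pathlib.Path("ref.fa").write_text(">chr1\nACGTACGT\n")
pathlib.Path("sample.fastq").write_text("@r1\nACGT\n+\nIIII\n")

command = "bwa mem -t 8 {reference} {reads} > {output}"
inputs = {"reference": "ref.fa", "reads": "sample.fastq"}
print("cache key:", step_key(command, inputs)[:16], "...")
```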


mSystems ◽  
2017 ◽  
Vol 2 (4) ◽  
Author(s):  
Morgan N. Price ◽  
Adam P. Arkin

ABSTRACT Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/ .
IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.
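The lookup PaperBLAST describes, matching a query protein against a database of literature-linked sequences with standard BLAST, can be sketched with the NCBI BLAST+ command-line tools. The fragment below is a hedged illustration of that pipeline shape, not PaperBLAST's own code: the file names and the snippet mapping are invented, and BLAST+ (makeblastdb, blastp) must be installed and on PATH:

```python
# Minimal sketch of the PaperBLAST idea: BLAST a query protein
# against a database of sequences that have been linked to papers,
# then report a literature snippet for each hit. Requires NCBI
# BLAST+ on PATH; file names and the snippet mapping are invented.

import subprocess

# Hypothetical mapping from sequence IDs to literature snippets,
# standing in for the links PaperBLAST mines from EuropePMC and
# curated resources.
PAPER_SNIPPETS = {
    "P12345": "'...this kinase phosphorylates substrate X...' (Doe et al. 2015)",
}

# Build the protein database once from literature-linked sequences.
subprocess.run(["makeblastdb", "-in", "linked_proteins.fasta",
                "-dbtype", "prot", "-out", "paperdb"], check=True)

# Search the query; tabular output gives hit ID, identity, e-value.
result = subprocess.run(
    ["blastp", "-query", "query.fasta", "-db", "paperdb",
     "-outfmt", "6 sseqid pident evalue", "-evalue", "1e-5"],
    capture_output=True, text=True, check=True)

for line in result.stdout.splitlines():
    seq_id, pident, evalue = line.split("\t")
    snippet = PAPER_SNIPPETS.get(seq_id, "(no snippet on record)")
    print(f"{seq_id}  {pident}% id  e={evalue}  {snippet}")
```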

