Comparison of sequencing data processing pipelines and application to underrepresented African human populations

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gwenna Breton ◽  
Anna C. V. Johansson ◽  
Per Sjödin ◽  
Carina M. Schlebusch ◽  
Mattias Jakobsson

Abstract
Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies and bioinformatic tools, and the number of available genomes keeps increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, such studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges arising from human diversity and stratification.
Results: We surveyed 29 studies using high-throughput sequencing data and compared their strategies for data pre-processing and variant calling. We found that data processing varies considerably across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and in a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals and compared them in terms of the number of called variants and the overlap of the callsets. We found that the pipelines produced similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals and noted that including more individuals at the joint genotyping step changed the counts of variants. At the individual level, we observed that average genome coverage correlated with the number of variants called.
Conclusions: We conclude that applying the GATK "Best Practices" pipeline, including its recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for a coverage of >30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.
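For orientation, the sketch below outlines a GATK-style per-sample pre-processing and joint-genotyping workflow of the kind compared in the paper. It is a minimal illustration, not the authors' exact pipeline: the reference, known-sites resource, file names and flags are assumptions and should be checked against the GATK version in use.

```python
# Hypothetical file names and resource bundles; flags follow the GATK4 command
# style but are a sketch, not the pipeline evaluated in the paper.
import subprocess

REF = "GRCh38.fasta"             # assumed, pre-indexed reference
KNOWN_SITES = "dbsnp.vcf.gz"     # assumed recalibration resource

def preprocess_and_call(sample, fq1, fq2):
    """Per-sample alignment, duplicate marking, BQSR and GVCF calling."""
    cmds = [
        f"bwa mem -R '@RG\\tID:{sample}\\tSM:{sample}' {REF} {fq1} {fq2} "
        f"| samtools sort -o {sample}.bam -",
        f"gatk MarkDuplicates -I {sample}.bam -O {sample}.md.bam -M {sample}.metrics",
        f"gatk BaseRecalibrator -I {sample}.md.bam -R {REF} "
        f"--known-sites {KNOWN_SITES} -O {sample}.recal.table",
        f"gatk ApplyBQSR -I {sample}.md.bam -R {REF} "
        f"--bqsr-recal-file {sample}.recal.table -O {sample}.recal.bam",
        f"gatk HaplotypeCaller -R {REF} -I {sample}.recal.bam "
        f"-O {sample}.g.vcf.gz -ERC GVCF",
    ]
    for cmd in cmds:
        subprocess.run(cmd, shell=True, check=True)

def joint_genotype(samples):
    """Joint genotyping across the cohort; including more samples at this
    step is what changes the resulting variant counts, as noted above."""
    gvcfs = " ".join(f"-V {s}.g.vcf.gz" for s in samples)
    subprocess.run(f"gatk CombineGVCFs -R {REF} {gvcfs} -O cohort.g.vcf.gz",
                   shell=True, check=True)
    subprocess.run(f"gatk GenotypeGVCFs -R {REF} -V cohort.g.vcf.gz -O cohort.vcf.gz",
                   shell=True, check=True)
```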

2021 ◽  
Author(s):  
H. Serhat Tetikol ◽  
Kubra Narci ◽  
Deniz Turgut ◽  
Gungor Budak ◽  
Ozem Kalay ◽  
...  

Abstract Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information of different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, for graph augmentation with structural variants, and for resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read-mapping error, increased variant-calling sensitivity, and the benefits of joint variant calling without the need for computationally intensive post-processing steps.
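As a toy illustration of one ingredient of such graph curation, the sketch below filters variants by allele frequency in a target population before graph construction. It is an assumption-laden example, not the authors' method; the AF_afr INFO tag and file names are hypothetical.

```python
# Minimal sketch: keep only variants common in the target population as
# candidates for augmenting a population-specific genome graph.
import gzip

def select_graph_variants(vcf_path, af_tag="AF_afr", min_af=0.01):
    """Yield VCF data lines whose target-population allele frequency >= min_af."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            info = line.rstrip("\n").split("\t")[7]
            fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
            af = float(fields.get(af_tag, "0").split(",")[0])  # first ALT allele only
            if af >= min_af:
                yield line

# The retained records could then be handed to a graph construction tool
# (e.g. `vg construct -r ref.fa -v selected.vcf.gz`) to build the tailored graph.
```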


2019 ◽  
Author(s):  
Elena Nabieva ◽  
Satyarth Mishra Sharma ◽  
Yermek Kapushev ◽  
Sofya K. Garushyants ◽  
Anna V. Fedotova ◽  
...  

Abstract High-throughput sequencing of fetal DNA is a promising and increasingly common method for the discovery of all (or all coding) genetic variants in the fetus, either as part of prenatal screening or diagnosis, or for genetic diagnosis of spontaneous abortions. In many cases, the fetal DNA (from chorionic villi, amniotic fluid, or abortive tissue) can be contaminated with maternal cells, resulting in a mixture of fetal and maternal DNA. This maternal cell contamination (MCC) undermines the assumption, made by traditional variant callers, that each allele in a heterozygous site is covered, on average, by 50% of the reads, and can therefore lead to erroneous genotype calls. We present a panel of methods for reducing the genotyping error in the presence of MCC. All methods start from the output of GATK HaplotypeCaller on the sequencing data for the (contaminated) fetal sample and both of its parents, and additionally rely on an estimate of the MCC fraction (which itself is readily obtained from the high-throughput sequencing data). The first method uses a Bayesian probabilistic model to correct the fetal genotype calls produced by the MCC-unaware HaplotypeCaller. The other two methods "learn" the genotype-correction model from examples. We use simulated contaminated fetal data to train and test the models. Using the test sets, we show that all three methods lead to substantially improved accuracy compared with the original MCC-unaware HaplotypeCaller calls. We then apply the best-performing method to three chorionic villus samples from spontaneously terminated pregnancies.
Code and training data availability: https://github.com/bazykinlab/ML-maternal-cell-contamination
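The idea behind MCC-aware genotype correction can be illustrated with a simplified binomial model, shown below. This is a rough sketch under a flat prior, not the authors' Bayesian model or their trained machine-learning correctors.

```python
# Simplified illustration: with contamination fraction c, the expected ALT-read
# fraction at a biallelic site mixes fetal and maternal allele dosages.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

def fetal_genotype_posterior(alt_reads, depth, maternal_dosage, c, error=0.01):
    """Posterior over fetal ALT-allele dosage (0, 1, 2), assuming a flat prior
    and a binomial read-count model (a sketch, not the published model)."""
    likes = []
    for fetal_dosage in (0, 1, 2):
        # Mixture of fetal and maternal contributions, kept away from 0/1
        # by a small sequencing-error term.
        p = (1 - c) * fetal_dosage / 2 + c * maternal_dosage / 2
        p = min(max(p, error), 1 - error)
        likes.append(binom_pmf(alt_reads, depth, p))
    total = sum(likes)
    return [x / total for x in likes]

# Example: 30x site with 40% ALT reads, heterozygous mother, 30% contamination.
print(fetal_genotype_posterior(alt_reads=12, depth=30, maternal_dosage=1, c=0.3))
```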


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3406 ◽  
Author(s):  
Koji Ishiya ◽  
Shintaroh Ueda

Recent rapid advances in high-throughput, next-generation sequencing (NGS) technologies have promoted mitochondrial genome studies in the fields of human evolution, medical genetics, and forensic casework. However, scientists unfamiliar with computer programming often find it difficult to handle the massive volumes of data generated by NGS. To address this limitation, we developed MitoSuite, a user-friendly graphical tool for the analysis of data from high-throughput sequencing of the human mitochondrial genome. MitoSuite generates a visual report on NGS data with simple mouse operations. Moreover, it handles high-coverage sequencing data yet runs on a stand-alone computer, without the need to upload files. MitoSuite therefore offers outstanding usability for handling massive NGS data and is ideal for evolutionary, clinical, and forensic studies of human mitochondrial genome variation. It is freely available for download from the website https://mitosuite.com.


2013 ◽  
Vol 7 (Suppl 6) ◽  
pp. S8 ◽  
Author(s):  
Takahiro Mimori ◽  
Naoki Nariai ◽  
Kaname Kojima ◽  
Mamoru Takahashi ◽  
Akira Ono ◽  
...  

2018 ◽  
Author(s):  
Simon P Sadedin ◽  
Alicia Oshlack

Abstract
Background: As the costs of high-throughput sequencing have fallen, vast quantities of short-read genomic data are being generated. Often, the data are exchanged and stored as aligned reads, which provides high compression and convenient access for many analyses. However, aligned data become outdated as new reference genomes and alignment methods become available. Moreover, some applications cannot use pre-aligned reads at all, necessitating conversion back to the raw format (FASTQ) before they can be used. In both cases, the process of extraction and realignment is expensive and time consuming.
Findings: We describe Bazam, a tool that efficiently extracts the original paired FASTQ from reads stored in aligned form (BAM or CRAM format). Bazam extracts reads in a format that directly allows realignment with popular aligners with high concurrency. By eliminating steps and increasing the accessible concurrency, Bazam facilitates up to a 90% reduction in the time required for realignment compared to standard methods. Bazam supports selective extraction of read pairs from focused genomic regions, further increasing efficiency for targeted analyses. Bazam is additionally suitable as a base for other applications that require efficient access to paired read information, such as quality control, structural variant calling and alignment comparison.
Conclusions: Bazam offers significant improvements for users needing to realign genomic data.
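For readers unfamiliar with the problem Bazam solves, the sketch below shows a naive pysam-based extraction of paired FASTQ from a BAM. It is not Bazam itself: it buffers unmatched mates in memory and lacks the streaming and concurrency features that give Bazam its speed.

```python
# Toy BAM-to-FASTQ extraction (not Bazam): pair mates by name in memory,
# restore original read orientation, write interleaved FASTQ.
import pysam

COMP = str.maketrans("ACGTN", "TACGN")

def to_fastq(read):
    seq = read.query_sequence
    qual = "".join(chr(q + 33) for q in read.query_qualities)
    if read.is_reverse:                       # restore original orientation
        seq = seq.translate(COMP)[::-1]
        qual = qual[::-1]
    return f"@{read.query_name}\n{seq}\n+\n{qual}\n"

def bam_to_interleaved_fastq(bam_path, fastq_path):
    pending = {}                              # first-seen mate, keyed by read name
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(fastq_path, "w") as out:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary or not read.is_paired:
                continue
            mate = pending.pop(read.query_name, None)
            if mate is None:
                pending[read.query_name] = read
            else:
                r1, r2 = (mate, read) if mate.is_read1 else (read, mate)
                out.write(to_fastq(r1) + to_fastq(r2))

# The interleaved output can then be piped into an aligner (e.g. `bwa mem -p`),
# which is the realignment use case Bazam targets.
```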


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Mirjam Zünd ◽  
Hans-Joachim Ruscheweyh ◽  
Christopher M. Field ◽  
Natalie Meyer ◽  
Miguelangel Cuenca ◽  
...  

Abstract
Background: Temperate phages influence the density, diversity and function of bacterial populations. Historically, they have been described as carriers of toxins. More recently, they have also been recognised as direct modulators of the gut microbiome, and indirectly of host health and disease. Despite recent advances in studying prophages using non-targeted sequencing approaches, methodological challenges in identifying inducible prophages in bacterial genomes and quantifying their activity have limited our understanding of prophage-host interactions.
Results: We present methods for using high-throughput sequencing data to locate inducible prophages, including previously undiscovered ones, to quantify prophage activity and to investigate their replication. We first used the well-established Salmonella enterica serovar Typhimurium/P22 system to validate our methods for (i) quantifying phage-to-host ratios and (ii) accurately locating inducible prophages in the reference genome based on phage-to-host ratio differences and read-alignment alterations between induced and non-induced conditions. Investigating prophages in bacterial strains from a murine gut model microbiota known as Oligo-MM12 or sDMDMm2, we located five novel inducible prophages in three strains, quantified their activity and showed signatures of lateral transduction potential for two of them. Furthermore, we show that the methods are also applicable to metagenomes of induced faecal samples from Oligo-MM12 mice, including for strains with a relative abundance below 1%, illustrating their potential for the discovery of inducible prophages in more complex metagenomes. Finally, we show that predictions of prophage locations in the reference genomes of the strains we studied were variable and inconsistent across the four bioinformatic tools we tested, which highlights the importance of experimental validation.
Conclusions: This study demonstrates that the integration of experimental induction and bioinformatic analysis presented here is a powerful approach to accurately locate inducible prophages using high-throughput sequencing data and to quantify their activity. The ability to generate such quantitative information will be critical for gaining better insights into the factors that determine phage activity and into how prophage-bacteria interactions influence our microbiome and impact human health.
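A bare-bones version of the phage-to-host coverage comparison might look like the sketch below. It is illustrative only: it assumes indexed BAMs, omits library-size normalisation and the read-alignment signatures used in the actual analysis, and its window size and threshold are arbitrary.

```python
# Rough sketch: flag genome windows whose read depth rises sharply in the
# induced sample relative to the non-induced control, as candidate inducible
# prophage regions. File names, window size and threshold are assumptions.
import pysam

def candidate_prophage_windows(induced_bam, control_bam, contig, length,
                               window=5000, min_ratio=5.0):
    hits = []
    with pysam.AlignmentFile(induced_bam, "rb") as ind, \
         pysam.AlignmentFile(control_bam, "rb") as ctl:
        for start in range(0, length, window):
            end = min(start + window, length)
            n_ind = ind.count(contig, start, end)   # requires .bai index
            n_ctl = ctl.count(contig, start, end)
            ratio = n_ind / max(n_ctl, 1)           # avoid division by zero
            if ratio >= min_ratio:
                hits.append((start, end, ratio))
    return hits
```

In practice the counts would first be normalised by total library size before taking the ratio.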


2013 ◽  
Vol 7 (1) ◽  
pp. 1-8 ◽  
Author(s):  
Erik Aronesty

High-throughput sequencing (HTS) has resulted in extreme growth rates of sequencing data. At our lab, we generate terabytes of data every day. Sequencing output usually needs to be "cleaned" and processed in various ways prior to use in common tasks such as variant calling, expression quantification and assembly. Two common pre-processing tasks associated with HTS are adapter trimming and paired-end joining. I have developed two tools at Expression Analysis, Inc. to address these tasks: fastq-mcf and fastq-join. I compared the performance of these tools to that of similar open-source utilities, in terms of both resource efficiency and effectiveness.
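To make the adapter-trimming task concrete, the following toy function removes a 3' adapter by exact matching. It is far simpler than fastq-mcf, which allows mismatches and uses quality-aware scoring; the adapter sequence shown is only an example.

```python
# Toy 3' adapter trimming: cut at a full adapter match, or at a terminal
# partial match of at least min_overlap bases.
ADAPTER = "AGATCGGAAGAGC"   # common Illumina adapter prefix; an assumption here

def trim_adapter(seq, adapter=ADAPTER, min_overlap=5):
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Check for a partial adapter hanging off the 3' end of the read.
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k]
    return seq

print(trim_adapter("ACGTACGTACGTAGATCGGAAGAGCTTTT"))  # -> ACGTACGTACGT
```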


2021 ◽  
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen ◽  
Kristian Hanghøj

Abstract Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, it has become increasingly common to leverage large sample sizes from populations with similar ancestries. This has led to the need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. Here, we present two selection statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low- and/or variable-coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics control the false positive rate and identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes. Moreover, we show that PCAngsd outperforms selection statistics obtained from genotypes called from low-coverage sequencing data.
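A genotype-based analogue of such a PCA selection scan is sketched below for orientation. It operates on called genotype dosages rather than genotype likelihoods, so it is not the PCAngsd implementation, and it assumes all SNPs are polymorphic.

```python
# Sketch of a PCA-based selection scan on called genotypes: the squared
# correlation between each SNP and a principal component, scaled by sample
# size, is approximately chi-square(1) distributed under neutrality.
import numpy as np

def pca_selection_scan(G, n_pcs=2):
    """G: (n_snps, n_individuals) matrix of genotype dosages 0/1/2.
    Assumes every SNP is polymorphic (0 < allele frequency < 1)."""
    freqs = G.mean(axis=1, keepdims=True) / 2.0
    X = (G - 2.0 * freqs) / np.sqrt(2.0 * freqs * (1.0 - freqs))  # standardised SNPs
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    n = G.shape[1]
    stats = np.empty((G.shape[0], n_pcs))
    for k in range(n_pcs):
        pc = Vt[k]                                   # per-individual PC scores
        r = (X @ pc) / (np.linalg.norm(X, axis=1) * np.linalg.norm(pc))
        stats[:, k] = n * r ** 2                     # test statistic per SNP and PC
    return stats
```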


2021 ◽  
Vol 9 (4) ◽  
pp. 841 ◽  
Author(s):  
Denis Kutnjak ◽  
Lucie Tamisier ◽  
Ian Adams ◽  
Neil Boonham ◽  
Thierry Candresse ◽  
...  

High-throughput sequencing (HTS) technologies have become indispensable tools for plant virus diagnostics and research thanks to their ability to detect any plant virus in a sample without prior knowledge. As HTS technologies rely heavily on bioinformatic analysis of the huge amounts of sequence data generated, it is of utmost importance that researchers can rely on efficient and reliable bioinformatic tools and understand the principles, advantages, and disadvantages of the tools used. Here, we present a critical overview of the steps involved in HTS as employed for plant virus detection and virome characterization. We start from sample preparation and nucleic acid extraction as appropriate to the chosen HTS strategy, followed by basic data analysis requirements, an extensive overview of the in-depth data processing options, and taxonomic classification of the viral sequences detected. By presenting the bioinformatic tools and a detailed overview of the consecutive steps needed to implement a well-structured HTS data analysis in an easy and accessible way, this paper is targeted at both beginners and expert scientists engaging in HTS plant virome projects.
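As a small, concrete example of the taxonomic-classification step, the sketch below tallies best BLAST hits (tabular output, -outfmt 6) of assembled contigs against a viral reference database. The file name is an assumption, and real pipelines add identity filtering and proper taxonomic assignment.

```python
# Summarise BLAST outfmt 6 results: keep the best-scoring hit per contig,
# then count how many contigs point at each viral reference sequence.
from collections import Counter

def best_hits(blast_tab, max_evalue=1e-5):
    best = {}                                   # contig -> (bitscore, subject)
    with open(blast_tab) as fh:
        for line in fh:
            q, s, *rest = line.rstrip("\n").split("\t")
            evalue, bitscore = float(rest[8]), float(rest[9])
            if evalue > max_evalue:
                continue
            if q not in best or bitscore > best[q][0]:
                best[q] = (bitscore, s)
    return Counter(subject for _, subject in best.values())

print(best_hits("contigs_vs_viruses.outfmt6").most_common(10))
```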


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jochen Bathke ◽  
Gesine Lühken

Abstract
Background: The advent of next-generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK), and the "GATK Best Practices" are a commonly cited recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore the ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations run counter to the goal of offering a standard workflow and hamper reproducibility over time.
Results: A workflow for automated detection of single nucleotide polymorphisms and insertions and deletions offers a wide range of applications in the sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes the performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half.
Conclusions: The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. By reducing the usage of computational resources, the workflow removes previously existing entry barriers to the variant calling field and enables standardized variant calling.
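The kind of JVM tuning described above can be passed to GATK tools through --java-options, as in the sketch below. The heap and garbage-collection values shown are illustrative assumptions, not the optimized settings benchmarked by the authors.

```python
# Illustrative only: wrapping a GATK call with explicit JVM heap and
# garbage-collection settings. Values are assumptions, not the paper's.
import subprocess

JAVA_OPTS = "-Xmx8g -XX:+UseParallelGC -XX:ParallelGCThreads=2"

def run_gatk(tool, *args):
    cmd = ["gatk", "--java-options", JAVA_OPTS, tool, *args]
    subprocess.run(cmd, check=True)

run_gatk("MarkDuplicates",
         "-I", "sample.sorted.bam",
         "-O", "sample.md.bam",
         "-M", "sample.md.metrics.txt")
```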

