NGSEP3: accurate variant calling across species and sequencing protocols

Daniel Tello; Juanita Gil; Cristian D Loaiza; John J Riascos; Nicolás Cardozo; Jorge Duitama

doi:10.1093/bioinformatics/btz275

Bioinformatics ◽

10.1093/bioinformatics/btz275 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4716-4723 ◽

Cited By ~ 7

Author(s):

Daniel Tello ◽

Juanita Gil ◽

Cristian D Loaiza ◽

John J Riascos ◽

Nicolás Cardozo ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Comparative Accuracy ◽

Downstream Analysis ◽

Short Tandem

Abstract. Motivation: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. Results: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparable accuracy and better efficiency than the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. Availability and implementation: NGSEP is available as open source software at http://ngsep.sf.net. Supplementary information: Supplementary data are available at Bioinformatics online.

STRipy: a graphical application for enhanced genotyping of pathogenic short tandem repeats in sequencing data

10.1101/2021.06.13.448220 ◽

2021 ◽

Author(s):

Andreas Halman ◽

Egor Dolzhenko ◽

Alicia Oshlack

Keyword(s):

Genome Sequencing ◽

Fragment Length ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Causal Variant ◽

Sequencing Data ◽

Pathogenic Variants ◽

Set Up ◽

Short Tandem

Abstract. Short tandem repeats (STRs) are highly polymorphic with high mutation rates, and expansions of STRs have been implicated as the causal variant in diseases. The application of genome sequencing in patients has recently allowed many new discoveries, with over 50 disease-causing loci known to date. There are several tools which allow genotyping of STRs from high-throughput sequencing (HTS) data. However, running these tools out of the box only allows around half of the known disease-causing loci to be genotyped, with lengths often limited to either read or fragment length, which is less than the pathogenic cut-off for some diseases. While analysis tools can be customised to genotype extra loci, this requires proficiency in bioinformatics to set up, use, and analyse the resulting data, limiting their widespread usage by other researchers and clinicians. To address these issues, we have created new software called STRipy that has an intuitive graphical interface and requires no specific skills for usage, thus significantly simplifying detection of STR expansions from human HTS data. STRipy is able to target all known disease-causing STRs with genotyping performed with an established tool, ExpansionHunter, that is incorporated into the software. We have built additional functionality into STRipy to work with long alleles exceeding the fragment length. STRipy was validated using over 60 thousand simulated samples and was shown to work on whole genome sequencing of biological samples with pathogenic variants. Finally, we have used STRipy to acquire genotypes of pathogenic loci for thousands of samples from various populations, which are provided to the user along with the data from the literature to assist with results interpretation. We believe the simplicity and breadth of STRipy will increase the testing of STR diseases in current datasets, resulting in further diagnoses of rare diseases caused by STR expansions.

Whisper: Read sorting allows robust mapping of sequencing data

10.1101/240358 ◽

2017 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Debudaj-Grabysz ◽

Adam Gudyś ◽

Szymon Grabowski

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Suffix Arrays ◽

Link Type ◽

Mapping Tool ◽

Reverse Complement ◽

Comparable Accuracy

Abstract. Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results: We present Whisper, an accurate and high-performant mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in variant calling pipeline). Availability: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. Contact: [email protected] Supplementary information: Supplementary data are available at publisher Web site.
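The idea of sorting reads and looking them up in suffix arrays of the reference and its reverse complement can be illustrated with a toy sketch. This is not Whisper's implementation: it handles exact matches only, builds the suffix arrays naively, and all function names (`reverse_complement`, `build_suffix_array`, `find_exact`, `map_reads`) are hypothetical names chosen for illustration.

```python
# Toy illustration only: exact-match read mapping with suffix arrays.
# All names here are hypothetical; a real mapper tolerates mismatches and indels.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text: str, suffix_array: list[int], read: str) -> list[int]:
    """Binary-search the suffix array for all exact occurrences of `read`."""
    m = len(read)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < read:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are contiguous in the array, starting at `lo`.
    while lo < len(suffix_array) and text[suffix_array[lo]:suffix_array[lo] + m] == read:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def map_reads(reference: str, reads: list[str]) -> dict[str, list[tuple[int, str]]]:
    """Map each distinct read to (forward-strand position, strand) pairs."""
    rc = reverse_complement(reference)
    sa_fwd = build_suffix_array(reference)
    sa_rev = build_suffix_array(rc)
    n = len(reference)
    results = {}
    # Processing reads in sorted order means consecutive queries touch
    # neighbouring regions of the suffix arrays.
    for read in sorted(set(reads)):
        hits = [(pos, "+") for pos in find_exact(reference, sa_fwd, read)]
        # A hit at `pos` in the reverse complement starts at n - pos - len(read)
        # on the forward strand.
        hits += [(n - pos - len(read), "-") for pos in find_exact(rc, sa_rev, read)]
        results[read] = sorted(hits)
    return results
```

For example, `map_reads("GATTACAGG", ["TTAC", "TGTA"])` reports `TTAC` on the forward strand and `TGTA` on the reverse strand.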

Discovery and Quality Analysis of a Comprehensive Set of Structural Variants and Short Tandem Repeats

10.1101/713198 ◽

2019 ◽

Cited By ~ 2

Author(s):

David Jakubosky ◽

Erin N. Smith ◽

Matteo D’Antonio ◽

Marc Jan Bonder ◽

William W. Young Greenwald ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

Wide Spectrum ◽

Quality Analysis ◽

Structural Variants ◽

Sequencing Data ◽

Genetic Studies ◽

Variant Call ◽

Different Types ◽

Short Tandem

Abstract. Structural variants (SVs) and short tandem repeats (STRs) are important sources of genetic diversity but are not routinely analyzed in genetic studies because they are difficult to accurately identify and genotype. Because SVs and STRs range in size and type, it is necessary to apply multiple algorithms that incorporate different types of evidence from sequencing data and employ complex filtering strategies to discover a comprehensive set of high-quality and reproducible variants. Here we assembled a set of 719 deep whole genome sequencing (WGS) samples (mean 42x) from 477 distinct individuals, which we used to discover and genotype a wide spectrum of SV and STR variants using five algorithms. We used 177 unique pairs of genetic replicates to identify factors that affect variant call reproducibility and developed a systematic filtering strategy to create one of the most complete and well characterized maps of SVs and STRs to date.

Accuracy of short tandem repeats genotyping tools in whole exome sequencing data

F1000Research ◽

10.12688/f1000research.22639.1 ◽

2020 ◽

Vol 9 ◽

pp. 200 ◽

Cited By ~ 2

Author(s):

Andreas Halman ◽

Alicia Oshlack

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Error Rate ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Short Tandem

Background: Short tandem repeats are an important source of genetic variation. They are highly mutable and repeat expansions are associated with dozens of human disorders, such as Huntington's disease and spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four different short tandem repeats genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. Methods: The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation with accuracy determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes for evaluating tools on whole exome sequencing data. Results: We determined a relatively good performance of all tools when genotyping repeats of 3-6 bp in length, which could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by the AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. Conclusions: All tools have different strengths and weaknesses and the choice may depend on the application. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best accuracy of genotyping and the highest number of calls.
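The evaluation idea in the Methods can be sketched in a few lines. Males carry a single X chromosome, so a heterozygous call at an X-linked repeat in a male sample can be counted as a genotyping error. The sketch below is only an illustration under stated assumptions: the input layout (tuples of two allele lengths and a coverage value), the function name `male_x_error_rate` and the `min_coverage` parameter are hypothetical, and the loci are assumed to lie outside the pseudoautosomal regions.

```python
# Hypothetical sketch: heterozygous error rate from male X-chromosome STR calls.
# Each call is (allele_a, allele_b, coverage); loci are assumed to be outside
# the pseudoautosomal regions, so every true genotype is hemizygous.

def male_x_error_rate(calls, min_coverage=0):
    """Return the fraction of retained calls that are heterozygous.

    Calls with coverage below `min_coverage` are filtered out first.
    Returns None when no call passes the filter.
    """
    total = 0
    heterozygous = 0
    for allele_a, allele_b, coverage in calls:
        if coverage < min_coverage:
            continue
        total += 1
        if allele_a != allele_b:
            heterozygous += 1
    if total == 0:
        return None
    return heterozygous / total


# Toy data: repeat-unit counts for both called alleles plus read coverage.
example_calls = [(10, 10, 30), (12, 13, 8), (5, 5, 4), (7, 7, 25)]
unfiltered_rate = male_x_error_rate(example_calls)
filtered_rate = male_x_error_rate(example_calls, min_coverage=10)
```

Raising `min_coverage` lowers the number of retained calls, which mirrors the trade-off between accuracy and number of calls described in the Conclusions.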

A worldwide map of swine short tandem repeats and their associations with evolutionary and environmental adaptations

Genetics Selection Evolution ◽

10.1186/s12711-021-00631-4 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Zhongzi Wu ◽

Huanfa Gong ◽

Mingpeng Zhang ◽

Xinkai Tong ◽

Huashui Ai ◽

...

Keyword(s):

Environmental Variables ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Nucleotide Polymorphisms ◽

Loss Of Function ◽

Sequencing Data ◽

Wild Boars ◽

A Genome ◽

Average Annual Temperature ◽

Short Tandem

Abstract. Background: Short tandem repeats (STRs) are genetic markers with a greater mutation rate than single nucleotide polymorphisms (SNPs) and are widely used in genetic studies and forensics. However, most studies in pigs have focused only on SNPs or on a limited number of STRs. Results: This study screened 394 deep-sequenced genomes from 22 domesticated pig breeds/populations worldwide, wild boars from both Europe and Asia, and numerous outgroup Suidaes, and identified a set of 878,967 polymorphic STRs (pSTRs), which represents the largest repository of pSTRs in pigs to date. We found multiple lines of evidence that pSTRs in coding regions were affected by purifying selection. The enrichment of trinucleotide pSTRs in coding sequences (CDS), 5′UTR and H3K4me3 regions suggests that trinucleotide STRs serve as important components in the exons and promoters of the corresponding genes. We demonstrated that, compared to SNPs, pSTRs provide comparable or even greater accuracy in determining the breed identity of individuals. We identified pSTRs that showed significant population differentiation between domestic pigs and wild boars in Asia and Europe. We also observed that some pSTRs were significantly associated with environmental variables, such as average annual temperature or altitude of the originating sites of Chinese indigenous breeds, among which we identified loss-of-function and/or expanded STRs overlapping with genes such as AHR, LAS1L and PDK1. Finally, our results revealed that several pSTRs show stronger signals than the flanking SNPs within a 100-kb window, both in differentiation between domestic pigs and wild boars and in association with the analysed environmental variables. Conclusions: This study provides a genome-wide high-density map of pSTRs in diverse pig populations based on genome sequencing data, enabling a more comprehensive characterization of their roles in evolutionary and environmental adaptation.

Genome-wide profiling of heritable and de novo STR variations

10.1101/077727 ◽

2016 ◽

Cited By ~ 7

Author(s):

Thomas Willems ◽

Dina Zielinski ◽

Assaf Gordon ◽

Melissa Gymrek ◽

Yaniv Erlich

Keyword(s):

Tandem Repeats ◽

High Throughput Sequencing ◽

De Novo ◽

Genetic Diseases ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Short Tandem

Abstract. Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, STRs have proven problematic to genotype from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping, haplotyping, and phasing STRs from whole genome sequencing data and report a genome-wide analysis and validation of de novo STR mutations.

The Plateau Method for Forensic DNA SNP Mixture Deconvolution

10.1101/225805 ◽

2017 ◽

Cited By ~ 3

Author(s):

Darrell O. Ricke ◽

Joe Isaacson ◽

James Watkins ◽

Philip Fremont-Smith ◽

Tara Boettcher ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Nucleotide Polymorphisms ◽

Forensic Dna ◽

Dna Mixtures ◽

Single Nucleotide ◽

Mixture Deconvolution ◽

Dna Mixture ◽

Short Tandem

Abstract. Identification of individuals in complex DNA mixtures remains a challenge for forensic analysts. Recent advances in high throughput sequencing (HTS) are enabling analysis of DNA mixtures with expanded panels of Short Tandem Repeats (STRs) and/or Single Nucleotide Polymorphisms (SNPs). We present the plateau method for direct SNP DNA mixture deconvolution into sub-profiles based on differences in contributors’ DNA concentrations in the mixtures in the absence of matching reference profiles. The Plateau method can detect profiles of individuals whose contribution is as low as 1/200 in a DNA mixture (patent pending).

Accuracy of short tandem repeats genotyping tools in whole exome sequencing data

10.1101/2020.02.03.933002 ◽

2020 ◽

Author(s):

Andreas Halman ◽

Alicia Oshlack

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Error Rate ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Short Tandem

Abstract. Background: Short tandem repeats are an important source of genetic variation; they are highly mutable and repeat expansions are associated with dozens of human disorders, such as Huntington’s disease and spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four different short tandem repeats genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. Methods: The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation with accuracy determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes for evaluating tools on whole exome sequencing data. Results: We determined a relatively good performance of all tools when genotyping repeats of 3-6 bp in length, which could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by the AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. Conclusions: All tools have different strengths and weaknesses and the choice may depend on the type of analysis. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best accuracy of genotyping and the highest number of calls.

STRScan: targeted profiling of short tandem repeats in whole-genome sequencing data

BMC Bioinformatics ◽

10.1186/s12859-017-1800-z ◽

2017 ◽

Vol 18 (S11) ◽

Cited By ~ 7

Author(s):

Haixu Tang ◽

Etienne Nzabarushimana

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Targeted Profiling ◽

Short Tandem

Bivartect: accurate and memory-saving breakpoint detection by direct read comparison

Bioinformatics ◽

10.1093/bioinformatics/btaa059 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2725-2730

Author(s):

Keisuke Shimmura ◽

Yuki Kato ◽

Yukio Kawahara

Keyword(s):

Genome Editing ◽

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Node ◽

Single Nucleotide ◽

Target Sites

Abstract. Motivation: Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding of disease mechanism and detection of potential off-target sites in genome editing. Since most of the variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, variant calling remains challenging in terms of predicting variants with low false positives. Results: Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows Bivartect to run on a computer with a single node for analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for detection of single nucleotide variants, even though it yielded a substantially small number of candidates. These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations as well as off-target sites introduced during genome editing with high accuracy. Availability and implementation: Bivartect is implemented in C++ and available along with in silico simulated data at https://github.com/ykat0/bivartect. Supplementary information: Supplementary data are available at Bioinformatics online.
