A benchmark of transposon insertion detection tools using real data

Mobile DNA ◽  
2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Pol Vendrell-Mir ◽  
Fabio Barteri ◽  
Miriam Merenciano ◽  
Josefa González ◽  
Josep M. Casacuberta ◽  
...  

Abstract Background Transposable elements (TEs) are an important source of genomic variability in eukaryotic genomes. Their activity impacts genome architecture and gene expression and can lead to drastic phenotypic changes. Therefore, identifying TE polymorphisms is key to better understand the link between genotype and phenotype. However, most genotype-to-phenotype analyses have concentrated on single nucleotide polymorphisms, as they are easier to reliably detect using short-read data. Many bioinformatic tools have been developed to identify transposon insertions from resequencing data using short reads. Nevertheless, the performance of most of these tools has been tested using simulated insertions, which do not accurately reproduce the complexity of natural insertions. Results We have overcome this limitation by building a dataset of insertions from the comparison of two high-quality rice genomes, followed by extensive manual curation. This dataset contains validated insertions of two very different types of TEs, LTR-retrotransposons and MITEs. Using this dataset, we have benchmarked the sensitivity and precision of 12 commonly used tools, and our results suggest that in general their sensitivity was previously overestimated when using simulated data. Our results also show that increasing coverage leads to better sensitivity, but at a cost in precision. Moreover, we found important differences in tool performance, with some tools performing better on specific types of TEs. We have also used two sets of experimentally validated insertions in Drosophila and humans and show that this trend holds in genomes of different size and complexity. Conclusions We discuss how tools may be chosen depending on the goals of the study and show that an appropriate combination of tools could be an option for most approaches, increasing sensitivity while maintaining good precision.
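To make the benchmark's core metrics concrete, the sketch below computes sensitivity and precision by matching predicted insertion positions against a curated truth set within a positional window. The 100-bp tolerance and the greedy matching rule are illustrative assumptions, not the paper's exact protocol.

```python
def benchmark(predicted, curated, window=100):
    """Sensitivity and precision of a tool's insertion calls against a
    curated truth set. A predicted position counts as a true positive if
    it falls within `window` bp of a not-yet-matched curated insertion
    (the 100-bp tolerance is an illustrative choice)."""
    matched = set()
    tp = 0
    for pos in predicted:
        hit = next((c for c in curated
                    if c not in matched and abs(c - pos) <= window), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
    fp = len(predicted) - tp   # calls with no nearby curated insertion
    fn = len(curated) - tp     # curated insertions the tool missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

# Two of three calls land near curated insertions; one is spurious
sens, prec = benchmark(predicted=[105, 5000, 9000], curated=[100, 5020, 12000])
```

Raising coverage typically adds both true and spurious calls, which is how sensitivity can rise while precision falls, as the abstract notes.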

GigaScience ◽  
2020 ◽  
Vol 9 (2) ◽  
Author(s):  
Stephen J Bush ◽  
Dona Foster ◽  
David W Eyre ◽  
Emily L Clark ◽  
Nicola De Maio ◽  
...  

Abstract Background Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of the resulting set of SNP calls. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. Results We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, pipeline sensitivity and precision were strongly inversely related to the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. Conclusions The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
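The Mash distance used above is derived from the Jaccard index j of the two genomes' k-mer sets, via D = -ln(2j / (1 + j)) / k. A minimal sketch of that formula follows; note that real Mash estimates j from compact MinHash sketches of large genomes (with k = 21 by default), whereas this toy version computes the exact k-mer Jaccard index for clarity.

```python
import math

def kmers(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq_a, seq_b, k=4):
    """Mash distance from the exact k-mer Jaccard index.
    Real Mash estimates the Jaccard index from MinHash sketches;
    k=4 here is toy-sized so short strings share k-mers."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    j = len(a & b) / len(a | b)
    if j == 0:
        return 1.0  # no shared k-mers: cap the distance at 1
    return -math.log(2 * j / (1 + j)) / k
```

Identical sequences have j = 1 and distance 0; as divergence grows, shared k-mers vanish quickly, which is why the distance tracks average nucleotide divergence.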


2019 ◽  
Author(s):  
Ronald J. Nowling ◽  
Krystal R. Manke ◽  
Scott J. Emrich

Abstract Chromosomal inversions are associated with reproductive isolation and adaptation in insects such as Drosophila melanogaster and the malaria vectors Anopheles gambiae and Anopheles coluzzii. While methods based on read alignment have been useful in humans for detecting inversions, these methods are less successful in insects due to long repeated sequences at the breakpoints. Alternatively, inversions can be detected using principal component analysis (PCA) of single nucleotide polymorphisms (SNPs). We apply PCA-based inversion detection to a simulated data set and real data from multiple insect species, which vary in complexity from a single inversion in samples drawn from a single population to multiple overlapping inversions occurring in closely related species, samples of which were generated from multiple geographic locations. We show empirically that proper analysis of these data can be challenging when multiple inversions or populations are present, and that our alternative framework is more robust in these more difficult scenarios.
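The intuition behind PCA-based detection is that recombination is suppressed between inverted and standard arrangements, so samples separate by karyotype along a principal component of the SNP genotype matrix in the inverted region. A minimal sketch with synthetic genotypes (the allele frequencies and sample sizes are invented for illustration):

```python
import numpy as np

def first_pc(genotypes):
    """Project samples onto the first principal component of a
    (samples x SNPs) genotype matrix coded 0/1/2."""
    X = genotypes - genotypes.mean(axis=0)       # center each SNP
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt: principal axes
    return X @ Vt[0]

# Toy region: four "standard" samples with rare alternate alleles, four
# "inverted" samples where alternate alleles are nearly fixed
rng = np.random.default_rng(0)
std = rng.binomial(2, 0.1, (4, 6)).astype(float)
inv = rng.binomial(2, 0.9, (4, 6)).astype(float)
pc1 = first_pc(np.vstack([std, inv]))
# pc1 splits into two clusters, one per karyotype
```

As the paper argues, this clean picture degrades when several overlapping inversions or population structure contribute competing axes of variation, so the leading components no longer map one-to-one onto karyotypes.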


2021 ◽  
Author(s):  
Thomas K. F. Wong ◽  
Teng Li ◽  
Louis Ranjard ◽  
Steven Wu ◽  
Jeet Sukumaran ◽  
...  

Abstract A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide-polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.


2019 ◽  
Author(s):  
Guihu Zhao ◽  
Jinchen Li ◽  
Yu Tang

Abstract Allele-specific genomic targeting by CRISPR provides a point of entry for personalized gene therapy of dominantly inherited diseases, by selectively disrupting the mutant alleles or disease-causing single nucleotide polymorphisms (SNPs), ideally while leaving normal alleles intact. Moreover, allele-specific engineering has been increasingly exploited not only in treating inherited diseases and mutation-driven cancers, but also in other important fields such as genome imprinting, haploinsufficiency, genome loci imaging and immunocompatible manipulations. Despite the tremendous utility of allele-specific targeting by CRISPR, very few bioinformatic tools have been implemented for this purpose. We thus developed AsCRISPR (Allele-specific CRISPR), a web tool to aid the design of guide RNA (gRNA) sequences that can discriminate between alleles. It allows users with limited bioinformatics skills to analyze both their own identified variants and heterozygous SNPs deposited in the dbSNP database. Multiple CRISPR nucleases and their engineered variants, including the newly developed Cas12b and CasX, are included for users' choice. Meanwhile, AsCRISPR evaluates the on-target efficiencies, specificities and potential off-targets of gRNA candidates, and also displays the allele-specific restriction enzyme sites that might be disrupted upon successful genome edits. In addition, we ran AsCRISPR on dominant single nucleotide variants (SNVs) retrieved from the ClinVar and OMIM databases, and generated a Dominant Database of candidate discriminating gRNAs that may specifically target the alternative allele for each dominant SNV site. A Validated Database was also established, which manually curates the discriminating gRNAs that have been experimentally validated in the literature. AsCRISPR is freely available at http://www.genemed.tech/ascrispr.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5852
Author(s):  
Yu-Yu Lin ◽  
Ping Chun Wu ◽  
Pei-Lung Chen ◽  
Yen-Jen Oyang ◽  
Chien-Yu Chen

Background The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary strategy for resolving haplotypes by reducing conflicts in a single nucleotide polymorphism-fragment matrix. However, the solution with the optimal MEC score is frequently not the real haplotype pair, because MEC methods consider all positions together and conflicts in noisy regions can mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. Results This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confidence variant pairs to build haplotypes progressively. Compared with other MEC-based methods, HAHap achieved lower phasing error rates on both real and simulated data when constructing haplotypes from short-read whole-genome sequencing. We also compared the number of error corrections (ECs) on real data with other methods, which showed that HAHap predicts haplotypes with fewer ECs. Finally, we used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting its applicability in certain situations.
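For readers unfamiliar with the MEC objective, the sketch below scores a candidate haplotype pair against read fragments: each fragment is assigned to whichever haplotype it matches best, and the leftover mismatches (the "corrections") are summed. This is only the scoring function that MEC-based phasers minimize, not any particular tool's search procedure.

```python
def mec_score(fragments, h1, h2):
    """Minimum-error-correction score of a candidate haplotype pair.
    Each fragment maps SNP index -> observed allele; its cost is the
    smaller number of mismatches against h1 or h2."""
    score = 0
    for frag in fragments:
        d1 = sum(allele != h1[i] for i, allele in frag.items())
        d2 = sum(allele != h2[i] for i, allele in frag.items())
        score += min(d1, d2)
    return score

# Three fragments over three SNPs; the last conflicts with both
# haplotypes at one position, so one correction is unavoidable
frags = [{0: 0, 1: 0}, {1: 1, 2: 1}, {0: 0, 2: 1}]
score = mec_score(frags, (0, 0, 0), (1, 1, 1))
```

The abstract's point is visible even at this scale: a single noisy fragment contributes conflicts that a global optimizer may "fix" by corrupting otherwise clean positions, which is what HAHap's progressive local assembly tries to avoid.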


2016 ◽  
Author(s):  
Daniel Fernandes ◽  
Kendra Sirak ◽  
Mario Novak ◽  
John Finarelli ◽  
John Byrne ◽  
...  

Abstract Thomas Kent was an Irish rebel who was executed by British forces in the aftermath of the Easter Rising armed insurrection of 1916 and buried in a shallow grave on Cork prison's grounds. In 2015, ninety-nine years after his death, a state funeral was offered to his living family to honor his role in the struggle for Irish independence. However, inaccuracies in record keeping did not allow the bodily remains that supposedly belonged to Kent to be identified with absolute certainty. Using a novel approach based on homozygous single nucleotide polymorphisms, we identified these remains as those of Kent by comparing his genetic data to that of two known living relatives. Because the degradation of Kent's DNA, characteristic of ancient DNA, rendered traditional methods of relatedness estimation unusable, we forced all loci homozygous, in a process we refer to as the "forced homozygote approach". The results were confirmed using simulated data for different relatedness classes. We argue that this method provides a necessary alternative for relatedness estimation, not only in forensic analysis, but also in ancient DNA studies, where reduced amounts of genetic information can limit the application of traditional methods.
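The core idea, collapsing each diploid genotype to a single allele and then comparing allele sharing between individuals, can be sketched as below. This is only an illustration of the principle; the paper's actual estimator, marker panel, and relatedness thresholds are not reproduced here.

```python
import random

def force_homozygous(diploid_calls, seed=42):
    """Collapse each diploid genotype (a pair of alleles) to a single
    randomly retained allele, mimicking pseudo-haploid calls from
    degraded DNA. A sketch of the 'forced homozygote' idea only."""
    rng = random.Random(seed)
    return [rng.choice(pair) for pair in diploid_calls]

def allele_sharing(calls_a, calls_b):
    """Fraction of loci at which two pseudo-haploid call sets agree;
    closer relatives are expected to share more alleles."""
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)

sample = [("A", "G"), ("C", "C"), ("T", "T"), ("A", "A")]
hap = force_homozygous(sample)
```

Heterozygous sites lose information under this collapse, which is why the authors validate the approach against simulated data for known relatedness classes.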


2014 ◽  
Vol 26 (2) ◽  
pp. 567-582 ◽  
Author(s):  
Zhongxue Chen ◽  
Hon Keung Tony Ng ◽  
Jing Li ◽  
Qingzhong Liu ◽  
Hanwen Huang

In the past decade, hundreds of genome-wide association studies have been conducted to detect the significant single-nucleotide polymorphisms that are associated with certain diseases. However, most of the data from the X chromosome were not analyzed, and only a few significantly associated single-nucleotide polymorphisms on the X chromosome have been identified from genome-wide association studies. This is mainly due to the lack of powerful statistical tests. In this paper, we propose a novel statistical approach that combines the information from single-nucleotide polymorphisms on the X chromosome from both males and females in an efficient way. The proposed approach avoids the need to make strong assumptions about the underlying genetic models. Our proposed statistical test is a robust method that only assumes the risk allele is the same for both females and males if the single-nucleotide polymorphism is associated with the disease in both genders. Through a simulation study and a real data application, we show that the proposed procedure is robust and has excellent performance compared to existing methods. We expect that many more associated single-nucleotide polymorphisms on the X chromosome will be identified if the proposed approach is applied to currently available genome-wide association study data.


2020 ◽  
Vol 36 (19) ◽  
pp. 4957-4959
Author(s):  
David B Blumenthal ◽  
Lorenzo Viola ◽  
Markus List ◽  
Jan Baumbach ◽  
Paolo Tieri ◽  
...  

Abstract Summary Simulated data are crucial for evaluating epistasis detection tools in genome-wide association studies. Existing simulators are limited: they do not account for linkage disequilibrium (LD), support only limited interaction models of single nucleotide polymorphisms (SNPs) and dichotomous phenotypes, or depend on proprietary software. In contrast, EpiGEN supports SNP interactions of arbitrary order, produces realistic LD patterns and generates both categorical and quantitative phenotypes. Availability and implementation EpiGEN is implemented in Python 3 and is freely available at https://github.com/baumbachlab/epigen. Supplementary information Supplementary data are available at Bioinformatics online.
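The kind of data such a simulator produces can be illustrated with a minimal pairwise-interaction model, where a penetrance table gives the case probability for each genotype combination. This toy ignores LD, higher-order interactions, and quantitative traits, all of which EpiGEN handles; the allele frequencies and penetrance values are invented.

```python
import numpy as np

def simulate_epistasis(n, penetrance, maf=(0.3, 0.3), seed=0):
    """Simulate a dichotomous phenotype from a pairwise SNP interaction:
    penetrance[i, j] = P(case | SNP1 genotype i, SNP2 genotype j).
    A toy stand-in for EpiGEN's interaction models."""
    rng = np.random.default_rng(seed)
    g1 = rng.binomial(2, maf[0], n)   # genotypes under Hardy-Weinberg
    g2 = rng.binomial(2, maf[1], n)
    y = (rng.random(n) < penetrance[g1, g2]).astype(int)
    return g1, g2, y

# Interaction-driven model: risk is high only when exactly one of the
# two SNPs carries risk alleles, so neither SNP has a clean marginal effect
pen = np.array([[0.05, 0.90, 0.90],
                [0.90, 0.05, 0.05],
                [0.90, 0.05, 0.05]])
g1, g2, y = simulate_epistasis(5000, pen)
```

Penetrance tables like this one are exactly what makes epistasis hard to detect from single-SNP scans, and hence what detection tools need realistic simulated data for.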


2021 ◽  
Author(s):  
Jiacheng Chuan ◽  
Aiguo Zhou ◽  
Lawrence Richard Hale ◽  
Miao He ◽  
Xiang Li

Abstract Background As Next Generation Sequencing takes a dominant role in terms of output capacity and sequence length, adapters attached to the reads and low-quality bases hinder downstream analyses both directly and implicitly, for example by producing false-positive single nucleotide polymorphisms (SNPs) and generating fragmented assemblies. A fast trimming algorithm is in demand to remove adapters precisely, especially in read tails with relatively low quality. Findings We present a trimming program named Atria. Atria matches the adapters in paired reads and finds possible overlapped regions with a super-fast and carefully designed byte-based matching algorithm (O(n) time with O(1) space). Atria also implements multi-threading in both sequence processing and file compression and supports single-end reads. Conclusions Atria performs favorably against other cutting-edge trimmers in various trimming and runtime benchmarks on both simulated and real data. We also provide an ultra-fast and lightweight byte-based matching algorithm. The algorithm can be used in a broad range of short-sequence matching applications, such as primer search and seed scanning before alignment. Availability & Implementation The Atria executables, source code, and benchmark scripts are available at https://github.com/cihga39871/Atria under the MIT license.
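The trimming problem itself reduces to suffix-prefix matching: when the insert is short, the 3' end of a read runs into the adapter, possibly only partially. The sketch below finds a full adapter occurrence or the longest read suffix that exactly matches an adapter prefix. It is a simplified stand-in for Atria's byte-based matcher, which additionally tolerates mismatches and uses the paired mate read to confirm overlaps.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Trim an adapter from the 3' end of a read: first look for a full
    adapter occurrence, then for the longest read suffix that exactly
    matches an adapter prefix (at least `min_overlap` bases)."""
    # Case 1: the whole adapter is inside the read
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: a partial adapter hangs off the read tail
    for start in range(max(0, len(read) - len(adapter)),
                       len(read) - min_overlap + 1):
        if adapter.startswith(read[start:]):
            return read[:start]
    return read  # no adapter evidence: leave the read untouched

insert = trim_adapter("ACGTACGTAGATCGGA", "AGATCGGAAGAGC")
```

Exact suffix-prefix matching like this is O(n·m) in the worst case; the point of Atria's byte-based algorithm is to get the same decision in O(n) time with O(1) space while remaining robust to the low-quality bases common in read tails.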

