Haplotype Threading: Accurate Polyploid Phasing from Long Reads

Author(s):  
Sven D. Schrinner ◽  
Rebecca Serra Mari ◽  
Jana Ebler ◽  
Mikko Rautiainen ◽  
Lancelot Seillier ◽  
...  

Abstract Resolving genomes at the haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. As a highly complex computational problem, polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WhatsHap polyphase, a novel two-stage approach that addresses these challenges by (i) clustering reads using a position-dependent scoring function and (ii) threading the haplotypes through the clusters by dynamic programming. We demonstrate on a simulated data set that this results in accurate haplotypes with switch error rates around three times lower than those obtainable by the current state of the art, and around seven times lower in regions of collapsing haplotypes. Using a real data set comprising long- and short-read tetraploid potato sequencing data, we show that WhatsHap polyphase is able to phase the majority of the potato genes after error correction, which enables the assembly of local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open-source tool WhatsHap and is ready to be included in production settings.
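To make the two-stage approach concrete, the sketch below illustrates the threading stage only: given per-position read coverage for clusters (assumed to come from the clustering stage), a dynamic program assigns k haplotypes to clusters at every position while penalizing switches. The state space, cost functions, and toy coverage values are our own simplified stand-ins, not WhatsHap polyphase's actual scoring.

```python
# Simplified "threading" stage: at every variant position, choose which
# read cluster each of the k haplotypes follows so that cluster switches
# (a proxy for switch errors) are minimized. Illustration only.
from itertools import combinations_with_replacement

def thread_haplotypes(coverage, k, switch_cost=1.0):
    """coverage: list over variant positions of {cluster_id: read_count}.
    Returns, per position, a sorted k-tuple of cluster assignments."""
    # Candidate states at each position: multisets of k clusters seen there.
    states = [list(combinations_with_replacement(sorted(cov), k))
              for cov in coverage]

    def fit_cost(state, cov):
        # Penalize assignments that deviate from each cluster's share of
        # the read coverage (its "deserved" number of haplotypes).
        total = sum(cov.values())
        return sum(abs(state.count(c) - k * n / total)
                   for c, n in cov.items())

    def switches(a, b):
        # How many haplotypes must change cluster between two positions.
        a, b = list(a), list(b)
        for x in list(a):
            if x in b:
                a.remove(x)
                b.remove(x)
        return len(a)

    # DP over positions; back[p][s] = best predecessor of state s.
    best = {s: fit_cost(s, coverage[0]) for s in states[0]}
    back = [{}]
    for p in range(1, len(coverage)):
        new_best, new_back = {}, {}
        for s in states[p]:
            prev, total = min(((t, best[t] + switch_cost * switches(t, s))
                               for t in best), key=lambda x: x[1])
            new_best[s] = fit_cost(s, coverage[p]) + total
            new_back[s] = prev
        best = new_best
        back.append(new_back)

    # Backtrace the cheapest threading.
    state = min(best, key=best.get)
    path = [state]
    for p in range(len(coverage) - 1, 0, -1):
        state = back[p][state]
        path.append(state)
    return path[::-1]

# Toy tetraploid example (k=4): cluster 1 vanishes at position 2, as in
# a collapsed region; the threading stays stable around it.
cov = [{0: 10, 1: 10}, {0: 12, 1: 8}, {0: 20}, {0: 11, 1: 9}]
print(thread_haplotypes(cov, k=4))  # one k-tuple per position
```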

2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Qiang Yu ◽  
Hongwei Huo ◽  
Dazheng Feng

Identifying conserved patterns in DNA sequences, namely motif discovery, is an important and challenging computational task. Containing hundreds or more sequences, high-throughput sequencing data sets can improve the identification accuracy of motif discovery, but they demand even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed; it can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
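A naive reference version of the pair-extraction step described above might look as follows; the quadratic enumeration here is exactly what PairMotifChIP's rapid extraction method is designed to avoid, and the parameter values are illustrative.

```python
# Naive O(n^2) extraction of l-mer pairs with small Hamming distance,
# shown as a reference for the task PairMotifChIP speeds up.
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def extract_lmers(sequences, l):
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            yield seq[i:i + l]

def similar_pairs(sequences, l, max_dist):
    lmers = set(extract_lmers(sequences, l))
    return [(a, b) for a, b in combinations(sorted(lmers), 2)
            if hamming(a, b) <= max_dist]

# Toy input: every printed pair differs in at most one position.
print(similar_pairs(["ACGTACGT", "ACGAACGT"], l=5, max_dist=1))
```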


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Sven D. Schrinner ◽  
Rebecca Serra Mari ◽  
Jana Ebler ◽  
Mikko Rautiainen ◽  
Lancelot Seillier ◽  
...  

Abstract Resolving genomes at the haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WhatsHap polyphase, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state of the art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open-source tool WhatsHap.


Risks ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 33
Author(s):  
Łukasz Delong ◽  
Mario V. Wüthrich

The goal of this paper is to develop regression models and postulate distributions which can be used in practice to describe the joint development process of individual claim payments and claim incurred. We apply neural networks to estimate our regression models. As regressors we use the whole claim history of incremental payments and claim incurred, as well as any relevant feature information available to describe individual claims and their development characteristics. Our models are calibrated and tested on a real data set, and the results are benchmarked against the Chain-Ladder method. Our analysis focuses on the development of the so-called Reported But Not Settled (RBNS) claims. We show the benefits of using deep neural networks and the whole claim history in our prediction problem.
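As a minimal sketch of the regression setup, the following trains a small feed-forward network to predict the next incremental payment from a hypothetical claim history; the features, data, and network size are invented for illustration and are much simpler than the models calibrated in the paper.

```python
# Minimal sketch: predict the next incremental payment of an RBNS claim
# from its payment/incurred history plus claim features. All data and
# feature names below are hypothetical.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 2000
# Hypothetical regressors: payments P_0..P_2, incurred I_0..I_2, and one
# numeric claim feature (categorical features would be one-hot encoded).
X = rng.gamma(shape=2.0, scale=1.0, size=(n, 7))
# Hypothetical target with some nonlinear signal plus noise.
y = 0.5 * X[:, 2] + 0.3 * np.sqrt(X[:, 5]) + rng.normal(0, 0.1, n)

net = MLPRegressor(hidden_layer_sizes=(20, 10), max_iter=2000,
                   random_state=0)
net.fit(X[:1500], y[:1500])
print("out-of-sample R^2:", net.score(X[1500:], y[1500:]))
```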


2018 ◽  
Vol 28 (8) ◽  
pp. 2418-2438
Author(s):  
Xi Shen ◽  
Chang-Xing Ma ◽  
Kam C Yuen ◽  
Guo-Liang Tian

Bilateral correlated data are often encountered in medical research, for example in ophthalmologic (or otolaryngologic) studies, in which each unit contributes information from paired organs to the data analysis and the measurements from such paired organs are generally highly correlated. Various statistical methods have been developed to handle intra-class correlation in bilateral correlated data analysis. In practice, it is very important to adjust for confounding effects in statistical inference, since ignoring either the intra-class correlation or the confounding effect may lead to biased results. In this article, we propose three approaches for testing the common risk difference for stratified bilateral correlated data under the assumption of equal correlation. Five confidence intervals for the common difference of two proportions are derived. The performance of the proposed test methods and confidence interval estimations is evaluated by Monte Carlo simulations. The simulation results show that the score test statistic outperforms the other statistics in the sense that it has robust type I error rates with high power. The score confidence interval induced from the score test statistic performs satisfactorily in terms of coverage probabilities with reasonable interval widths. A real data set from an otolaryngologic study is used to illustrate the proposed methodologies.
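To illustrate the kind of Monte Carlo evaluation described above, the sketch below estimates the empirical type I error of the ordinary two-sample score (pooled z) test of a risk difference; it deliberately omits the stratification and intra-class correlation that the paper's test statistics handle.

```python
# Monte Carlo check of empirical type I error for the ordinary
# two-sample score test of a risk difference (pooled z-test). The
# paper's stratified, intra-class-correlated setting is simplified away.
import numpy as np
from scipy.stats import norm

def score_test_p(x1, n1, x2, n2):
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * norm.sf(abs(z))

rng = np.random.default_rng(1)
p, n1, n2, alpha, reps = 0.3, 50, 50, 0.05, 20000
rejections = sum(
    score_test_p(rng.binomial(n1, p), n1, rng.binomial(n2, p), n2) < alpha
    for _ in range(reps)
)
print("empirical type I error:", rejections / reps)  # should be near 0.05
```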


Genes ◽  
2018 ◽  
Vol 9 (10) ◽  
pp. 506 ◽  
Author(s):  
Catarina Branco ◽  
Miguel Arenas

Selecting among alternative scenarios of human evolution is nowadays a common methodology to investigate the history of our species. This strategy is usually based on computer simulations of genetic data under different evolutionary scenarios, followed by a fitting of the simulated data to the real data. A recent trend in the investigation of ancestral evolutionary processes of modern humans is the application of genetic gradients as a measure of fit, since evolutionary processes such as range expansions, range contractions, and population admixture (among others) can lead to different genetic gradients. In addition, this strategy allows the analysis of the genetic causes of the observed genetic gradients. Here, we review recent findings on the selection among alternative scenarios of human evolution based on simulated genetic gradients, including pros and cons. First, we describe common methodologies to simulate genetic gradients and apply them to select among alternative scenarios of human evolution. Next, we review previous studies on the influence of range expansions, population admixture, the last glacial period, and migration with long-distance dispersal on genetic gradients for some regions of the world. Finally, we discuss this analytical approach, including technical limitations, required improvements, and advice. Although we focus here on human evolution, this approach could be extended to study other species.
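As a toy illustration of how a range expansion produces a genetic gradient, the following simulates serial founder effects along a one-dimensional axis and quantifies the correlation between geography and allele frequency; the drift model and parameters are crude stand-ins for the fuller simulators used in this literature.

```python
# Toy genetic gradient: a 1-D serial-founder (range expansion) model in
# which each new deme is founded by a small sample, so allele
# frequencies drift and come to correlate with geography.
import numpy as np

rng = np.random.default_rng(2)
n_demes, n_loci, founders = 30, 200, 20

freqs = rng.uniform(0.2, 0.8, n_loci)   # source population
gradient = [freqs]
for _ in range(n_demes - 1):
    # Founding bottleneck: binomial sampling of 2N founder alleles.
    freqs = rng.binomial(2 * founders, freqs) / (2 * founders)
    gradient.append(freqs)
gradient = np.array(gradient)           # demes x loci

# Gradient strength: mean |correlation| between deme position and
# per-locus allele frequency along the expansion axis.
pos = np.arange(n_demes)
r = [abs(np.corrcoef(pos, gradient[:, j])[0, 1]) for j in range(n_loci)]
print("mean |r| along the expansion axis:", round(float(np.mean(r)), 3))
```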


2019 ◽  
Vol 2 (1) ◽  
Author(s):  
Camille Marchet ◽  
Pierre Morisse ◽  
Lolita Lecompte ◽  
Arnaud Lefebvre ◽  
Thierry Lecroq ◽  
...  

Abstract The error rates of third-generation sequencing data have been capped above 5% and consist mainly of insertions and deletions. Consequently, an increasing number of diverse long-read correction methods have been proposed. The quality of the correction has huge impacts on downstream processes, so developing methods to evaluate error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to assess the quality of the corrected reads; thus, key features must allow the fast comparison of different tools and scale to the increasing length of the long reads. Our tool, ELECTOR, evaluates long-read correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that produces reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state of the art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.
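To make the evaluated metric concrete, the sketch below computes per-read error rates against the true reference region with a plain O(nm) edit distance; this quadratic approach is precisely what does not scale to ultra-long reads, which is why ELECTOR's segmented multiple sequence alignment strategy matters. The sequences are toy values.

```python
# Measure how much correction reduces a read's error rate against the
# true reference region, using plain Levenshtein distance (rolling-row
# DP). Shown only to make the metric concrete; it does not scale.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def error_rate(read, truth):
    return edit_distance(read, truth) / len(truth)

truth     = "ACGTACGTACGTACGT"
raw       = "ACGTACGGTACTACGT"   # one insertion, one deletion
corrected = "ACGTACGTACTACGT"    # one deletion left after correction
print(error_rate(raw, truth), error_rate(corrected, truth))
```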


2015 ◽  
Vol 44 (5) ◽  
pp. e45-e45 ◽  
Author(s):  
Aaron T.L. Lun ◽  
Gordon K. Smyth

Abstract Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) is widely used to identify binding sites for a target protein in the genome. An important scientific application is to identify changes in protein binding between different treatment conditions, i.e. to detect differential binding (DB). This can reveal potential mechanisms through which changes in binding may contribute to the treatment effect. The csaw package provides a framework for the de novo detection of differentially bound genomic regions. It uses a window-based strategy to summarize read counts across the genome, exploits existing statistical software to test for significant differences in each window, and finally clusters windows into regions for output while properly controlling the false discovery rate over all detected regions. The csaw package can handle arbitrarily complex experimental designs involving biological replicates. It can be applied to both transcription factor and histone mark datasets and, more generally, to any type of sequencing data measuring genomic coverage. csaw performs favorably against existing methods for de novo DB analyses on both simulated and real data. csaw is implemented as an R software package and is freely available from the open-source Bioconductor project.
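The window-based strategy can be sketched in a few lines: count reads per window in two conditions, test each window, adjust with Benjamini-Hochberg, and merge adjacent significant windows into regions. csaw itself uses edgeR's negative binomial framework with replicate-aware designs; the crude Poisson test and the data below are illustrative substitutes.

```python
# Window-based DB detection sketch: per-window test, BH correction,
# merging of adjacent significant windows. Not csaw's actual statistics.
import numpy as np
from scipy.stats import poisson

def bh_adjust(p):
    p = np.asarray(p)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(ranked)
    adjusted[order] = np.minimum(ranked, 1)
    return adjusted

def windows_to_regions(significant, window_size):
    # Merge runs of adjacent significant windows into (start, end) regions.
    regions, start = [], None
    for i, sig in enumerate(significant):
        if sig and start is None:
            start = i
        if not sig and start is not None:
            regions.append((start * window_size, i * window_size))
            start = None
    if start is not None:
        regions.append((start * window_size, len(significant) * window_size))
    return regions

rng = np.random.default_rng(3)
counts_a = rng.poisson(10, 100)
counts_b = rng.poisson(10, 100)
counts_b[40:44] += 40                     # a differentially bound region

# Crude one-sided Poisson test: is count_b surprising under the pooled
# mean? (csaw uses a negative binomial model with replicates instead.)
pvals = poisson.sf(counts_b - 1, (counts_a + counts_b) / 2)
sig = bh_adjust(pvals) < 0.05
print(windows_to_regions(sig, window_size=100))
```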


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a latent trait variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data whose time points are freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: an “autoregressive error degree one” structure for the trait residuals and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo (MCMC), and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.
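A worked fragment of the measurement model in such approaches: under a cumulative-logit (graded response) link, an ordinal item's category probabilities are driven by the latent trait. The discrimination, thresholds, and trait values below are illustrative; the paper fits the full hierarchical model by MCMC in WinBUGS.

```python
# Category probabilities of an ordinal item under cumulative logits:
# logit P(Y >= c) = a * theta - kappa_c, for thresholds kappa_1 < ... .
import numpy as np

def ordinal_probs(theta, a, kappa):
    """P(Y = c | theta) for categories c = 0..len(kappa)."""
    kappa = np.asarray(kappa)
    p_ge = 1 / (1 + np.exp(-(a * theta - kappa)))  # P(Y >= c), c = 1..C-1
    upper = np.concatenate(([1.0], p_ge))
    lower = np.concatenate((p_ge, [0.0]))
    return upper - lower

# A trait growing over time shifts probability mass to higher categories.
for theta in (-1.0, 0.0, 1.0):
    print(theta, ordinal_probs(theta, a=1.2, kappa=[-0.8, 0.5, 1.5]).round(3))
```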


2015 ◽  
Vol 2015 ◽  
pp. 1-5 ◽  
Author(s):  
Yuxiang Tan ◽  
Yann Tambouret ◽  
Stefano Monti

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generating an unlimited number of true and false positive events, allowing robust estimation of accuracy measures such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size, which makes it difficult to systematically evaluate the performance of RNA-Seq-based fusion detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the background for simulated fusions, to closely approximate the distribution of reads from a real sequencing library, and uses a reference genome as the template from which to simulate fusion-supporting reads. To assess performance as a function of support, SimFuse generates multiple datasets with various numbers of fusion-supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting-read features and the sample size of the simulated library, from which the performance metrics needed for the validation and comparison of alternative fusion detection algorithms can be rigorously estimated.
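The core simulation step can be sketched as follows: splice the 5' portion of one gene onto the 3' portion of another and emit a chosen number of reads that straddle the junction. SimFuse additionally draws realistic read backgrounds from a real library; the function name, gene sequences, and sizes here are toy assumptions.

```python
# Toy fusion-read simulator: build a fused transcript and sample reads
# that are forced to overlap the fusion junction.
import random

def simulate_fusion_reads(gene_a, gene_b, break_a, break_b,
                          n_reads, read_len, seed=0):
    rng = random.Random(seed)
    fusion = gene_a[:break_a] + gene_b[break_b:]
    junction = break_a
    reads = []
    for _ in range(n_reads):
        # Every read overlaps the junction by at least 5 bp on each side.
        start = rng.randint(max(0, junction - read_len + 5), junction - 5)
        reads.append(fusion[start:start + read_len])
    return reads

gene_a = "A" * 60 + "ACGTACGT" * 5    # toy sequences
gene_b = "TTTTGGGG" * 10
for r in simulate_fusion_reads(gene_a, gene_b, break_a=80, break_b=20,
                               n_reads=3, read_len=30):
    print(r)
```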


2009 ◽  
Vol 07 (04) ◽  
pp. 701-716 ◽  
Author(s):  
Henry Chi-Ming Leung ◽  
Man-Hung Siu ◽  
Siu-Ming Yiu ◽  
Francis Yuk-Lun Chin ◽  
Ken Wing-Kin Sung

Predicting motif pairs from a set of protein sequences based on protein–protein interaction data is an important but difficult computational problem. Tan et al. proposed a solution to this problem, but the scoring function (based on χ² testing) used in their approach is not adequate, and their approach does not scale: it may take days to process a set of 5000 protein sequences with about 20,000 interactions. Later, Leung et al. proposed an improved scoring function and faster algorithms for the same problem, but their model is complicated; the exact value of the scoring function is not easy to compute, so an estimated value is used in practice. In this paper, we derive a better model to capture the significance of a given motif pair based on a clustering notion, and we develop a fast heuristic algorithm to solve the problem. The algorithm is able to locate the correct motif pair in the yeast data set in about 45 minutes for 5000 protein sequences and 20,000 interactions. Moreover, we derive a lower bound on the p-value of a motif pair in order for it to be distinguishable from random motif pairs; this lower bound result has been verified using simulated data sets.
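For reference, the χ² baseline that the paper argues is inadequate can be made concrete as a co-occurrence test over interacting protein pairs; the contingency counts below are invented for illustration.

```python
# Chi-squared co-occurrence score for a candidate motif pair: do the two
# motifs appear together across interacting protein pairs more often
# than chance? Counts are invented for illustration.
from scipy.stats import chi2_contingency

# Rows: motif A present / absent in the first protein of a pair.
# Cols: motif B present / absent in the second protein,
# tallied over all interacting protein pairs.
table = [[40, 60],    # A present:  B present, B absent
         [30, 370]]   # A absent:   B present, B absent
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
```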

