Summarizing the solution space in tumor phylogeny inference by multiple consensus trees

Nuraini Aguse; Yuanyuan Qi; Mohammed El-Kebir

doi:10.1093/bioinformatics/btz312

Bioinformatics ◽

10.1093/bioinformatics/btz312 ◽

2019 ◽

Vol 35 (14) ◽

pp. i408-i416 ◽

Cited By ~ 12

Author(s):

Nuraini Aguse ◽

Yuanyuan Qi ◽

Mohammed El-Kebir

Keyword(s):

Solution Space ◽

Simulated Data ◽

Exact Algorithm ◽

Real Data ◽

Supplementary Information ◽

Mixed Integer ◽

Consensus Tree ◽

Large Solution ◽

Consensus Trees ◽

Topological Features

Abstract. Motivation: Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees. Results: We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data in a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T. Availability and implementation: https://github.com/elkebir-group/MCT. Supplementary information: Supplementary data are available at Bioinformatics online.
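The idea of clustering a set of trees and picking one representative per cluster can be illustrated with a small sketch. This is not the MILP or the heuristic from the MCT repository. It assumes each tree is given as a set of (parent, child) edges, measures distance as the number of edges found in only one of the two trees, and uses the most central input tree of each cluster (a medoid) as a stand-in for a true consensus tree. The names `parent_child_distance`, `medoid` and `cluster_trees` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]   # (parent, child)
Tree = FrozenSet[Edge]   # a tree represented only by its edge set (assumption)


def parent_child_distance(t1: Tree, t2: Tree) -> int:
    """Number of edges that occur in exactly one of the two trees."""
    return len(t1 ^ t2)


def medoid(trees: List[Tree]) -> Tree:
    """Input tree with the smallest total distance to the others.

    Used here as a simple stand-in for a consensus tree (assumption).
    """
    return min(trees, key=lambda c: sum(parent_child_distance(c, t) for t in trees))


def cluster_trees(trees: List[Tree], k: int, max_iter: int = 100):
    """Alternate between assigning trees to the nearest representative
    and re-picking each cluster's representative, until nothing changes."""
    reps = list(dict.fromkeys(trees))[:k]  # first k distinct trees as a start
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(reps)), key=lambda j: parent_child_distance(t, reps[j]))
            for t in trees
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(len(reps)):
            members = [t for t, a in zip(trees, assignment) if a == j]
            if members:
                reps[j] = medoid(members)
    return assignment, reps


# Four toy trees over mutations a, b, c with root r: two similar pairs.
A = frozenset({("r", "a"), ("a", "b"), ("a", "c")})
A2 = frozenset({("r", "a"), ("a", "b"), ("b", "c")})
B = frozenset({("r", "c"), ("c", "b"), ("b", "a")})
B2 = frozenset({("r", "c"), ("c", "b"), ("c", "a")})
assignment, reps = cluster_trees([A, B, A2, B2], k=2)
```

In this toy input, A and A2 differ by two edges, as do B and B2, while trees from different pairs share no edges, so the two pairs end up in separate clusters. Unlike this sketch, the method in the abstract also selects the number of clusters.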

miqoGraph: fitting admixture graphs using mixed-integer quadratic optimization

Bioinformatics ◽

10.1093/bioinformatics/btaa988 ◽

2020 ◽

Author(s):

Julia Yan ◽

Nick Patterson ◽

Vagheesh M Narasimhan

Keyword(s):

Genetic Relationship ◽

Real Data ◽

Quadratic Optimization ◽

Supplementary Information ◽

Mixed Integer ◽

Supplementary Data ◽

Integer Optimization ◽

Speed Up

Abstract. Summary: Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this article, we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths and admixture proportions simultaneously. Through applications of miqoGraph to both simulated and real data, we show that integer optimization can greatly speed up and automate what is usually an arduous manual process. Availability and implementation: https://github.com/juliayyan/PhylogeneticTrees.jl. Supplementary information: Supplementary data are available at Bioinformatics online.

Efficient airplane arrival scheduling using a set partitioning-based branch-and-price method

Proceedings of the Institution of Mechanical Engineers Part G Journal of Aerospace Engineering ◽

10.1177/0954410017718566 ◽

2017 ◽

Vol 232 (16) ◽

pp. 2939-2951

Author(s):

Jae-Hoon Song ◽

Han-Lim Choi

Keyword(s):

Large Scale ◽

Time Window ◽

Heuristic Method ◽

Solution Space ◽

Exact Algorithm ◽

Set Partitioning ◽

Mixed Integer ◽

Branch And Price ◽

Mixed Integer Linear Program ◽

Public Data

This article presents an exact algorithm that is combined with a heuristic method to find the optimal solution for an airplane landing problem. For a given set of airplanes and runways, the objective is to minimize the accumulated deviations from the target landing time of the airplanes. A cost associated with landing either earlier or later than the target landing time is incurred for each airplane within its predetermined time window. In order to manage this type of large-scale optimization problem, a set partitioning formulation that results in a mixed integer linear program is proposed. One key contribution of this article is the development of a branch-and-price methodology, in which the column generation method is integrated with the branch-and-bound method in order to find the optimal integer solution. In addition to the exact algorithm, a simple heuristic method is also presented to tighten the solution space. Numerical experiments are undertaken for the proposed algorithm in order to confirm its effectiveness using public data from the OR-Library. As an application in the real-world situation of airplane landing, air traffic data from Incheon International Airport is employed to assure the efficiency of the proposed algorithm.
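The objective described above, accumulated deviation from each airplane's target landing time within its time window, can be written down as a short sketch. The abstract does not give the cost function, so the linear per-unit penalties for early and late landings are an assumption, as are the names `Plane`, `landing_cost` and `schedule_cost`. Runway assignment and separation constraints between consecutive landings are not modeled.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Plane:
    earliest: float    # start of the allowed landing window
    target: float      # preferred landing time
    latest: float      # end of the allowed landing window
    early_cost: float  # penalty per time unit before target (assumed linear)
    late_cost: float   # penalty per time unit after target (assumed linear)


def landing_cost(plane: Plane, t: float) -> float:
    """Penalty for landing `plane` at time `t`; `t` must lie in the window."""
    if not plane.earliest <= t <= plane.latest:
        raise ValueError("landing time outside the time window")
    if t < plane.target:
        return plane.early_cost * (plane.target - t)
    return plane.late_cost * (t - plane.target)


def schedule_cost(planes: Sequence[Plane], times: Sequence[float]) -> float:
    """Total deviation cost of a schedule, one landing time per plane."""
    return sum(landing_cost(p, t) for p, t in zip(planes, times))


planes = [Plane(100, 120, 150, 2.0, 3.0), Plane(110, 130, 160, 2.0, 3.0)]
total = schedule_cost(planes, [110, 140])  # one early landing, one late landing
```

An exact method such as the branch-and-price scheme in the abstract would search over landing times and runway assignments to minimize this total; the sketch only evaluates a given schedule.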

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Rna Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract. Motivation: Single cell RNA-seq (scRNA-seq) count data differ in many ways from bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, and they often show poor or undemonstrated similarity with real data. Results: In this article we present SPARSim, an scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, count matrices simulated by SPARSim closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.

Joint detection of germline and somatic copy number events in matched tumor–normal sample pairs

Bioinformatics ◽

10.1093/bioinformatics/btz429 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4955-4961

Author(s):

Yongzhuang Liu ◽

Jian Liu ◽

Yadong Wang

Keyword(s):

Copy Number ◽

Simulated Data ◽

Real Data ◽

Copy Number Variations ◽

Superior Performance ◽

Supplementary Information ◽

Normal Sample ◽

Joint Detection ◽

Novel Approach ◽

Powerful Approach

Abstract. Motivation: Whole-genome sequencing (WGS) of tumor–normal sample pairs is a powerful approach for comprehensively characterizing germline copy number variations (CNVs) and somatic copy number alterations (SCNAs) in cancer research and clinical practice. Existing computational approaches for detecting copy number events cannot detect germline CNVs and SCNAs simultaneously, and yield low accuracy for SCNAs. Results: In this study, we developed TumorCNV, a novel approach for jointly detecting germline CNVs and SCNAs from WGS data of the matched tumor–normal sample pair. We compared TumorCNV with existing copy number event detection approaches using simulated data and real data for the COLO-829 melanoma cell line. The experimental results showed that TumorCNV outperformed existing approaches. Availability and implementation: The software TumorCNV is implemented using a combination of Java and R, and it is freely available at https://github.com/yongzhuang/TumorCNV. Supplementary information: Supplementary data are available at Bioinformatics online.

EFFICIENT INFERENCE OF HAPLOTYPES FROM GENOTYPES ON A PEDIGREE

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720003000204 ◽

2003 ◽

Vol 01 (01) ◽

pp. 41-69 ◽

Cited By ~ 54

Author(s):

JING LI ◽

TAO JIANG

Keyword(s):

Large Scale ◽

Gaussian Elimination ◽

Linear Equations ◽

Simulated Data ◽

Exact Algorithm ◽

Real Data ◽

Haplotype Reconstruction ◽

Pedigree Data ◽

Simple Method ◽

Complexity Result

We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the problem to our knowledge. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with a large number of markers, especially for those data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2. By using a simple method based on Gaussian elimination, we could obtain all possible feasible haplotype configurations. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large scale haplotype inference projects.
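Solving a system of linear equations over Z2 by Gaussian elimination, the core step mentioned above, can be sketched as follows. This is a generic illustration, not the PedPhase implementation. How the Mendelian constraints are turned into equations is not described in the abstract and is not modeled here. The function name `solve_gf2` is hypothetical, and it returns a single solution with free variables set to 0 rather than enumerating all feasible configurations.

```python
from typing import List, Optional


def solve_gf2(rows: List[List[int]], rhs: List[int]) -> Optional[List[int]]:
    """Solve A x = b over Z2 (all entries 0 or 1, addition is XOR).

    Returns one solution with free variables set to 0,
    or None if the system is inconsistent.
    """
    n_vars = len(rows[0]) if rows else 0
    aug = [list(r) + [b] for r, b in zip(rows, rhs)]  # augmented matrix [A | b]
    pivot_cols = []
    r = 0  # index of the next pivot row
    for c in range(n_vars):
        pivot = next((i for i in range(r, len(aug)) if aug[i][c]), None)
        if pivot is None:
            continue  # no pivot in this column: x[c] is a free variable
        aug[r], aug[pivot] = aug[pivot], aug[r]
        # Clear column c in every other row (reduced row echelon form).
        for i in range(len(aug)):
            if i != r and aug[i][c]:
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[r])]
        pivot_cols.append(c)
        r += 1
    # A row reading 0 = 1 means the constraints cannot all hold.
    if any(not any(row[:-1]) and row[-1] for row in aug):
        return None
    x = [0] * n_vars
    for i, c in enumerate(pivot_cols):
        x[c] = aug[i][-1]
    return x


# x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 (mod 2)
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
b = [1, 0, 1]
solution = solve_gf2(A, b)
```

A consistent system with n variables and rank r has 2^(n - r) solutions, obtained by trying every 0/1 setting of the free variables. This matches the abstract's remark that all feasible configurations can be recovered from the elimination.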

flexiMAP: a regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa854 ◽

2020 ◽

Author(s):

Krzysztof J Szkop ◽

David S Moss ◽

Irene Nobeli

Keyword(s):

Simulated Data ◽

Alternative Polyadenylation ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Beta Regression ◽

Rna Seq ◽

Good Balance ◽

Flexible Modeling ◽

Specificity And Sensitivity

Abstract. Motivation: We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results: We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation: The flexiMAP R package is available at https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at https://doi.org/10.5281/zenodo.3689788. Supplementary information: Supplementary data are available at Bioinformatics online.

HIVID2: an accurate tool to detect virus integrations in the host genome

Bioinformatics ◽

10.1093/bioinformatics/btab031 ◽

2021 ◽

Author(s):

Xi Zeng ◽

Linghao Zhao ◽

Chenhang Shen ◽

Yi Zhou ◽

Guoliang Li ◽

...

Keyword(s):

Simulated Data ◽

Real Data ◽

Host Genome ◽

Supplementary Information ◽

Future Research ◽

Specificity And Sensitivity ◽

Novel Method ◽

Advanced Analysis ◽

Virus Integration ◽

Better Than

Abstract. Motivation: Virus integration in the host genome is frequently reported to be closely associated with many human diseases, and the detection of virus integration is a critically challenging task. However, most existing tools show limited specificity and sensitivity. Therefore, the objective of this study is to develop a method for accurate detection of virus integration into host genomes. Results: Herein, we report a novel method termed HIVID2 that is a significant upgrade of HIVID. HIVID2 performs a paired-end combination (PE-combination) for potentially integrated reads. The resulting sequences are then remapped onto the reference genomes, and both split and discordant chimeric reads are used to identify accurate integration breakpoints with high confidence. HIVID2 represents a great improvement in specificity and sensitivity, and predicts breakpoints closer to the real integrations, compared with existing methods. The advantage of our method was demonstrated using both simulated and real datasets. HIVID2 uncovered novel integration breakpoints in well-known cervical cancer-related genes, including FHIT and LRP1B, which was verified using protein expression data. In addition, HIVID2 allows the user to decide whether to automatically perform advanced analysis using the identified virus integrations. By analyzing the simulated data and real data tests, we demonstrated that HIVID2 is not only more accurate than HIVID but also better than other existing programs with respect to both sensitivity and specificity. We believe that HIVID2 will help in enhancing future research associated with virus integration. Availability and implementation: HIVID2 can be accessed at https://github.com/zengxi-hada/HIVID2/. Supplementary information: Supplementary data are available at Bioinformatics online.

Deconvoluting the diversity of within-host pathogen strains in a multi-locus sequence typing framework

BMC Bioinformatics ◽

10.1186/s12859-019-3204-8 ◽

2019 ◽

Vol 20 (S20) ◽

Cited By ~ 1

Author(s):

Guo Liang Gan ◽

Elijah Willie ◽

Cedric Chauve ◽

Leonid Chindelevitch

Keyword(s):

Borrelia Burgdorferi ◽

Disease Transmission ◽

Bacterial Pathogen ◽

Simulated Data ◽

Real Data ◽

Genomic Diversity ◽

Mixed Integer ◽

Data Set ◽

Mlst Scheme ◽

Host Pathogen

Abstract. Background: Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. Results: We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First, we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. Conclusions: Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.

Sampling and summarizing transmission trees with multi-strain infections

Bioinformatics ◽

10.1093/bioinformatics/btaa438 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i362-i370

Author(s):

Palash Sashittal ◽

Mohammed El-Kebir

Keyword(s):

Computational Methods ◽

Hiv Transmission ◽

Simulated Data ◽

Real Data ◽

Epidemiological Data ◽

Pathogen Transmission ◽

Supplementary Information ◽

Pathogen Diversity ◽

Tree Inference ◽

Tree Approach

Abstract. Motivation: The combination of genomic and epidemiological data holds the potential to enable accurate pathogen transmission history inference. However, the inference of outbreak transmission histories remains challenging due to various factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, often failing to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data. Results: We formulate the direct transmission inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting versions of the DTI problem. We introduce Transmission Tree Uniform Sampler (TiTUS), a method that uses SATISFIABILITY to almost uniformly sample from the space of transmission trees. We introduce criteria that prioritize parsimonious transmission trees that we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS’s ability to accurately reconstruct transmission trees on simulated data as well as a documented HIV transmission chain. Availability and implementation: https://github.com/elkebir-group/TiTUS. Supplementary information: Supplementary data are available at Bioinformatics online.

Gapsplit: efficient random sampling for non-convex constraint-based models

Bioinformatics ◽

10.1093/bioinformatics/btz971 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2623-2625 ◽

Cited By ~ 1

Author(s):

Thomas C Keaty ◽

Paul A Jensen

Keyword(s):

Random Sampling ◽

Linear Models ◽

Source Code ◽

Solution Space ◽

Supplementary Information ◽

Mixed Integer ◽

Supplementary Data ◽

Convex Constraint ◽

Random Samples ◽

Constraint Based Models

Abstract. Summary: Gapsplit generates random samples from convex and non-convex constraint-based models by targeting under-sampled regions of the solution space. Gapsplit provides uniform coverage of linear, mixed-integer and general non-linear models. Availability and implementation: Python and Matlab source code are freely available at http://jensenlab.net/tools. Supplementary information: Supplementary data are available at Bioinformatics online.
