miqoGraph: fitting admixture graphs using mixed-integer quadratic optimization

Bioinformatics ◽

10.1093/bioinformatics/btaa988 ◽

2020 ◽

Author(s):

Julia Yan ◽

Nick Patterson ◽

Vagheesh M Narasimhan

Keyword(s):

Genetic Relationship ◽

Real Data ◽

Quadratic Optimization ◽

Supplementary Information ◽

Mixed Integer ◽

Supplementary Data ◽

Integer Optimization ◽

Speed Up

Abstract Summary Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this article, we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths and admixture proportions simultaneously. Through applications of miqoGraph to both simulated and real data, we show that integer optimization can greatly speed up and automate what is usually an arduous manual process. Availability and implementation https://github.com/juliayyan/PhylogeneticTrees.jl. Supplementary information Supplementary data are available at Bioinformatics online.


miqoGraph: Fitting admixture graphs using mixed-integer quadratic optimization

10.1101/801548 ◽

2019 ◽

Author(s):

Julia Yan ◽

Nick Patterson ◽

Vagheesh Narasimhan

Keyword(s):

Genetic Relationship ◽

Quadratic Optimization ◽

Mixed Integer ◽

Integer Optimization ◽

Link Type

Abstract Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this paper we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths, and admixture proportions simultaneously. Inference of topology is particularly powerful, with integer optimization automating what is usually an arduous manual process. Availability https://github.com/juliayyan/[email protected]


Summarizing the solution space in tumor phylogeny inference by multiple consensus trees

Bioinformatics ◽

10.1093/bioinformatics/btz312 ◽

2019 ◽

Vol 35 (14) ◽

pp. i408-i416 ◽

Cited By ~ 12

Author(s):

Nuraini Aguse ◽

Yuanyuan Qi ◽

Mohammed El-Kebir

Keyword(s):

Solution Space ◽

Simulated Data ◽

Exact Algorithm ◽

Real Data ◽

Supplementary Information ◽

Mixed Integer ◽

Consensus Tree ◽

Large Solution ◽

Consensus Trees ◽

Topological Features

Abstract Motivation Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees. Results We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data at a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T. Availability and implementation https://github.com/elkebir-group/MCT. Supplementary information Supplementary data are available at Bioinformatics online.


LipidFinder 2.0: advanced informatics pipeline for lipidomics discovery applications

Bioinformatics ◽

10.1093/bioinformatics/btaa856 ◽

2020 ◽

Author(s):

Jorge Alvarez-Jarreta ◽

Patricia R S Rodrigues ◽

Eoin Fahy ◽

Anne O’Connor ◽

Anna Price ◽

...

Keyword(s):

Real Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scatter Plot ◽

Lipid Profiling ◽

False Discovery ◽

False Discovery Rate Method ◽

Rate Method ◽

Assess Data Quality ◽

Lipid Structures

Abstract Summary We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate method, utilizing a target–decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, and a lipid category scatter plot with a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools. Availability and implementation LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/lipidfinder. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Analysis of variance when both input and output sets are high-dimensional

10.1101/2020.02.15.950949 ◽

2020 ◽

Author(s):

Gustavo de los Campos ◽

Torsten Pook ◽

Agustin Gonzalez-Raymundez ◽

Henner Simianer ◽

George Mias ◽

...

Keyword(s):

Gene Expression ◽

Linear Span ◽

Copy Number Variants ◽

Real Data ◽

Supplementary Information ◽

High Dimensional ◽

Supplementary Data ◽

Random Effects Models ◽

Input And Output ◽

Data Layers

Abstract Motivation Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. Results We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each of the methods, and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate the insight that can be obtained with the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number-variants in data from breast cancer tumors. Availability The Supplementary data includes an R-implementation of each of the proposed methods as well as the scripts used in simulations and in the real-data [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Triplet-based similarity score for fully multilabeled trees with poly-occurring labels

Bioinformatics ◽

10.1093/bioinformatics/btaa676 ◽

2020 ◽

Author(s):

Simone Ciccolella ◽

Giulia Bernardini ◽

Luca Denti ◽

Paola Bonizzoni ◽

Marco Previtali ◽

...

Keyword(s):

Open Source ◽

Evolutionary History ◽

Similarity Measures ◽

Real Data ◽

Similarity Score ◽

Supplementary Information ◽

Supplementary Data ◽

Wide Range ◽

Golden Standard ◽

History Of

Abstract Motivation The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. Results To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to correctly classify similar and dissimilar trees, both on simulated and on real data. Availability and implementation An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. Supplementary information Supplementary data are available at Bioinformatics online.


Gapsplit: efficient random sampling for non-convex constraint-based models

Bioinformatics ◽

10.1093/bioinformatics/btz971 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2623-2625 ◽

Cited By ~ 1

Author(s):

Thomas C Keaty ◽

Paul A Jensen

Keyword(s):

Random Sampling ◽

Linear Models ◽

Source Code ◽

Solution Space ◽

Supplementary Information ◽

Mixed Integer ◽

Supplementary Data ◽

Convex Constraint ◽

Random Samples ◽

Constraint Based Models

Abstract Summary Gapsplit generates random samples from convex and non-convex constraint-based models by targeting under-sampled regions of the solution space. Gapsplit provides uniform coverage of linear, mixed-integer and general non-linear models. Availability and implementation Python and Matlab source code are freely available at http://jensenlab.net/tools. Supplementary information Supplementary data are available at Bioinformatics online.


SMART: SuperMaximal approximate repeats tool

Bioinformatics ◽

10.1093/bioinformatics/btz953 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2589-2591

Author(s):

Lorraine A K Ayad ◽

Panagiotis Charalampopoulos ◽

Solon P Pissis

Keyword(s):

State Of The Art ◽

Input Sequence ◽

The State ◽

Supplementary Information ◽

Greedy Heuristics ◽

Supplementary Data ◽

Repeat Analysis ◽

Analysis Tools ◽

Speed Up

Abstract Summary State-of-the-art repeat analysis tools rely on extending maximal repeated pairs to enumerate maximal k-mismatch repeats. These pairs can be quadratic in n, the length of the input sequence, and thus greedy heuristics are applied to speed up the extension. Here, we introduce supermaximal k-mismatch repeats, which are linear in n and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We present SMART, a tool based on recent algorithmic advances implemented in C++ to compute supermaximal k-mismatch repeats directly, and show that these elements are statistically much more significant than the output of the state-of-the-art. Availability and implementation http://github.com/lorrainea/smart (GNU GPL v3.0). Supplementary information Supplementary data are available at Bioinformatics online.


LipidFinder 2.0: advanced informatics pipeline for lipidomics discovery applications

10.1101/2020.08.16.250878 ◽

2020 ◽

Author(s):

Jorge Alvarez-Jarreta ◽

Patricia R.S. Rodrigues ◽

Eoin Fahy ◽

Anne O’Connor ◽

Anna Price ◽

...

Keyword(s):

Open Access ◽

Real Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scatter Plot ◽

Lipid Profiling ◽

Link Type ◽

False Discovery ◽

Assess Data Quality ◽

Lipid Structures

Abstract We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate (FDR) method, utilizing a target-decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, and a lipid category scatter plot with a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools. Availability LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.


A new statistic for efficient detection of repetitive sequences

Bioinformatics ◽

10.1093/bioinformatics/btz262 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4596-4606 ◽

Cited By ~ 1

Author(s):

Sijie Chen ◽

Yixin Chen ◽

Fengzhu Sun ◽

Michael S Waterman ◽

Xuegong Zhang

Keyword(s):

Linear Time ◽

Repetitive Sequences ◽

Real Data ◽

Space Complexity ◽

Supplementary Information ◽

Supplementary Data ◽

Efficient Detection ◽

Time And Space Complexity ◽

Multiple Scenarios ◽

Repeat Detection

Abstract Motivation Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. Results Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. Availability and implementation The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. Supplementary information Supplementary data are available at Bioinformatics online.


KmerFinderJS: A client-server method for fast species typing of bacteria over slow Internet connections

10.1101/145284 ◽

2017 ◽

Cited By ~ 2

Author(s):

Jose Luis Bellod Cineros ◽

Ole Lund

Keyword(s):

Bacterial Species ◽

Source Code ◽

Supplementary Information ◽

The Internet ◽

Supplementary Data ◽

Client Server ◽

Link Type ◽

Genome Data ◽

Speed Up ◽

Internet Connections

Abstract Motivation KmerFinder is a program based on k-mer statistics for identifying bacterial species in whole genome data which, as a web server, has been used more than 10,000 times. KmerFinderJS is a development of KmerFinder that benefits from the downsampling of data by the prefix filtering used in KmerFinder, to minimize the amount of data that needs to be transferred between the client and the server. Results KmerFinderJS replaces the Python-based hash structure for holding the databases with a true key-value database. These improvements are shown to lead to a many-fold speed up of species identification at the internet transfer speeds that are realistic to expect today. It is also shown that the method can find the true content of an artificial metagenomic cocktail with no false positives. Availability The method is freely available at https://cge.cbs.dtu.dk/services/KmerFinderJS/ and as source code at https://bitbucket.org/genomicepidemiology/[email protected] Supplementary information Supplementary data are available at bioRxiv online.
