Learning Sparse Log-Ratios for High-Throughput Sequencing Data

Motivation – The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets. Results – Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable, and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite, and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods. Availability – The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore. Supplementary information – Supplementary data are available at Bioinformatics online.
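
The idea of a continuous relaxation mentioned in the abstract can be illustrated with a minimal sketch. This is not the CoDaCoRe implementation: the function names, the tanh-based soft membership, and the toy data below are all assumptions made for illustration. The hard log-ratio picks two disjoint subsets of variables; the soft version replaces that discrete choice with real-valued weights, so that it is differentiable almost everywhere and could be tuned by gradient descent.

```python
import numpy as np

def hard_log_ratio(x, num_idx, den_idx):
    """Log-ratio of the geometric means of two disjoint subsets of a positive vector x."""
    log_x = np.log(x)
    return log_x[num_idx].mean() - log_x[den_idx].mean()

def soft_log_ratio(x, w):
    """Relaxed log-ratio (illustrative assumption, not the published method).

    Each real-valued weight w[j] is squashed to a membership in [-1, 1]:
    positive values pull variable j toward the numerator, negative values
    toward the denominator, and values near zero leave it out.
    """
    log_x = np.log(x)
    m = np.tanh(w)
    pos = np.maximum(m, 0.0)    # soft numerator membership
    neg = np.maximum(-m, 0.0)   # soft denominator membership
    eps = 1e-12                 # guards against division by zero
    return (pos @ log_x) / (pos.sum() + eps) - (neg @ log_x) / (neg.sum() + eps)

# Toy composition with four parts (hypothetical data).
x = np.array([0.5, 0.2, 0.2, 0.1])

# Saturated weights: part 0 in the numerator, parts 1 and 3 in the denominator.
w = np.array([20.0, -20.0, 0.0, -20.0])

hard = hard_log_ratio(x, [0], [1, 3])
soft = soft_log_ratio(x, w)
print(hard, soft)  # the two values agree up to numerical precision
```

When the weights saturate, the relaxed quantity coincides with a discrete log-ratio, which is why thresholding the learned weights can recover a sparse, interpretable biomarker.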

Learning Sparse Log-Ratios for High-Throughput Sequencing Data

10.1101/2021.02.11.430695 ◽

2021 ◽

Author(s):

Elliott Gordon-Rodriguez ◽

Thomas P. Quinn ◽

John P. Cunningham

Keyword(s):

High Throughput ◽

Latent Variables ◽

High Throughput Sequencing ◽

Compositional Data ◽

Predictive Accuracy ◽

Learning Algorithm ◽

Sequencing Data ◽

Genetic Sequencing ◽

Wide Range ◽

Benchmark Datasets

The automatic discovery of interpretable features that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and compositional data more generally, an important class of features are the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.
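
To make the combinatorial growth concrete, the following worked count may help. It assumes one common form of log-ratio between subsets, namely the log-ratio of geometric means over two disjoint, non-empty index sets; the symbols $B$, $J^{+}$, $J^{-}$, and $N(p)$ are notation introduced here, not taken from the abstract.

```latex
% Log-ratio between two disjoint, non-empty subsets J^+ and J^- of {1, ..., p}
B(x; J^{+}, J^{-}) = \frac{1}{|J^{+}|} \sum_{j \in J^{+}} \log x_j
                   - \frac{1}{|J^{-}|} \sum_{j \in J^{-}} \log x_j

% Each variable is placed in J^+, in J^-, or in neither: 3^p assignments.
% Remove assignments with an empty J^+ or an empty J^- (2^p each, with the
% all-excluded assignment counted twice), then identify (J^+, J^-) with
% (J^-, J^+), since swapping the two sets only flips the sign.
N(p) = \frac{3^{p} - 2^{p+1} + 1}{2}

% Example: N(3) = 6 and N(10) = 28501.
```

The count grows roughly like $3^{p}/2$, so exhaustive search over candidate log-ratios quickly becomes infeasible as the number of input variables increases.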

A metaheuristic algorithm for project selection and scheduling with due windows and limited inventory capacity

Kybernetes ◽

10.1108/k-11-2013-0245 ◽

2014 ◽

Vol 43 (9/10) ◽

pp. 1483-1499 ◽

Cited By ~ 1

Author(s):

Christopher Garcia

Keyword(s):

Combinatorial Optimization ◽

Optimization Problem ◽

Combinatorial Optimization Problem ◽

Greedy Heuristic ◽

Planning Problem ◽

Large Set ◽

Effective Solution ◽

Heavy Industry ◽

Content Type ◽

Wide Range

Purpose – The purpose of this paper is to provide an effective solution for a complex planning problem encountered in heavy industry. The problem entails selecting a set of projects to produce from a larger set of solicited projects and simultaneously scheduling their production to maximize profit. Each project has a due window inside of which, if accepted, it must be shipped. Additionally, there is a limited inventory buffer where lots produced early are stored. Because scheduling affects which projects may be selected and vice versa, this is a particularly difficult combinatorial optimization problem. Design/methodology/approach – The authors develop an algorithm based on the Metaheuristic for Randomized Priority Search (Meta-RaPS) as well as a greedy heuristic and an integer programming (IP) model. The authors then perform computational experiments on a large set of benchmark problems over a wide range of characteristics to compare the performance of each method in terms of solution quality and time required. Findings – The paper shows that this problem is very difficult to solve using IP: even small instances cannot be solved to optimality. The paper then shows that both proposed algorithms often outperform IP by a large margin within seconds. Meta-RaPS is particularly robust, consistently producing the best or very near-best solutions. Practical implications – The Meta-RaPS algorithm developed enables companies facing this problem to achieve higher profits through improved decision making. Moreover, this algorithm is relatively easy to implement. Originality/value – This research provides an effective solution for a difficult combinatorial optimization problem encountered in heavy industry which has not been previously addressed in the literature.

Compositional uncertainty should not be ignored in high-throughput sequencing data analysis

Austrian Journal of Statistics ◽

10.17713/ajs.v45i4.122 ◽

2016 ◽

Vol 45 (4) ◽

pp. 73-87 ◽

Cited By ~ 32

Author(s):

Gregory Brian Gloor ◽

Jean M. Macklaim ◽

Michael Vu ◽

Andrew D. Fernandes

Keyword(s):

Data Analysis ◽

High Throughput ◽

False Positive ◽

High Throughput Sequencing ◽

In Vitro Selection ◽

Compositional Data ◽

Dirichlet Distribution ◽

Compositional Data Analysis ◽

Compositional Approach ◽

Log Ratio

High-throughput sequencing generates sparse compositional data, yet these datasets are rarely analyzed using a compositional approach. In addition, the variation inherent in these datasets is rarely acknowledged, but ignoring it can result in many false positive inferences. We demonstrate that examination of point estimates of the data can result in false positive results, even with appropriate zero replacement approaches, using an in vitro selection dataset with an outside standard of truth. The variation inherent in real high-throughput sequencing datasets is demonstrated, and we show that this variation can be approximated, and hence accounted for, by Monte-Carlo sampling from the Dirichlet distribution. Used by itself, this approximation is problematic, but it becomes useful when coupled with a log-ratio approach commonly used in compositional data analysis. Thus, the approach illustrated here, which merges Bayesian estimation with principles of compositional data analysis, should be generally useful for high-dimensional count compositional data of the type generated by high-throughput sequencing.
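
The combination of Dirichlet Monte-Carlo sampling with a log-ratio transform described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the pseudo-count of 0.5 added to every count, the number of draws, the choice of the centred log-ratio (CLR) as the transform, and the toy counts are all hypothetical.

```python
import numpy as np

def dirichlet_clr_samples(counts, n_draws=128, prior=0.5, seed=0):
    """Draw Monte-Carlo instances of the proportions underlying one sample's
    read counts, then apply the centred log-ratio transform to each draw.

    `prior` is an assumed pseudo-count; it keeps the Dirichlet parameters
    positive for features with zero reads.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    props = rng.dirichlet(counts + prior, size=n_draws)   # shape (n_draws, p)
    log_p = np.log(props)
    # CLR: subtract the mean log-proportion of each draw from its entries.
    return log_p - log_p.mean(axis=1, keepdims=True)

# One hypothetical sample with four features, one of them unobserved.
counts = [120, 30, 0, 850]
clr = dirichlet_clr_samples(counts)

# Per-feature spread across draws: low-count features are far less certain.
print(clr.std(axis=0))
```

Summarising each feature over the draws, rather than from a single point estimate, is what lets the uncertainty of low-count features carry through to downstream inference.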

Solving a Combinatorial Optimization Problem with Feedforward Neural Networks

1993 American Control Conference ◽

10.23919/acc.1993.4793106 ◽

1993 ◽

Author(s):

Xianzhong Cui ◽

Kang G. Shin

Keyword(s):

Neural Networks ◽

Combinatorial Optimization ◽

Optimization Problem ◽

Feedforward Neural Networks ◽

Combinatorial Optimization Problem

Inverse version of the kth maximization combinatorial optimization problem

Can Tho University Journal of Science ◽

10.22144/ctu.jen.2018.027 ◽

2018 ◽

Vol 54(5) ◽

pp. 72

Author(s):

Quoc, H.D. ◽

Kien, N.T. ◽

Thuy, T.T.C. ◽

Hai, L.H. ◽

Thanh, V.N.

Keyword(s):

Combinatorial Optimization ◽

Optimization Problem ◽

Combinatorial Optimization Problem

A combinatorial optimization problem; optimal generalized cycle bases

Computer Methods in Applied Mechanics and Engineering ◽

10.1016/0045-7825(79)90057-4 ◽

1979 ◽

Vol 20 (1) ◽

pp. 39-51 ◽

Cited By ~ 42

Author(s):

A. Kaveh

Keyword(s):

Combinatorial Optimization ◽

Optimization Problem ◽

Combinatorial Optimization Problem ◽

Cycle Bases

An OSGI-MPH Algorithm for Solving Combinatorial Optimization Problem

10.1109/aemcse51986.2021.00088 ◽

2021 ◽

Author(s):

Hongming Dai ◽

Yunjing Li ◽

Xinji Zhou

Keyword(s):

Combinatorial Optimization ◽

Optimization Problem ◽

Combinatorial Optimization Problem

ANALYSIS AND SYNTHESIS OF ENHANCED ANT COLONY OPTIMIZATION WITH THE TRADITIONAL ANT COLONY OPTIMIZATION TO SOLVE TRAVELLING SALES PERSON PROBLEM

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v2i2b.2637 ◽

2011 ◽

Vol 1 (1) ◽

pp. 88-92

Author(s):

Pallavi Arora ◽

Harjeet Kaur ◽

Prateek Agrawal

Keyword(s):

Combinatorial Optimization ◽

Ant Colony Optimization ◽

Optimization Problem ◽

Traditional Approach ◽

Combinatorial Optimization Problem ◽

Ant Colony ◽

Ant Colony Optimization Algorithm ◽

Analysis And Synthesis ◽

Heuristic Technique ◽

Travelling Salesperson Problem

Ant colony optimization is a heuristic technique that has been applied to a number of combinatorial optimization problems and is based on the foraging behavior of ants. The travelling salesperson problem (TSP) is a combinatorial optimization problem which requires that each city be visited exactly once. In this research paper we use the K-means clustering technique and an Enhanced Ant Colony Optimization algorithm to solve the TSP. We compare the traditional approach with the proposed approach. The simulated results show that the proposed algorithm performs better than the traditional approach.
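
For readers unfamiliar with the traditional approach, a minimal sketch of a basic ant colony optimization loop for the TSP is given below. It is not the paper's enhanced algorithm and omits the K-means clustering step. The function names, the parameter values (`alpha`, `beta`, `rho`, the number of ants and iterations), and the four-city example are assumptions chosen for illustration, and cities are assumed to have distinct coordinates.

```python
import numpy as np

def tour_length(dist, tour):
    """Length of the closed tour that returns to its starting city."""
    n = len(tour)
    return sum(dist[tour[i], tour[(i + 1) % n]] for i in range(n))

def aco_tsp(coords, n_ants=10, n_iters=50, alpha=1.0, beta=2.0, rho=0.5, seed=0):
    """Basic ant colony optimization for the TSP (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    eta = 1.0 / (dist + np.eye(n))   # closeness heuristic; the diagonal is never used
    tau = np.ones((n, n))            # pheromone level on each edge
    best_tour, best_len = None, np.inf
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [int(rng.integers(n))]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                cur = tour[-1]
                cand = sorted(unvisited)
                # Prefer edges with more pheromone and shorter distance.
                weights = tau[cur, cand] ** alpha * eta[cur, cand] ** beta
                nxt = int(rng.choice(cand, p=weights / weights.sum()))
                tour.append(nxt)
                unvisited.remove(nxt)
            tours.append(tour)
        tau *= 1.0 - rho             # pheromone evaporation
        for tour in tours:
            length = tour_length(dist, tour)
            if length < best_len:
                best_tour, best_len = tour, length
            # Shorter tours deposit more pheromone on their edges.
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                tau[a, b] += 1.0 / length
                tau[b, a] += 1.0 / length
    return best_tour, best_len

# Four hypothetical cities on the corners of a unit square.
cities = [(0, 0), (1, 0), (1, 1), (0, 1)]
tour, length = aco_tsp(cities)
print(tour, length)
```

Each ant builds a tour that visits every city once, and the pheromone update biases later ants toward edges that appeared in short tours.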
