Learning Sparse Log-Ratios for High-Throughput Sequencing Data

Author(s):  
Elliott Gordon-Rodriguez ◽  
Thomas P. Quinn ◽ 
John P Cunningham

Abstract
Motivation: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
Results: Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable, and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite, and microRNA benchmark datasets, as well as on a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.
Availability: The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.
Supplementary information: Supplementary data are available at Bioinformatics online.
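The core idea, a discrete selection problem relaxed into a differentiable one, can be sketched in a few lines. The following is a minimal illustration of that relaxation, not the authors' CoDaCoRe implementation: the function name, the tanh parameterization, the penalty weight, and the 0.5 threshold are all assumptions made for the example.

    # Minimal sketch: relax the discrete numerator/denominator/excluded choice for
    # each feature into a continuous weight, fit it by gradient descent, then
    # discretize by thresholding. Illustrative only.
    import torch

    def fit_relaxed_log_ratio(X, y, n_steps=500, lr=0.1):
        """X: (n_samples, n_features) float tensor of positive counts; y: (n_samples,) float 0/1 labels."""
        logX = torch.log(X)                                # work with log-transformed parts
        a = torch.zeros(X.shape[1], requires_grad=True)    # one soft assignment per feature
        bias = torch.zeros(1, requires_grad=True)
        opt = torch.optim.Adam([a, bias], lr=lr)
        for _ in range(n_steps):
            w = torch.tanh(a)                              # relaxed weights in (-1, 1):
                                                           # ~+1 numerator, ~-1 denominator, ~0 excluded
            score = logX @ w + bias                        # relaxed log-ratio ("balance") per sample
            loss = torch.nn.functional.binary_cross_entropy_with_logits(score, y)
            loss = loss + 1e-3 * w.abs().sum()             # sparsity-encouraging penalty
            opt.zero_grad()
            loss.backward()
            opt.step()
        w = torch.tanh(a).detach()
        numerator = (w > 0.5).nonzero().flatten().tolist()     # discretize the relaxation
        denominator = (w < -0.5).nonzero().flatten().tolist()
        return numerator, denominator

Once the relaxation has converged, only the strongly weighted features are kept, and the final biomarker is the ordinary log-ratio between the two recovered groups.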

2021 ◽  
Author(s):  
Elliott Gordon-Rodriguez ◽  
Thomas P. Quinn ◽  
John P. Cunningham

AbstractThe automatic discovery of interpretable features that are associated to an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features are the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.
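For context, the kind of continuous relaxation of a discrete latent variable that this line of work builds on can be written down compactly. The snippet below is a generic relaxed-Bernoulli (Concrete/Gumbel-Softmax style) gate; the names and temperature value are assumptions, and this is not the specific relaxation used in the paper.

    # A discrete "include this variable?" decision replaced by a temperature-
    # controlled continuous gate, so inclusion becomes differentiable.
    import torch

    def relaxed_bernoulli_gate(logits, temperature):
        """Differentiable approximation of a Bernoulli(sigmoid(logits)) sample."""
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        return torch.sigmoid((logits + logistic_noise) / temperature)

    # As the temperature is annealed toward zero, the gates approach hard 0/1
    # choices, recovering a discrete subset of variables at the end of training.
    logits = torch.zeros(10, requires_grad=True)
    gates = relaxed_bernoulli_gate(logits, temperature=0.5)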


Kybernetes ◽  
2014 ◽  
Vol 43 (9/10) ◽  
pp. 1483-1499 ◽  
Author(s):  
Christopher Garcia

Purpose – The purpose of this paper is to provide an effective solution for a complex planning problem encountered in heavy industry. The problem entails selecting a set of projects to produce from a larger set of solicited projects and simultaneously scheduling their production to maximize profit. Each project, if accepted, must be shipped within its due window. Additionally, there is a limited inventory buffer where lots produced early are stored. Because scheduling affects which projects may be selected and vice versa, this is a particularly difficult combinatorial optimization problem.
Design/methodology/approach – The authors develop an algorithm based on the Metaheuristic for Randomized Priority Search (Meta-RaPS), as well as a greedy heuristic and an integer programming (IP) model. The authors then perform computational experiments on a large set of benchmark problems spanning a wide range of characteristics to compare the performance of each method in terms of solution quality and time required.
Findings – The paper shows that this problem is very difficult to solve using IP, with even small instances unable to be solved optimally. It then shows that both proposed algorithms often outperform IP by a large margin while running in seconds. Meta-RaPS is particularly robust, consistently producing the best or very near-best solutions.
Practical implications – The Meta-RaPS algorithm developed enables companies facing this problem to achieve higher profits through improved decision making. Moreover, the algorithm is relatively easy to implement.
Originality/value – This research provides an effective solution for a difficult combinatorial optimization problem encountered in heavy industry that has not previously been addressed in the literature.
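To make the construction logic concrete, here is a minimal sketch of a Meta-RaPS-style randomized greedy builder for the selection-and-scheduling step. It is illustrative only: it is not the authors' algorithm, it omits the inventory-buffer constraint, and the priority rule, the parameters p and r, and the dictionary keys are assumptions.

    # Meta-RaPS flavor: with probability p take the top-priority feasible project,
    # otherwise pick at random from those within r of the best priority value;
    # repeat the construction many times and keep the most profitable schedule.
    import random

    def meta_raps_select(projects, p=0.7, r=0.15, iterations=1000, seed=0):
        """projects: list of dicts with keys 'profit', 'duration', 'due_end'."""
        rng = random.Random(seed)
        best_profit, best_schedule = 0.0, []
        for _ in range(iterations):
            time_now, profit, schedule = 0.0, 0.0, []
            remaining = list(projects)
            while remaining:
                # A project is feasible if, started now, it finishes by its due date.
                feasible = [pr for pr in remaining if time_now + pr['duration'] <= pr['due_end']]
                if not feasible:
                    break
                feasible.sort(key=lambda pr: pr['profit'] / pr['duration'], reverse=True)
                if rng.random() < p:
                    chosen = feasible[0]                      # greedy: top-priority project
                else:
                    best_rate = feasible[0]['profit'] / feasible[0]['duration']
                    pool = [pr for pr in feasible
                            if pr['profit'] / pr['duration'] >= (1 - r) * best_rate]
                    chosen = rng.choice(pool)                 # restricted random pick
                time_now += chosen['duration']
                profit += chosen['profit']
                schedule.append(chosen)
                remaining.remove(chosen)
            if profit > best_profit:
                best_profit, best_schedule = profit, schedule
        return best_profit, best_schedule

In the full method, the constructed solutions would also respect the buffer capacity and could be passed to an improvement phase, which this sketch leaves out.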


2016 ◽  
Vol 45 (4) ◽  
pp. 73-87 ◽  
Author(s):  
Gregory Brian Gloor ◽  
Jean M. Macklaim ◽  
Michael Vu ◽  
Andrew D. Fernandes

High-throughput sequencing generates sparse compositional data, yet these datasets are rarely analyzed using a compositional approach. In addition, the variation inherent in these datasets is rarely acknowledged, but ignoring it can result in many false positive inferences. Using an in vitro selection dataset with an outside standard of truth, we demonstrate that examining point estimates of the data can produce false positive results, even with appropriate zero-replacement approaches. We illustrate the variation inherent in real high-throughput sequencing datasets and show that this variation can be approximated, and hence accounted for, by Monte-Carlo sampling from the Dirichlet distribution. This approximation is problematic when used by itself, but becomes useful when coupled with a log-ratio approach commonly used in compositional data analysis. Thus, the approach illustrated here, which merges Bayesian estimation with principles of compositional data analysis, should be generally useful for high-dimensional count compositional data of the type generated by high-throughput sequencing.
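A compact sketch of that Dirichlet Monte-Carlo plus log-ratio idea might look as follows. It is illustrative rather than any specific package's implementation; the prior of 0.5 pseudo-counts, the number of draws, and the function name are assumptions.

    # Approximate the technical variation of one sample's counts by Monte-Carlo
    # draws from a Dirichlet posterior over the underlying proportions, then carry
    # each draw into log-ratio space via the centred log-ratio (CLR) transform.
    import numpy as np

    def dirichlet_clr_instances(counts, n_draws=128, prior=0.5, seed=0):
        """counts: 1-D array of read counts for one sample. Returns (n_draws, D) CLR values."""
        rng = np.random.default_rng(seed)
        draws = rng.dirichlet(counts + prior, size=n_draws)          # Monte-Carlo proportions
        log_draws = np.log(draws)
        return log_draws - log_draws.mean(axis=1, keepdims=True)     # CLR transform

The prior pseudo-counts also serve as the zero replacement, so features with zero reads contribute plausible, rather than degenerate, values to every draw.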


2018 ◽  
Vol 54 (5) ◽ 
pp. 72 ◽ 
Author(s):  
Quoc, H.D. ◽  
Kien, N.T. ◽  
Thuy, T.T.C. ◽  
Hai, L.H. ◽  
Thanh, V.N.

2011 ◽  
Vol 1 (1) ◽  
pp. 88-92
Author(s):  
Pallavi Arora ◽  
Harjeet Kaur ◽  
Prateek Agrawal

Ant Colony Optimization (ACO) is a heuristic technique, inspired by the foraging behavior of ants, that has been applied to a number of combinatorial optimization problems. The Travelling Salesperson Problem (TSP) is a combinatorial optimization problem that requires each city to be visited exactly once. In this paper we use K-means clustering together with an Enhanced Ant Colony Optimization algorithm to solve the TSP, and we compare the traditional approach with the proposed one. The simulation results show that the proposed algorithm outperforms the traditional approach.
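A rough sketch of the cluster-first, route-second structure the abstract describes is given below. It is an illustrative simplification: the exact "Enhanced" ACO rules and the way cluster tours are joined are not specified here, so a basic ant colony search and simple concatenation stand in, and all parameter values are assumptions.

    # Partition cities with K-means, build a tour inside each cluster with a basic
    # ant colony search, and concatenate the cluster tours.
    import numpy as np
    from sklearn.cluster import KMeans

    def aco_tour(coords, n_ants=20, n_iters=50, alpha=1.0, beta=3.0, rho=0.5, seed=0):
        rng = np.random.default_rng(seed)
        n = len(coords)
        if n < 3:
            return list(range(n))
        dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=2) + 1e-9
        tau = np.ones((n, n))                                  # pheromone matrix
        eta = 1.0 / dist                                       # heuristic visibility
        best_tour, best_len = list(range(n)), np.inf
        for _ in range(n_iters):
            tours = []
            for _ in range(n_ants):
                tour = [int(rng.integers(n))]
                unvisited = set(range(n)) - {tour[0]}
                while unvisited:
                    i = tour[-1]
                    cand = np.array(sorted(unvisited))
                    weights = (tau[i, cand] ** alpha) * (eta[i, cand] ** beta)
                    nxt = int(rng.choice(cand, p=weights / weights.sum()))
                    tour.append(nxt)
                    unvisited.remove(nxt)
                tours.append(tour)
            tau *= (1 - rho)                                   # pheromone evaporation
            for tour in tours:
                length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
                for k in range(n):                             # deposit pheromone on used edges
                    tau[tour[k], tour[(k + 1) % n]] += 1.0 / length
                if length < best_len:
                    best_len, best_tour = length, tour
        return best_tour

    def clustered_tsp(coords, n_clusters=4, seed=0):
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(coords)
        tour = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            tour.extend(int(idx[i]) for i in aco_tour(coords[idx]))
        return tour

Clustering shrinks each ACO subproblem, which is what makes the combined approach faster than running a single colony over all cities at once.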

