Logic models to predict continuous outputs based on binary inputs with an application to personalized cancer therapy

2016 ◽  
Author(s):  
Theo Knijnenburg ◽  
Gunnar Klau ◽  
Francesco Iorio ◽  
Mathew Garnett ◽  
Ultan McDermott ◽  
...  

Mining large datasets using machine learning approaches often leads to models that are hard to interpret and not amenable to the generation of hypotheses that can be experimentally tested. Finding 'actionable knowledge' is becoming more important, but also more challenging as datasets grow in size and complexity. We present 'Logic Optimization for Binary Input to Continuous Output' (LOBICO), a computational approach that infers small and easily interpretable logic models of binary input features that explain a binarized continuous output variable. Although the continuous output variable is binarized prior to optimization, the continuous information is retained to find the optimal logic model. Applying LOBICO to a large cancer cell line panel, we find that logic combinations of multiple mutations are more predictive of drug response than single gene predictors. Importantly, we show that the use of the continuous information leads to robust and more accurate logic models. LOBICO is formulated as an integer programming problem, which enables rapid computation on large datasets. Moreover, LOBICO implements the ability to uncover logic models around predefined operating points in terms of sensitivity and specificity. As such, it represents an important step towards practical application of interpretable logic models.
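
To make the idea above concrete, here is a minimal sketch, not the authors' integer-programming implementation, of inferring a small logic model from binary mutation features and a continuous drug-response variable: the output is binarized at a threshold, but each cell line is weighted by its distance from that threshold so the continuous information still guides the search. All names and the toy data are illustrative.

```python
# Minimal sketch of the LOBICO idea (not the authors' ILP formulation):
# search small AND/OR combinations of binary features that minimize a
# continuous-weighted misclassification of the binarized response.
import itertools
import numpy as np

def weighted_error(pred, y_bin, weights):
    """Continuous-weighted misclassification of a binary prediction."""
    return np.sum(weights * (pred != y_bin))

def fit_small_logic_model(X, y, threshold, max_literals=2):
    """Exhaustively search AND / OR models with up to `max_literals` inputs."""
    y_bin = (y < threshold).astype(int)     # e.g. 1 = sensitive to the drug
    weights = np.abs(y - threshold)         # retain the continuous information
    best = (np.inf, None)
    for k in range(1, max_literals + 1):
        for combo in itertools.combinations(range(X.shape[1]), k):
            cols = list(combo)
            for op_name, op in (("AND", np.all), ("OR", np.any)):
                pred = op(X[:, cols] == 1, axis=1).astype(int)
                err = weighted_error(pred, y_bin, weights)
                if err < best[0]:
                    best = (err, (op_name, combo))
    return best

# Toy example: 20 cell lines, 5 binary mutation features, simulated responses
# in which sensitivity is driven by an AND pair of mutations.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 5))
y = rng.normal(size=20) - 1.5 * (X[:, 0] & X[:, 2])
print(fit_small_logic_model(X, y, threshold=np.median(y)))
```

The actual LOBICO formulation solves this search as an integer program, which scales to far more features and larger logic formulas than the brute-force loop shown here, and can target predefined sensitivity/specificity operating points.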

Eye ◽  
2021 ◽  
Author(s):  
Lutfiah Al-Turk ◽  
James Wawrzynski ◽  
Su Wang ◽  
Paul Krause ◽  
George M. Saleh ◽  
...  

Abstract Background In diabetic retinopathy (DR) screening programmes, feature-based grading guidelines are used by human graders. However, recent deep learning approaches have focused on end-to-end learning based on labelled data at the whole-image level. Most predictions from such software offer a direct grading output without information about the retinal features responsible for the grade. In this work, we demonstrate a feature-based retinal image analysis system, which aims to support flexible grading and monitoring of progression. Methods The system was evaluated against images that had been graded according to two different grading systems: the International Clinical Diabetic Retinopathy and Diabetic Macular Oedema Severity Scale and the UK’s National Screening Committee guidelines. Results External evaluation on large datasets collected from three nations (Kenya, Saudi Arabia and China) was carried out. At the DR referable level, sensitivity did not vary significantly between the different DR grading schemes (91.2–94.2%) and specificity was excellent, above 93% in all image sets. More importantly, no cases of severe non-proliferative DR, proliferative DR or DMO were missed. Conclusions We demonstrate the potential of an AI feature-based DR grading system that is not constrained to any specific grading scheme.
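
For reference, the referable-level sensitivity and specificity figures quoted above are standard confusion-matrix quantities; the short sketch below shows how they are computed from binary referable/non-referable labels. The toy arrays are illustrative and unrelated to the evaluation datasets.

```python
# Hedged sketch: computing referable-DR sensitivity and specificity from
# binary ground-truth and predicted labels (1 = referable, 0 = non-referable).
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn), tn / (tn + fp)

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])   # illustrative grader labels
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 1])   # illustrative system output
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```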


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2927 ◽  
Author(s):  
Linh Nguyen ◽  
Cuong C Dang ◽  
Pedro J. Ballester

Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than K-fold cross-validation. Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than the RF-based multi-gene models, at the cost of generally having poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by multi-gene RF classifiers. Among the drugs with the most predictive RF models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: We now know that this type of model can predict in vitro tumour response to these drugs. These models can thus be further investigated on in vivo tumour models.
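
The validation strategy described in the Methods can be illustrated with a short, hedged sketch: train a Random Forest on cell lines from an earlier GDSC release and test on lines screened later, rather than shuffling everything into K folds. The synthetic data below stands in for the real GDSC expression matrix, binarized IC50 labels and release dates; it is not the study's preprocessing.

```python
# Hedged sketch of time-split validation with a Random Forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_lines, n_genes = 500, 200            # real study: ~501 lines, 13,321 genes
expr = rng.normal(size=(n_lines, n_genes))
# Binarized drug sensitivity with signal in the first two genes (synthetic).
sensitive = (expr[:, 0] + 0.5 * expr[:, 1] + rng.normal(size=n_lines) > 1).astype(int)
# Pretend the first 350 lines come from an earlier GDSC release.
release = np.where(np.arange(n_lines) < 350, "earlier", "later")

train, test = release == "earlier", release == "later"
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(expr[train], sensitive[train])
pred = rf.predict(expr[test])
print("precision", round(precision_score(sensitive[test], pred), 2),
      "recall", round(recall_score(sensitive[test], pred), 2))
```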


2020 ◽  
Author(s):  
Kenny F Chou ◽  
Virginia Best ◽  
H Steven Colburn ◽  
Kamal Sen

Abstract Listening in an acoustically cluttered scene remains a difficult task for both machines and hearing-impaired listeners. Normal-hearing listeners accomplish this task with relative ease by segregating the scene into its constituent sound sources, then selecting and attending to a target source. An assistive listening device that mimics the biological mechanisms underlying this behavior may provide an effective solution for those with difficulty listening in acoustically cluttered environments (e.g., a cocktail party). Here, we present a binaural sound segregation algorithm based on a hierarchical network model of the auditory system. In the algorithm, binaural sound inputs first drive populations of neurons tuned to specific spatial locations and frequencies. Lateral inhibition then sharpens the spatial response of the neurons. Finally, the spiking responses of neurons in the output layer are reconstructed into audible waveforms via a novel reconstruction method. We evaluate the performance of the algorithm with psychoacoustic measures of normal-hearing listeners. This two-microphone algorithm is shown to provide listeners with perceptual benefit similar to that of a 16-microphone acoustic beamformer in a difficult listening task. Unlike deep-learning approaches, the proposed algorithm is biologically interpretable and does not need to be trained on large datasets. This study presents a biologically based algorithm for sound source segregation as well as a method to reconstruct highly intelligible audio signals from spiking models. Author Summary Animals and humans can navigate complex auditory environments with relative ease, attending to certain sounds while suppressing others. Typically, different sounds originate from different spatial locations. This paper presents an algorithmic model that performs sound segregation based on how animals make use of this spatial information at various stages of the auditory pathway. We show that this two-microphone algorithm provides as much benefit to normal-hearing listeners as a multi-microphone algorithm. Unlike mathematical and machine-learning approaches, our model is fully interpretable and does not require training with large datasets. Such an approach may benefit the design of machine hearing algorithms. To interpret the spike trains generated in the model, we designed a method to recover sounds from model spikes with high intelligibility. This method can be applied to spiking neural networks for audio-related applications, or to interpret each node within a spiking model of the auditory cortex.
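
As a toy illustration of the lateral-inhibition stage described above, the sketch below sharpens the response of a bank of spatially tuned channels by subtracting activity from neighbouring channels. The tuning curve, inhibition strength and channel layout are illustrative assumptions, not parameters of the published network model.

```python
# Illustrative sketch: lateral inhibition sharpening spatially tuned responses.
import numpy as np

azimuths = np.linspace(-90, 90, 13)                    # spatially tuned channels (deg)
response = np.exp(-0.5 * ((azimuths - 20) / 30) ** 2)  # broad tuning to a source at +20 deg

def lateral_inhibition(r, strength=0.6):
    """Each channel is inhibited by the mean of its two nearest neighbours
    (wrapping at the ends for simplicity); negative values are clipped to zero."""
    neighbours = 0.5 * (np.roll(r, 1) + np.roll(r, -1))
    return np.clip(r - strength * neighbours, 0.0, None)

sharpened = lateral_inhibition(response)
print(np.round(response, 2))
print(np.round(sharpened, 2))   # narrower peak around the target location
```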


Cancers ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 239
Author(s):  
Robert Cardnell ◽  
Lauren Byers ◽  
Jing Wang

Both the benefit and the burden of contemporary techniques for the molecular characterization of samples lie in the vast amount of data generated. In the era of “big data”, it has become imperative that we develop multi-disciplinary teams combining scientists, clinicians, and data analysts. In this review, we discuss a number of approaches developed by our University of Texas MD Anderson Lung Cancer Multidisciplinary Program to process and utilize such large datasets with the goal of identifying rational therapeutic options for biomarker-driven patient subsets. Large integrated datasets such as The Cancer Genome Atlas (TCGA) for patient samples and the Cancer Cell Line Encyclopedia (CCLE) for tumor-derived cell lines include genomic, transcriptomic, methylation, miRNA, and proteomic profiling alongside clinical data. Making the best use of these datasets to address urgent questions, such as whether we can define molecular subtypes of disease with specific therapeutic vulnerabilities, quantify states such as epithelial-to-mesenchymal transition that are associated with treatment resistance, or identify potential therapeutic agents in models of cancer resistant to standard treatments, required the development of tools for systematic, unbiased, high-throughput analysis. Together, such tools, used in a multi-disciplinary environment, can be leveraged to identify novel treatments for molecularly defined subsets of cancer patients, which can then be rapidly translated from benchtop to bedside.


2016 ◽  
Author(s):  
Linh C. Nguyen ◽  
Cuong C. Dang ◽  
Pedro J. Ballester

Abstract Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than K-fold cross-validation. Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than the RF-based multi-gene models, at the cost of generally having poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by multi-gene RF classifiers. Among the drugs with the most predictive RF models, we found pyrimethamine, sunitinib and 17-AAG.
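
The precision/recall trade-off summarized above can be illustrated with made-up numbers: a single-gene marker flags only the few cell lines carrying the mutation (high precision, low recall), whereas a multi-gene classifier recovers more of the sensitive lines. The arrays below are purely illustrative and not taken from the study.

```python
# Toy illustration of the precision/recall contrast between a single-gene
# marker and a multi-gene classifier; all labels are fabricated.
import numpy as np
from sklearn.metrics import precision_score, recall_score

sensitive   = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # true drug sensitivity
single_gene = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # marker mutated in 2 lines
multi_gene  = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0])  # multi-gene classifier call

for name, pred in [("single-gene marker", single_gene), ("multi-gene model", multi_gene)]:
    print(name,
          "precision", round(precision_score(sensitive, pred), 2),
          "recall", round(recall_score(sensitive, pred), 2))
```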


2020 ◽  
Vol 20 (S8) ◽  
Author(s):  
Biao An ◽  
Qianwen Zhang ◽  
Yun Fang ◽  
Ming Chen ◽  
Yufang Qin

Abstract Background Prediction of drug response based on multi-omics data is a crucial task in the research of personalized cancer therapy. Results We proposed an iterative sure independent ranking and screening (ISIRS) scheme to select drug response-associated features and applied it to the Cancer Cell Line Encyclopedia (CCLE) dataset. For each drug in CCLE, we incorporated multi-omics data including copy number alterations, mutation and gene expression and selected up to 50 features using ISIRS. Then a linear regression model based on the selected features was exploited to predict the drug response. Cross-validation tests show that our prediction accuracies are higher than those of existing methods for most drugs. Conclusions Our study indicates that the features selected by the marginal utility measure, which measures the conditional probability of drug responses given the feature, are helpful for drug response prediction.
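
A hedged sketch of the overall scheme, screening features by a marginal association measure and then fitting a linear regression on the survivors, is given below. A plain correlation screen stands in for the paper's marginal utility measure, and the data shapes and names are illustrative.

```python
# Sketch of screen-then-regress drug response prediction on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2000))          # cell lines x multi-omics features
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)  # drug response

def screen_features(X, y, k=50):
    """Rank features by |marginal correlation| with the response, keep the top k."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]

selected = screen_features(X, y, k=50)
model = LinearRegression()
print(cross_val_score(model, X[:, selected], y, cv=5, scoring="r2").mean())
```

In practice the screening step should be nested inside each cross-validation fold to avoid optimistic bias; the sketch omits this for brevity.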


Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 296 ◽  
Author(s):  
Luca Albergante ◽  
Evgeny Mirkes ◽  
Jonathan Bac ◽  
Huidong Chen ◽  
Alexis Martin ◽  
...  

Multidimensional datapoint clouds representing large datasets are frequently characterized by non-trivial low-dimensional geometry and topology which can be recovered by unsupervised machine learning approaches, in particular, by principal graphs. Principal graphs approximate the multivariate data by a graph injected into the data space with some constraints imposed on the node mapping. Here we present ElPiGraph, a scalable and robust method for constructing principal graphs. ElPiGraph exploits and further develops the concept of elastic energy, the topological graph grammar approach, and a gradient descent-like optimization of the graph topology. The method is able to withstand high levels of noise and is capable of approximating data point clouds via principal graph ensembles. This strategy can be used to estimate the statistical significance of complex data features and to summarize them into a single consensus principal graph. ElPiGraph deals efficiently with large datasets in various fields such as biology, where it can be used for example with single-cell transcriptomic or epigenomic datasets to infer gene expression dynamics and recover differentiation landscapes.
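
As a rough illustration of the elastic-energy idea, the sketch below scores a candidate principal graph as the sum of a data-approximation term, an edge-stretching penalty and a star-bending penalty. The coefficients, the tiny graph and the exact penalty forms are simplified stand-ins, not the ElPiGraph implementation.

```python
# Simplified elastic-energy score for a candidate principal graph.
import numpy as np

def elastic_energy(X, nodes, edges, lam=0.01, mu=0.1):
    # Approximation term: mean squared distance from each point to its nearest node.
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    approx = d2.min(axis=1).mean()
    # Stretching term: penalize long edges.
    stretch = lam * sum(((nodes[i] - nodes[j]) ** 2).sum() for i, j in edges)
    # Bending term: penalize each node's deviation from the mean of its neighbours.
    bend = 0.0
    for k in range(len(nodes)):
        nbrs = [j for i, j in edges if i == k] + [i for i, j in edges if j == k]
        if len(nbrs) >= 2:
            bend += mu * ((nodes[nbrs].mean(axis=0) - nodes[k]) ** 2).sum()
    return approx + stretch + bend

# Toy data: a 2-D point cloud scored against a three-node chain graph.
X = np.random.default_rng(2).normal(size=(200, 2))
nodes = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
edges = [(0, 1), (1, 2)]
print(elastic_energy(X, nodes, edges))
```

In the full method, graph-grammar operations propose changes to the node set and topology, and candidates are compared by an energy of this kind.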

