binomialRF: Interpretable combinatoric efficiency of random forests to identify biomarker interactions

2019 ◽  
Author(s):  
Samir Rachid Zaim ◽  
Colleen Kenost ◽  
Joanne Berghout ◽  
Wesley Chiu ◽  
Liam Wilson ◽  
...  

Abstract Background In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcripts) than samples (i.e., mice or human samples) in a study, this poses major statistical challenges in biomarker detection tasks, as traditional statistical techniques are underpowered in high dimension. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers [1] are widely used [2–7] due to their flexibility, powerful performance, robustness to “P predictors ≫ subjects N” difficulties, and their ability to rank features. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions. Methods binomialRF treats each tree in an RF as a correlated but exchangeable binary trial. It determines importance by constructing a test statistic based on a feature’s selection frequency to compute its rank, nominal p-value, and multiplicity-adjusted q-value using a one-sided hypothesis test with a correlated binomial distribution. A distributional adjustment addresses the co-dependencies among trees, as these trees subsample from the same dataset. The proposed algorithm efficiently identifies multiway nonlinear interactions by generalizing the test statistic to count sub-trees. Results In simulations and in studies of the Madelon benchmark dataset, binomialRF showed computational gains (up to 30 to 600 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone. Conclusion binomialRF extends upon previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data. Availability GitHub: https://github.com/SamirRachidZaim/binomialRF Supplementary information Supplementary analyses and results are available at https://github.com/SamirRachidZaim/binomialRF_simulationStudy
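As an illustration of the hypothesis test described in the Methods above, the short sketch below (in Python; binomialRF itself is distributed as an R package) counts how many trees select a given feature and compares that count against a correlated, exchangeable binomial null using a normal approximation with correlation-inflated variance. The null selection probability p0, the between-tree correlation rho, and the function names are illustrative assumptions, not the package's API; the published method additionally applies a multiplicity adjustment (q-values) across features.

import numpy as np
from scipy import stats

def correlated_binomial_pvalue(s_j, n_trees, p0, rho):
    """One-sided p-value for a feature selected by s_j of n_trees trees,
    under a null selection probability p0 and between-tree correlation rho.
    Normal approximation to a correlated (exchangeable) binomial count."""
    mean = n_trees * p0
    var = n_trees * p0 * (1 - p0) * (1 + (n_trees - 1) * rho)  # correlation-inflated variance
    z = (s_j - mean) / np.sqrt(var)
    return stats.norm.sf(z)  # P(S >= s_j) under the null

# Hypothetical example: 500 trees, 1000 candidate features (null p0 ~ 1/1000),
# modest tree-to-tree correlation, and a feature chosen by 12 trees.
print(correlated_binomial_pvalue(s_j=12, n_trees=500, p0=1 / 1000, rho=0.05))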

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Samir Rachid Zaim ◽  
Colleen Kenost ◽  
Joanne Berghout ◽  
Wesley Chiu ◽  
Liam Wilson ◽  
...  

Abstract Background In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcripts) than samples (i.e., mice or human samples) in a study, this poses major statistical challenges in biomarker detection tasks, as traditional statistical techniques are underpowered in high dimension. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, ability to rank features, and robustness to the “P ≫ N” high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions. Results In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (up to 5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone. Conclusion binomialRF extends upon previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.


2021 ◽  
Author(s):  
Sandeep Kaur ◽  
Neblina Sikta ◽  
Andrea Schafferhans ◽  
Nicola Bordin ◽  
Mark J. Cowley ◽  
...  

Abstract Motivation Variant analysis is a core task in bioinformatics that requires integrating data from many sources. This process can be helped by using 3D structures of proteins, which provide a spatial context that offers insight into how variants affect function. Many available tools can help with mapping variants onto structures, but each has specific restrictions, with the result that many researchers fail to benefit from valuable insights that could be gained from structural data. Results To address this, we have created a streamlined system for incorporating 3D structures into variant analysis. Variants can be specified via URLs that are easy to read and write and that use the notation recommended by the Human Genome Variation Society (HGVS). For example, ‘https://aquaria.app/SARS-CoV-2/S/?N501Y’ specifies the N501Y variant of the SARS-CoV-2 S protein. In addition to mapping variants onto structures, our system provides summary information from multiple external resources, including COSMIC, CATH-FunVar, and PredictProtein. Furthermore, our system identifies and summarizes structures containing the variant, as well as the variant position. Our system supports essentially any mutation for any well-studied protein, and uses all available structural data, including models inferred via very remote homology, integrated into a system that is fast and simple to use. By giving researchers easy, streamlined access to a wealth of structural information during variant analysis, our system will help reveal novel insights into the molecular mechanisms underlying protein function in health and disease. Availability Our resource is freely available at the project home page (https://aquaria.app). After peer review, the code will be openly available via a GPL version 2 license at https://github.com/ODonoghueLab/Aquaria. PSSH2, the database of sequence-to-structure alignments, is also freely available for download at https://zenodo.org/record/[email protected] Supplementary information None.
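The URL scheme in the example above lends itself to scripting; the sketch below builds Aquaria variant URLs by filling an '<organism>/<protein>/?<variant>' pattern inferred from the single example given in the abstract. Treat the general pattern (and any combination other than SARS-CoV-2/S/?N501Y) as an assumption rather than documented behaviour of the service.

from urllib.parse import quote

def aquaria_variant_url(organism: str, protein: str, variant: str) -> str:
    """Build a variant URL following the pattern of the example in the abstract."""
    return f"https://aquaria.app/{quote(organism)}/{quote(protein)}/?{quote(variant)}"

print(aquaria_variant_url("SARS-CoV-2", "S", "N501Y"))
# https://aquaria.app/SARS-CoV-2/S/?N501Y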


2020 ◽  
Author(s):  
Olga Lazareva ◽  
Hoan Van Do ◽  
Stefan Canzar ◽  
Kevin Yuan ◽  
Jan Baumbach ◽  
...  

Abstract Motivation Unsupervised learning approaches are frequently employed to identify patient subgroups and biomarkers such as disease-associated genes. Thus, clustering and biclustering are powerful techniques often used with expression data, but are usually not suitable to unravel molecular mechanisms along with patient subgroups. To alleviate this, we developed the network-constrained biclustering approach BiCoN (Biclustering Constrained by Networks) which (i) restricts biclusters to functionally related genes connected in molecular interaction networks and (ii) maximizes the difference in gene expression between two subgroups of patients. Results Our analyses of non-small cell lung and breast cancer gene expression data demonstrate that BiCoN clusters patients in agreement with known cancer subtypes while discovering gene subnetworks pointing to functional differences between these subtypes. Furthermore, we show that BiCoN is robust to noise and batch effects and can distinguish between high and low load of tumor-infiltrating leukocytes while identifying subnetworks related to immune cell function. In summary, BiCoN is a powerful new systems medicine tool to stratify patients while elucidating the responsible disease mechanism. Availability PyPI package: https://pypi.org/project/bicon Web interface: https://exbio.wzw.tum.de/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
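To make constraints (i) and (ii) concrete, the sketch below scores a candidate bicluster by requiring the genes to form a connected subgraph of a molecular interaction network and by measuring the expression difference between two patient groups. This is a conceptual sketch of the two constraints only, not the bicon package's API or its actual optimization; all names are illustrative.

import networkx as nx
import pandas as pd

def score_candidate_bicluster(expr: pd.DataFrame, network: nx.Graph,
                              genes: list, patients_a: list, patients_b: list) -> float:
    """expr: genes x patients expression matrix (DataFrame).
    Returns -inf if the genes do not form a connected subnetwork, otherwise the
    absolute difference in mean expression between the two patient groups."""
    if not genes or not all(g in network for g in genes):
        return float("-inf")
    if not nx.is_connected(network.subgraph(genes)):          # constraint (i)
        return float("-inf")
    mean_a = expr.loc[genes, patients_a].to_numpy().mean()
    mean_b = expr.loc[genes, patients_b].to_numpy().mean()
    return abs(mean_a - mean_b)                               # objective (ii)

A search procedure would then propose gene sets and patient splits and keep the highest-scoring connected candidates.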


2020 ◽  
Vol 36 (10) ◽  
pp. 3093-3098 ◽  
Author(s):  
Saeid Parvandeh ◽  
Hung-Wen Yeh ◽  
Martin P Paulus ◽  
Brett A McKinney

Abstract Summary Feature selection can improve the accuracy of machine-learning models, but appropriate steps must be taken to avoid overfitting. Nested cross-validation (nCV) is a common approach that chooses the classification model and features to represent a given outer fold based on features that give the maximum inner-fold accuracy. Differential privacy is a related technique to avoid overfitting that uses a privacy-preserving noise mechanism to identify features that are stable between training and holdout sets. We develop consensus nested cross-validation (cnCV) that combines the idea of feature stability from differential privacy with nCV. Feature selection is applied in each inner fold and the consensus of top features across folds is used as a measure of feature stability or reliability instead of classification accuracy, which is used in standard nCV. We use simulated data with main effects, correlation and interactions to compare the classification accuracy and feature selection performance of the new cnCV with standard nCV, Elastic Net optimized by cross-validation, differential privacy and private evaporative cooling (pEC). We also compare these methods using real RNA-seq data from a study of major depressive disorder. The cnCV method has similar training and validation accuracy to nCV, but cnCV has much shorter run times because it does not construct classifiers in the inner folds. The cnCV method chooses a more parsimonious set of features with fewer false positives than nCV. The cnCV method has similar accuracy to pEC and cnCV selects stable features between folds without the need to specify a privacy threshold. We show that cnCV is an effective and efficient approach for combining feature selection with classification. Availability and implementation Code available at https://github.com/insilico/cncv. Supplementary information Supplementary data are available at Bioinformatics online.
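The consensus step lends itself to a compact illustration: the sketch below ranks features in each inner fold of one outer-training set and keeps the features that appear in every inner fold's top-k list. The univariate f_classif scorer, k, and the fold counts are illustrative choices rather than the cncv package's defaults.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import StratifiedKFold

def consensus_features(X, y, n_inner=5, top_k=20):
    """Features that rank in the top_k of every inner fold (consensus set)."""
    folds = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=0)
    top_sets = []
    for train_idx, _ in folds.split(X, y):
        scores, _ = f_classif(X[train_idx], y[train_idx])     # per-fold feature scores
        top_sets.append(set(np.argsort(scores)[::-1][:top_k]))
    return set.intersection(*top_sets)                        # stable across all inner folds

X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)
print(sorted(consensus_features(X, y)))

In the full procedure this consensus set, rather than inner-fold classification accuracy, is what each outer fold passes on for model training and validation.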


2020 ◽  
Author(s):  
Saeid Parvandeh ◽  
Hung-Wen Yeh ◽  
Martin P. Paulus ◽  
Brett A. McKinney

Abstract Motivation Feature selection can improve the accuracy of machine learning models, but appropriate steps must be taken to avoid overfitting. Nested cross-validation (nCV) is a common approach that chooses the classification model and features to represent a given outer fold based on features that give the maximum inner-fold accuracy. Differential privacy is a related technique to avoid overfitting that uses a privacy-preserving noise mechanism to identify features that are stable between training and holdout sets. Methods We develop consensus nested CV (cnCV) that combines the idea of feature stability from differential privacy with nested CV. Feature selection is applied in each inner fold and the consensus of top features across folds is used as a measure of feature stability or reliability instead of classification accuracy, which is used in standard nCV. We use simulated data with main effects, correlation, and interactions to compare the classification accuracy and feature selection performance of the new cnCV with standard nCV, Elastic Net optimized by CV, differential privacy, and private Evaporative Cooling (pEC). We also compare these methods using real RNA-Seq data from a study of major depressive disorder. Results The cnCV method has similar training and validation accuracy to nCV, but cnCV has much shorter run times because it does not construct classifiers in the inner folds. The cnCV method chooses a more parsimonious set of features with fewer false positives than nCV. The cnCV method has similar accuracy to pEC and cnCV selects stable features between folds without the need to specify a privacy threshold. We show that cnCV is an effective and efficient approach for combining feature selection with classification. Availability Code available at https://github.com/insilico/cncv Contact [email protected] Supplementary information:


Author(s):  
Anna L Tyler ◽  
Baha El Kassaby ◽  
Georgi Kolishovski ◽  
Jake Emerson ◽  
Ann E Wells ◽  
...  

Abstract It is well understood that variation in relatedness among individuals, or kinship, can lead to false genetic associations. Multiple methods have been developed to adjust for kinship while maintaining power to detect true associations. However, the effects of kinship on genetic interaction test statistics remain relatively unstudied. Here we performed a survey of kinship effects on studies of six commonly used mouse populations. We measured inflation of main effect test statistics, genetic interaction test statistics, and interaction test statistics reparametrized by the Combined Analysis of Pleiotropy and Epistasis (CAPE). We also performed linear mixed model (LMM) kinship corrections using two types of kinship matrices: an overall kinship matrix calculated from the full set of genotyped markers, and a reduced kinship matrix, which left out markers on the chromosome(s) being tested. We found that test statistic inflation varied across populations and was driven largely by linkage disequilibrium. In contrast, there was no observable inflation in the genetic interaction test statistics. CAPE statistics were inflated at a level in between that of the main effects and the interaction effects. The overall kinship matrix overcorrected the inflation of main effect statistics relative to the reduced kinship matrix. The two types of kinship matrices had similar effects on the interaction statistics and CAPE statistics, although the overall kinship matrix trended toward a more severe correction. In conclusion, we recommend using an LMM kinship correction for both main effects and genetic interactions, and further recommend that the kinship matrix be calculated from a reduced set of markers in which the chromosomes being tested are omitted from the calculation. This is particularly important in populations with substantial population structure, such as recombinant inbred lines in which genomic replicates are used.
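As a concrete illustration of the two kinship matrices compared above, the sketch below computes an overall kinship matrix from all genotyped markers and a reduced matrix that omits markers on the chromosome(s) being tested. The standardized cross-product estimator and all names are illustrative assumptions, not the authors' code.

import numpy as np

def kinship(genotypes: np.ndarray) -> np.ndarray:
    """Realized-relationship style kinship from a samples x markers matrix (0/1/2):
    standardize each marker, then average the cross-products over markers."""
    g = genotypes.astype(float)
    g = (g - g.mean(axis=0)) / (g.std(axis=0) + 1e-12)
    return g @ g.T / g.shape[1]

def reduced_kinship(genotypes: np.ndarray, marker_chrom: np.ndarray,
                    tested_chroms) -> np.ndarray:
    """Kinship computed only from markers NOT on the tested chromosome(s)."""
    keep = ~np.isin(marker_chrom, tested_chroms)
    return kinship(genotypes[:, keep])

# Hypothetical example: 100 mice, 1000 markers over 20 chromosomes; test chromosome 7.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(100, 1000))
chrom = rng.integers(1, 21, size=1000)
K_all = kinship(G)
K_reduced = reduced_kinship(G, chrom, tested_chroms=[7])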


Author(s):  
Zaheer Ahmed ◽  
Alberto Cassese ◽  
Gerard van Breukelen ◽  
Jan Schepers

Abstract We present a novel method, REMAXINT, that captures the gist of two-way interaction in row by column (i.e., two-mode) data, with one observation per cell. REMAXINT is a probabilistic two-mode clustering model that yields two-mode partitions with maximal interaction between row and column clusters. For estimation of the parameters of REMAXINT, we maximize a conditional classification likelihood in which the random row (or column) main effects are conditioned out. For testing the null hypothesis of no interaction between row and column clusters, we propose a max-F test statistic and discuss its properties. We develop a Monte Carlo approach to obtain its sampling distribution under the null hypothesis. We evaluate the performance of the method through simulation studies. Specifically, for selected values of data size and (true) numbers of clusters, we obtain critical values of the max-F statistic, determine the empirical Type I error rate of the proposed inferential procedure, and study its power to reject the null hypothesis. Next, we show that the novel method is useful in a variety of applications by presenting two empirical case studies and end with some concluding remarks.
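For readers who want to see what an interaction statistic of this kind looks like for a fixed pair of partitions, the sketch below computes a conventional ANOVA-style interaction F ratio for given row and column clusterings of a data matrix. REMAXINT itself searches for the partitions that maximize such a statistic (hence max-F) and calibrates it by Monte Carlo under the no-interaction null; the decomposition below is a stand-in illustration, not the paper's exact statistic.

import numpy as np

def interaction_F(X, row_labels, col_labels):
    """F-type interaction statistic for given clusterings.
    row_labels / col_labels: 0-based consecutive cluster indices for rows / columns."""
    X = np.asarray(X, dtype=float)
    row_labels, col_labels = np.asarray(row_labels), np.asarray(col_labels)
    grand = X.mean()
    K, L = row_labels.max() + 1, col_labels.max() + 1
    row_means = np.array([X[row_labels == k].mean() for k in range(K)])
    col_means = np.array([X[:, col_labels == l].mean() for l in range(L)])
    ss_int = ss_res = 0.0
    df_int, df_res = (K - 1) * (L - 1), X.size - K * L
    for k in range(K):
        for l in range(L):
            block = X[np.ix_(row_labels == k, col_labels == l)]
            cell_mean = block.mean()
            ss_int += block.size * (cell_mean - row_means[k] - col_means[l] + grand) ** 2
            ss_res += ((block - cell_mean) ** 2).sum()
    return (ss_int / df_int) / (ss_res / df_res)

# Hypothetical example: a block-specific shift that main effects alone cannot explain.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
X[:15, :10] += 1.0
print(interaction_F(X, np.repeat([0, 1], 15), np.repeat([0, 1], 10)))

In the Monte Carlo step, the statistic's null distribution would be obtained by repeating the maximization on data simulated without row-by-column interaction.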


2021 ◽  
Vol 5 (1) ◽  
pp. 10
Author(s):  
Mark Levene

A bootstrap-based hypothesis test of the goodness-of-fit for the marginal distribution of a time series is presented. Two metrics, the empirical survival Jensen–Shannon divergence (ESJS) and the Kolmogorov–Smirnov two-sample test statistic (KS2), are compared on four data sets—three stablecoin time series and a Bitcoin time series. We demonstrate that, after applying first-order differencing, all the data sets fit heavy-tailed α-stable distributions with 1<α<2 at the 95% confidence level. Moreover, ESJS is more powerful than KS2 on these data sets, since the widths of the derived confidence intervals for KS2 are, proportionately, much larger than those of ESJS.
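A minimal sketch of the bootstrap goodness-of-fit idea, using the KS2 metric and scipy's levy_stable distribution: compare the first-differenced series against repeated draws from a hypothesized α-stable law and summarize the metric with a bootstrap confidence interval. The stable parameters, the synthetic input series, and the replicate count are placeholders (in practice the parameters would be estimated from the data first), and the ESJS metric used in the paper is not reproduced here.

import numpy as np
from scipy.stats import ks_2samp, levy_stable

rng = np.random.default_rng(0)
prices = np.cumsum(rng.standard_t(df=3, size=1001))     # stand-in for a price series
returns = np.diff(prices)                               # first-order differencing

alpha, beta = 1.7, 0.0                                  # hypothesized stable parameters
ks_stats = []
for _ in range(200):                                    # bootstrap replicates
    simulated = levy_stable.rvs(alpha, beta, size=returns.size, random_state=rng)
    ks_stats.append(ks_2samp(returns, simulated).statistic)

lo, hi = np.percentile(ks_stats, [2.5, 97.5])
print(f"KS2 95% bootstrap interval: [{lo:.3f}, {hi:.3f}]")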


2021 ◽  
Vol 19 (2) ◽  
pp. 769-785 ◽  
Author(s):  
Erwan Sallard ◽  
José Halloy ◽  
Didier Casane ◽  
Etienne Decroly ◽  
Jacques van Helden

Abstract SARS-CoV-2 is a new human coronavirus (CoV), which emerged in China in late 2019 and is responsible for the global COVID-19 pandemic that caused more than 97 million infections and 2 million deaths in 12 months. Understanding the origin of this virus is an important issue, and it is necessary to determine the mechanisms of viral dissemination in order to contain future epidemics. Based on phylogenetic inferences, sequence analysis and structure–function relationships of coronavirus proteins, informed by the knowledge currently available on the virus, we discuss the different scenarios, natural or synthetic, for the origin of the virus. The data currently available are not sufficient to firmly assert whether SARS-CoV-2 results from a zoonotic emergence or from an accidental escape of a laboratory strain. This question needs to be resolved because it has important consequences for the risk/benefit balance of our interactions with ecosystems, for intensive breeding of wild and domestic animals, for some laboratory practices, and for scientific policy and biosafety regulations. Regardless of the origin of COVID-19, studying the evolution of the molecular mechanisms involved in the emergence of pandemic viruses is essential to develop therapeutic and vaccine strategies and to prevent future zoonoses. This article is a translation and update of a French article published in Médecine/Sciences, August/September 2020 (10.1051/medsci/2020123).


Author(s):  
Markus Ekvall ◽  
Michael Höhle ◽  
Lukas Käll

Abstract Motivation Permutation tests offer a straightforward framework to assess the significance of differences in sample statistics. A significant advantage of permutation tests is that relatively few assumptions about the distribution of the test statistic are needed, as they rely only on the assumption of exchangeability of the group labels. They have great value, as they allow a sensitivity analysis to determine the extent to which the assumed large-sample distribution of the test statistic applies. However, in this situation permutation tests are rarely applied, because the running time of naïve implementations is too slow and grows exponentially with the sample size. Nevertheless, continued development in the 1980s introduced dynamic programming algorithms that compute exact permutation tests in polynomial time. Despite this significant reduction in running time, the exact test has not yet become one of the predominant statistical tests for medium sample sizes. Here, we propose a computational parallelization of one such dynamic programming-based permutation test, the Green algorithm, which makes the permutation test more attractive. Results Parallelization of the Green algorithm was found to be possible through a non-trivial rearrangement of the structure of the algorithm. A speed-up of several orders of magnitude is achievable by executing the parallelized algorithm on a GPU. We demonstrate that execution time essentially becomes a non-issue, even for sample sizes as high as hundreds of samples. This improvement makes our method an attractive alternative to, e.g., the widely used asymptotic Mann-Whitney U-test. Availability and implementation Python 3 code is available from the GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest under an Apache 2.0 license. Supplementary information Supplementary data are available at Bioinformatics online.
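To make the dynamic-programming idea concrete, the sketch below implements a serial, exact permutation test of the group-sum statistic for non-negative integer scores: a table counts, for each subset size and each attainable sum, how many relabelings achieve it, and the one-sided p-value is the fraction of same-sized subsets whose sum is at least the observed one. This is only an illustration of the Green-style DP; the GPU rearrangement described in the paper, and the handling of real-valued data (which would first be scaled and rounded to integers), are not shown.

import numpy as np
from math import comb

def exact_perm_pvalue(group_a, group_b):
    """Exact one-sided p-value for sum(group_a) being large, over all relabelings
    of the pooled non-negative integer scores into groups of the same sizes."""
    pooled = np.concatenate([group_a, group_b]).astype(int)
    m, n, total = len(group_a), len(pooled), int(pooled.sum())
    observed = int(np.sum(group_a))
    # counts[k, s] = number of size-k subsets of the pooled scores with sum s
    counts = np.zeros((m + 1, total + 1), dtype=object)   # Python ints: no overflow
    counts[0, 0] = 1
    for x in pooled:
        for k in range(m, 0, -1):                         # descending k: each score used once
            counts[k, x:] += counts[k - 1, :total + 1 - x]
    extreme = counts[m, observed:].sum()                  # subsets at least as extreme
    return extreme / comb(n, m)

a = np.array([5, 7, 8, 9])                                # hypothetical integer-coded group A
b = np.array([1, 2, 3, 4, 6])
print(exact_perm_pvalue(a, b))                            # small, since A holds the largest scores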

