Surrogate minimal depth as an importance measure for variables in random forests

2019 ◽  
Vol 35 (19) ◽  
pp. 3663-3671 ◽  
Author(s):  
Stephan Seifert ◽  
Sven Gundlach ◽  
Silke Szymczak

Abstract Motivation It has been shown that the machine learning approach random forest can be successfully applied to omics data, such as gene expression data, for classification or regression and to select variables that are important for prediction. However, the complex relationships between predictor variables, in particular between causal predictor variables, make the interpretation of currently applied variable selection techniques difficult. Results Here we propose a new variable selection approach called surrogate minimal depth (SMD) that incorporates surrogate variables into the concept of minimal depth (MD) variable importance. Applying SMD, we show that simulated correlation patterns can be reconstructed and that the increased consideration of variable relationships improves variable selection. When compared with existing state-of-the-art methods and MD, SMD has higher empirical power to identify causal variables while the resulting variable lists are equally stable. In conclusion, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcome in a high-dimensional data setting. Availability and implementation https://github.com/StephanSeifert/SurrogateMinimalDepth. Supplementary information Supplementary data are available at Bioinformatics online.
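As a rough illustration of the minimal depth idea described in the abstract above, the following sketch computes, for one toy tree, the shallowest depth at which each variable is used. It first counts only primary split variables (MD), then also counts surrogate variables attached to a node, as a simplified stand-in for SMD. The `Node` class, its fields, and the toy tree are hypothetical. This is not the implementation in the SurrogateMinimalDepth package, and it leaves out how surrogates are chosen and how depths are aggregated over a forest.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical tree node: one primary split variable plus surrogates."""
    variable: str
    surrogates: List[str] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def minimal_depth(root: Optional[Node], use_surrogates: bool = False) -> Dict[str, int]:
    """Return the shallowest depth (root = 0) at which each variable occurs."""
    depths: Dict[str, int] = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node is None:  # missing child, i.e. a leaf below this point
            continue
        names = [node.variable]
        if use_surrogates:
            names += node.surrogates  # surrogates count like the primary variable
        for name in names:
            if name not in depths or depth < depths[name]:
                depths[name] = depth
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return depths


# Toy tree (assumed): x2 never splits a node itself,
# but it is a surrogate for x1 at the root.
tree = Node("x1", ["x2"],
            left=Node("x3"),
            right=Node("x4", ["x3"], left=Node("x5")))

md = minimal_depth(tree)                        # primary splits only
smd = minimal_depth(tree, use_surrogates=True)  # primary splits and surrogates
```

In this toy tree, `x2` has no entry in `md` because it is never a primary split variable, whereas `smd` assigns it depth 0 through its surrogate relationship with `x1`.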

Author(s):  
Alain J Mbebi ◽  
Hao Tong ◽  
Zoran Nikoloski

Abstract Motivation Genomic selection (GS) is currently deemed the most effective approach to speed up breeding of agricultural varieties. It has been recognized that consideration of multiple traits in GS can improve accuracy of prediction for traits of low heritability. However, since GS forgoes statistical testing with the idea of improving predictions, it does not facilitate mechanistic understanding of the contribution of particular single nucleotide polymorphisms (SNPs). Results Here, we propose an L2,1-norm regularized multivariate regression model and devise a fast and efficient iterative optimization algorithm, called L2,1-joint, applicable in multi-trait GS. The usage of the L2,1-norm facilitates variable selection in a penalized multivariate regression that considers the relation between individuals, when the number of SNPs is much larger than the number of individuals. The capacity for variable selection allows us to define master regulators that can be used in a multi-trait GS setting to dissect the genetic architecture of the analyzed traits. Our comparative analyses demonstrate that the proposed model is a favorable candidate compared to existing state-of-the-art approaches. Prediction and variable selection with datasets from Brassica napus, wheat and Arabidopsis thaliana diversity panels are conducted to further showcase the performance of the proposed model. Availability and implementation The model is implemented in the R programming language and the code is freely available from https://github.com/alainmbebi/L21-norm-GS. Supplementary information Supplementary data are available at Bioinformatics online.
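To make concrete why an L2,1-norm penalty leads to variable selection, the sketch below computes the L2,1-norm of a coefficient matrix (the sum of the Euclidean norms of its rows) and applies the standard row-wise soft-thresholding step associated with that penalty. It assumes rows index SNPs and columns index traits, which the abstract does not state, and the coefficients are made up. It is not the authors' L2,1-joint algorithm.

```python
import math
from typing import List

Matrix = List[List[float]]


def l21_norm(B: Matrix) -> float:
    """Sum over rows of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in B)


def row_soft_threshold(B: Matrix, tau: float) -> Matrix:
    """Proximal operator of tau * ||B||_{2,1}: shrink each row towards zero.

    Rows whose norm is at most tau become exactly zero, which removes that
    row (here: that SNP, by assumption) for all columns (traits) at once.
    """
    out = []
    for row in B:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


# Made-up coefficients: rows = SNPs, columns = traits (assumed layout).
B = [[3.0, 4.0],   # row norm 5
     [0.3, 0.4],   # row norm about 0.5
     [0.0, 0.0]]   # already zero

total = l21_norm(B)
shrunk = row_soft_threshold(B, tau=1.0)
```

With `tau=1.0`, the first row is scaled down but kept, while the second row is set to zero in every column, which is the row-wise sparsity that the penalty encourages.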


Author(s):  
Tao Jiang ◽  
Yuanyuan Li ◽  
Alison A Motsinger-Reif

Abstract Motivation The recently proposed knockoff filter is a general framework for controlling the false discovery rate (FDR) when performing variable selection. This powerful new approach generates a ‘knockoff’ of each variable tested for exact FDR control. Imitation variables that mimic the correlation structure found within the original variables serve as negative controls for statistical inference. Current applications of knockoff methods use linear regression models and conduct variable selection only for variables existing in model functions. Here, we extend the use of knockoffs for machine learning with boosted trees, which are successful and widely used in problems where no prior knowledge of model function is required. However, currently available importance scores in tree models are insufficient for variable selection with FDR control. Results We propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. We extend the current knockoff method to model-free variable selection through the use of tree-based models. Additionally, we propose and evaluate two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test and compare these methods with the original knockoff method regarding their ability to control type I errors and power. In simulation tests, we compare the properties and performance of importance test statistics of tree models. The results include different combinations of knockoffs and importance test statistics. We consider scenarios that include main-effect, interaction, exponential and second-order models while assuming the true model structures are unknown. We apply our algorithm for tumor purity estimation and tumor classification using Cancer Genome Atlas (TCGA) gene expression data. Our results show improved discrimination between difficult-to-discriminate cancer types. 
Availability and implementation The proposed algorithm is included in the KOBT package, which is available at https://cran.r-project.org/web/packages/KOBT/index.html. Supplementary information Supplementary data are available at Bioinformatics online.
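The selection step of the knockoff filter can be sketched independently of how the knockoffs and importance statistics are produced. The example below assumes one statistic `W[j]` per variable, where a large positive value means the original variable looked more important than its knockoff, and applies the data-dependent threshold of the original knockoff filter (the "knockoff+" variant when `plus=True`). The statistics are invented, and nothing here reproduces the KOBT package or its tree-based importance scores.

```python
from typing import List, Sequence


def knockoff_threshold(W: Sequence[float], q: float, plus: bool = True) -> float:
    """Smallest t among the nonzero |W_j| whose estimated FDP is <= q.

    The false discovery proportion is estimated as
    (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}),
    with offset 1 for knockoff+ and 0 otherwise. Returns inf if no t qualifies.
    """
    offset = 1 if plus else 0
    for t in sorted({abs(w) for w in W if w != 0}):
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select(W: Sequence[float], q: float, plus: bool = True) -> List[int]:
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_threshold(W, q, plus)
    return [j for j, w in enumerate(W) if w >= t]


# Invented statistics for eight variables.
W = [4.0, 3.5, -0.5, 3.0, 0.2, 2.5, -0.1, 2.0]
threshold = knockoff_threshold(W, q=0.25)
selected = select(W, q=0.25)
```

The negative statistics act as the negative controls mentioned in the abstract: they estimate how many of the large positive statistics could be false discoveries.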


Author(s):  
Lun Hu ◽  
Jun Zhang ◽  
Xiangyu Pan ◽  
Hong Yan ◽  
Zhu-Hong You

Abstract Motivation Clustering analysis in a biological network aims to group biological entities into functional modules, thus providing valuable insight into complex biological systems. Existing clustering techniques make use of lower-order connectivity patterns at the level of individual biological entities and their connections, but few of them can take into account higher-order connectivity patterns at the level of small network motifs. Results Here, we present a novel clustering framework, namely HiSCF, to identify functional modules based on the higher-order structure information available in a biological network. Taking advantage of a higher-order Markov stochastic process, HiSCF is able to perform the clustering analysis by exploiting a variety of network motifs. When compared with several state-of-the-art clustering models, HiSCF yields the best performance for two practical clustering applications, i.e. protein complex identification and gene co-expression module detection, in terms of accuracy. The promising performance of HiSCF demonstrates that the consideration of higher-order network motifs yields new insight into the analysis of biological networks, such as the identification of overlapping protein complexes and the inference of new signaling pathways, and also reveals the rich higher-order organizational structures present in biological networks. Availability and implementation HiSCF is available at https://github.com/allenv5/HiSCF. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 53 (10) ◽  
pp. 4691-4715 ◽  
Author(s):  
Mohammad Islam Miah ◽  
Salim Ahmed ◽  
Sohrab Zendehboudi ◽  
Stephen Butt

2017 ◽  
Vol 2017 ◽  
pp. 1-12 ◽  
Author(s):  
Andreas Mayr ◽  
Benjamin Hofner ◽  
Elisabeth Waldmann ◽  
Tobias Hepp ◽  
Sebastian Meyer ◽  
...  

Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview on relevant applications of statistical boosting in biomedicine.
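The automated variable selection mentioned above follows from the component-wise structure of statistical boosting: in each iteration every base-learner is fitted to the current residuals, but only the best-fitting one is updated, by a small step. The sketch below shows this for squared-error loss with one simple linear base-learner per centred predictor. The function name, the data, and the settings (`steps`, `nu`) are illustrative assumptions, not taken from the review.

```python
from typing import List, Tuple


def componentwise_l2_boost(X: List[List[float]], y: List[float],
                           steps: int = 50, nu: float = 0.1) -> Tuple[float, List[float]]:
    """Component-wise gradient boosting with squared-error loss.

    Each base-learner is a least-squares line through the origin on one
    centred column of X. Per step, only the base-learner that reduces the
    residual sum of squares most is updated, with step length nu.
    The returned intercept refers to the centred predictors.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    intercept = sum(y) / n
    coef = [0.0] * p
    resid = [yi - intercept for yi in y]
    for _ in range(steps):
        best_j, best_beta, best_rss = -1, 0.0, float("inf")
        for j in range(p):
            sxx = sum(r[j] ** 2 for r in Xc)
            if sxx == 0:  # constant column carries no information
                continue
            beta = sum(r[j] * e for r, e in zip(Xc, resid)) / sxx
            rss = sum((e - beta * r[j]) ** 2 for r, e in zip(Xc, resid))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        if best_j < 0:
            break
        coef[best_j] += nu * best_beta
        resid = [e - nu * best_beta * r[best_j] for r, e in zip(Xc, resid)]
    return intercept, coef


# Made-up data: y depends on the first column only.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0], [6.0, -1.0]]
y = [2.0 * row[0] for row in X]
intercept, coef = componentwise_l2_boost(X, y, steps=20, nu=0.1)
```

Stopping after a limited number of steps leaves the coefficient of the never-selected second column at exactly zero and keeps the first coefficient below its least-squares value of 2, which illustrates both the variable selection and the implicit regularization of effect estimates.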


2020 ◽  
Author(s):  
Julia Duncan ◽  
Lun Li ◽  
Vahid Mohammadrezaei ◽  
Laina Geary

We developed a direct catalytic condensation of benzylic alcohols and primary alcohols to synthesize unsymmetrical ethers in one step, catalyzed by scandium triflate and p-dimethylaminopyridine (DMAP). Preliminary experiments give some insight into the mechanism of the reaction, though they suggest that the process is quite complex. We suspect that the rapid formation of a dimer from a secondary benzylic alcohol via a carbocation intermediate precedes unsymmetrical ether formation. Full experimental details and spectroscopic data are provided as supplementary information.


Author(s):  
Roman Egger ◽  
Oguzcan Gumus ◽  
Elza Kaiumova ◽  
Richard Mükisch ◽  
Veronika Surkic

Abstract Social media plays a key role in shaping the image of a destination. Although recent research has investigated factors influencing online users’ perception of destination image, few studies encompass and compare social media content shared by tourists and destination management organisations (DMOs) at the same time. This paper aims to determine whether the projected image of DMOs corresponds with the destination image perceived by tourists. Taking the Austrian Alpine resort Saalbach-Hinterglemm as a case, a netnographic approach was applied to analyse the visual and textual posts of the DMO and user-generated content (UGC) on Instagram using machine learning. The findings reveal themes that are not covered in the posts published by marketers but do appear in UGC. This study adds to the existing literature by providing a deeper insight into destination image formation and uses a qualitative approach to assess destination brand image. It further highlights practical implications for the industry regarding DMOs’ social media marketing strategy.


2009 ◽  
Vol 6 (1) ◽  
pp. 18-29
Author(s):  
Yassir Semmar

The purpose of this study is to gain a better insight into the reasons that make Qatar University students reluctant to attend professors’ office hours. Factor analysis was first conducted to reveal the components underlying this reluctance; Multivariate Analysis of Variance (MANOVA) was then employed to analyze the effects of gender, GPA, credit hours completed, year of enrollment, and college/major on those factors. Results indicated that professors' competence and demeanor, course characteristics, students' social skills, attitudes/motivation, time conflict/communication style, students' apprehension, and their physical/emotional state were all related to their reluctance to attend office hours. Moreover, the predictor variables of gender, GPA, and credit hours completed had significant effects on several of those seven reluctance factors.


2021 ◽  
Vol 17 (3) ◽  
Author(s):  
Anne Pellikka ◽  
Sonja Lutovac ◽  
Raimo Kaasila

This study examines the relationships between preservice primary teachers’ (PSTs) views, understandings, and implementations of inquiry-based teaching (IBT) in primary biology education. In earlier studies, these relationships have been researched separately. Exploring them simultaneously allows greater insight into the process of teacher change and science teacher identity development. Drawing on the narrative method, data included learning diaries, lesson plans, and interviews during a two-year research period. Our findings reveal the complex relationships between the three aspects of IBT. For example, embracing views of IBT were sometimes accompanied by a significant understanding of IBT and at other times by a weak understanding, whereas hesitant views of IBT also went together with significant understanding. We discuss these relationships in the light of their impact on science teacher identity and provide suggestions for teacher education.

