Surrogate minimal depth as an importance measure for variables in random forests

Stephan Seifert; Sven Gundlach; Silke Szymczak

doi:10.1093/bioinformatics/btz149

Bioinformatics ◽

10.1093/bioinformatics/btz149 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3663-3671 ◽

Cited By ~ 6

Author(s):

Stephan Seifert ◽

Sven Gundlach ◽

Silke Szymczak

Keyword(s):

Variable Selection ◽

Supplementary Information ◽

Predictor Variables ◽

Importance Measure ◽

Surrogate Variables ◽

Machine Learning Approach ◽

Complex Relationships ◽

Minimal Depth ◽

Insight Into ◽

Causal Variables

Abstract. Motivation: It has been shown that the machine learning approach random forest can be successfully applied to omics data, such as gene expression data, for classification or regression and to select variables that are important for prediction. However, the complex relationships between predictor variables, in particular between causal predictor variables, make the interpretation of currently applied variable selection techniques difficult. Results: Here we propose a new variable selection approach called surrogate minimal depth (SMD) that incorporates surrogate variables into the concept of minimal depth (MD) variable importance. Applying SMD, we show that simulated correlation patterns can be reconstructed and that the increased consideration of variable relationships improves variable selection. When compared with existing state-of-the-art methods and MD, SMD has higher empirical power to identify causal variables while the resulting variable lists are equally stable. In conclusion, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcome in a high-dimensional data setting. Availability and implementation: https://github.com/StephanSeifert/SurrogateMinimalDepth. Supplementary information: Supplementary data are available at Bioinformatics online.
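
The abstract above builds on minimal depth (MD), which in its usual definition is the depth of the node closest to the root that splits on a given variable. The following is a minimal sketch of that idea for a single toy tree, not the authors' implementation: the tuple-based tree encoding, the variable names `x1` to `x3`, and the function `minimal_depths` are all hypothetical, and the surrogate-variable extension that distinguishes SMD is not modelled here.

```python
# Toy illustration of minimal depth (MD) for one decision tree.
# Assumption: a node is (split_variable, left_child, right_child);
# a leaf is None. This encoding is hypothetical, chosen for brevity.

def minimal_depths(tree, depth=0, found=None):
    """Return {variable: smallest depth at which it is used for a split}."""
    if found is None:
        found = {}
    if tree is None:  # leaf: nothing to record
        return found
    var, left, right = tree
    # Keep the shallowest depth seen so far for this variable.
    if var not in found or depth < found[var]:
        found[var] = depth
    minimal_depths(left, depth + 1, found)
    minimal_depths(right, depth + 1, found)
    return found


# x2 splits at depth 2 (left branch) and at depth 1 (right branch);
# its minimal depth is therefore 1.
toy_tree = ("x1",
            ("x3", ("x2", None, None), None),
            ("x2", None, None))

depths = minimal_depths(toy_tree)
print(sorted(depths.items()))  # [('x1', 0), ('x2', 1), ('x3', 1)]
```

In a forest, such per-tree values are typically averaged over all trees, and variables with a small average depth are treated as important; the selection threshold used in the paper is not reproduced in this sketch.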

L2,1-norm regularized multivariate regression model with applications to genomic prediction

Bioinformatics ◽

10.1093/bioinformatics/btab212 ◽

2021 ◽

Author(s):

Alain J Mbebi ◽

Hao Tong ◽

Zoran Nikoloski

Keyword(s):

Variable Selection ◽

Regression Model ◽

Multivariate Regression ◽

Statistical Testing ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

Multiple Traits ◽

Multivariate Regression Model ◽

Proposed Model ◽

Iterative Optimization Algorithm

Abstract. Motivation: Genomic selection (GS) is currently deemed the most effective approach to speed up breeding of agricultural varieties. It has been recognized that consideration of multiple traits in GS can improve accuracy of prediction for traits of low heritability. However, since GS forgoes statistical testing with the idea of improving predictions, it does not facilitate mechanistic understanding of the contribution of particular single nucleotide polymorphisms (SNPs). Results: Here, we propose an L2,1-norm regularized multivariate regression model and devise a fast and efficient iterative optimization algorithm, called L2,1-joint, applicable in multi-trait GS. The usage of the L2,1-norm facilitates variable selection in a penalized multivariate regression that considers the relation between individuals, when the number of SNPs is much larger than the number of individuals. The capacity for variable selection allows us to define master regulators that can be used in a multi-trait GS setting to dissect the genetic architecture of the analyzed traits. Our comparative analyses demonstrate that the proposed model is a favorable candidate compared to existing state-of-the-art approaches. Prediction and variable selection with datasets from Brassica napus, wheat and Arabidopsis thaliana diversity panels are conducted to further showcase the performance of the proposed model. Availability and implementation: The model is implemented using the R programming language and the code is freely available from https://github.com/alainmbebi/L21-norm-GS. Supplementary information: Supplementary data are available at Bioinformatics online.
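
To illustrate why an L2,1-norm penalty lends itself to variable selection, here is a minimal sketch of the norm itself: the sum, over the rows of a coefficient matrix, of each row's Euclidean norm. It is not taken from the L21-norm-GS code. The orientation (rows as SNPs, columns as traits), the example matrix `B`, and the helper names `l21_norm` and `selected_rows` are assumptions made for illustration.

```python
import math

def row_norm(row):
    """Euclidean (L2) norm of one row of coefficients."""
    return math.sqrt(sum(v * v for v in row))

def l21_norm(matrix):
    """L2,1-norm: sum over rows of each row's L2 norm."""
    return sum(row_norm(row) for row in matrix)

def selected_rows(matrix, tol=1e-12):
    """Indices of rows that are not entirely (numerically) zero."""
    return [i for i, row in enumerate(matrix) if row_norm(row) > tol]


# Hypothetical coefficient matrix: 3 SNPs (rows) x 2 traits (columns).
B = [
    [3.0, 4.0],  # row norm 5: SNP contributes to both traits
    [0.0, 0.0],  # row norm 0: SNP dropped for every trait at once
    [0.0, 5.0],  # row norm 5: SNP contributes to the second trait only
]

print(l21_norm(B))       # 10.0
print(selected_rows(B))  # [0, 2]
```

Because the penalty acts on whole rows, shrinking a row's norm to zero removes that predictor from all traits simultaneously, which is the sense in which the norm supports joint selection across traits.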

Knockoff boosted tree for model-free variable selection

Bioinformatics ◽

10.1093/bioinformatics/btaa770 ◽

2020 ◽

Author(s):

Tao Jiang ◽

Yuanyuan Li ◽

Alison A Motsinger-Reif

Keyword(s):

Variable Selection ◽

Principal Component ◽

Free Variable ◽

Supplementary Information ◽

Type I ◽

Test Statistics ◽

Linear Regression Models ◽

Model Free ◽

Tree Models ◽

Boosted Tree

Abstract. Motivation: The recently proposed knockoff filter is a general framework for controlling the false discovery rate (FDR) when performing variable selection. This powerful new approach generates a ‘knockoff’ of each variable tested for exact FDR control. Imitation variables that mimic the correlation structure found within the original variables serve as negative controls for statistical inference. Current applications of knockoff methods use linear regression models and conduct variable selection only for variables existing in model functions. Here, we extend the use of knockoffs for machine learning with boosted trees, which are successful and widely used in problems where no prior knowledge of model function is required. However, currently available importance scores in tree models are insufficient for variable selection with FDR control. Results: We propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. We extend the current knockoff method to model-free variable selection through the use of tree-based models. Additionally, we propose and evaluate two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test and compare these methods with the original knockoff method regarding their ability to control type I errors and power. In simulation tests, we compare the properties and performance of importance test statistics of tree models. The results include different combinations of knockoffs and importance test statistics. We consider scenarios that include main-effect, interaction, exponential and second-order models while assuming the true model structures are unknown. We apply our algorithm for tumor purity estimation and tumor classification using Cancer Genome Atlas (TCGA) gene expression data. Our results show improved discrimination between difficult-to-discriminate cancer types. Availability and implementation: The proposed algorithm is included in the KOBT package, which is available at https://cran.r-project.org/web/packages/KOBT/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

HiSCF: leveraging higher-order structures for clustering analysis in biological networks

Bioinformatics ◽

10.1093/bioinformatics/btaa775 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lun Hu ◽

Jun Zhang ◽

Xiangyu Pan ◽

Hong Yan ◽

Zhu-Hong You

Keyword(s):

Biological Networks ◽

Clustering Analysis ◽

Biological Network ◽

Higher Order ◽

Supplementary Information ◽

Network Motifs ◽

Functional Modules ◽

Connectivity Patterns ◽

Biological Entities ◽

Insight Into

Abstract. Motivation: Clustering analysis in a biological network groups biological entities into functional modules, thus providing valuable insight into the understanding of complex biological systems. Existing clustering techniques make use of lower-order connectivity patterns at the level of individual biological entities and their connections, but few of them can take into account higher-order connectivity patterns at the level of small network motifs. Results: Here, we present a novel clustering framework, namely HiSCF, to identify functional modules based on the higher-order structure information available in a biological network. Taking advantage of a higher-order Markov stochastic process, HiSCF is able to perform the clustering analysis by exploiting a variety of network motifs. When compared with several state-of-the-art clustering models, HiSCF yields the best performance for two practical clustering applications, i.e. protein complex identification and gene co-expression module detection, in terms of accuracy. The promising performance of HiSCF demonstrates that the consideration of higher-order network motifs yields new insight into the analysis of biological networks, such as the identification of overlapping protein complexes and the inference of new signaling pathways, and also reveals the rich higher-order organizational structures present in biological networks. Availability and implementation: HiSCF is available at https://github.com/allenv5/HiSCF. Supplementary information: Supplementary data are available at Bioinformatics online.

Machine Learning Approach to Model Rock Strength: Prediction and Variable Selection with Aid of Log Data

Rock Mechanics and Rock Engineering ◽

10.1007/s00603-020-02184-2 ◽

2020 ◽

Vol 53 (10) ◽

pp. 4691-4715 ◽

Cited By ~ 1

Author(s):

Mohammad Islam Miah ◽

Salim Ahmed ◽

Sohrab Zendehboudi ◽

Stephen Butt

Keyword(s):

Machine Learning ◽

Variable Selection ◽

Rock Strength ◽

Strength Prediction ◽

Learning Approach ◽

Log Data ◽

Machine Learning Approach

Insight into β-hairpin stability: a structural and thermodynamic study of diastereomeric β-hairpin mimetics

Electronic supplementary information (ESI) available: temperature and concentration-dependent chemical shifts and melting curves of the investigated molecules in different solvents and details of the X-ray analysis. See http://www.rsc.org/suppdata/nj/b1/b111241d/

New Journal of Chemistry ◽

10.1039/b111241d ◽

2002 ◽

Vol 26 (7) ◽

pp. 834-843 ◽

Cited By ~ 17

Author(s):

Máté Erdélyi ◽

Vratislav Langer ◽

Anders Karlén ◽

Adolf Gogoll

Keyword(s):

Chemical Shifts ◽

Thermodynamic Study ◽

Supplementary Information ◽

X Ray ◽

Melting Curves ◽

Insight Into

An Update on Statistical Boosting in Biomedicine

Computational and Mathematical Methods in Medicine ◽

10.1155/2017/6083072 ◽

2017 ◽

Vol 2017 ◽

pp. 1-12 ◽

Cited By ~ 4

Author(s):

Andreas Mayr ◽

Benjamin Hofner ◽

Elisabeth Waldmann ◽

Tobias Hepp ◽

Sebastian Meyer ◽

...

Keyword(s):

Variable Selection ◽

Target Function ◽

Statistical Modelling ◽

Functional Regression ◽

Learning Approach ◽

Time To Event ◽

Explanatory Variables ◽

Machine Learning Approach ◽

Regression Functions ◽

Boosting Algorithms

Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages such as automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments in statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview of relevant applications of statistical boosting in biomedicine.

Scandium and Dimethylaminopyridine Catalyzed Dehydrative Coupling of Secondary Benzylic and Primary Alcohols to Synthesize Unsymmetrical Ethers

10.26434/chemrxiv.12370574 ◽

2020 ◽

Author(s):

Julia Duncan ◽

Lun Li ◽

Vahid Mohammadrezaei ◽

Laina Geary

Keyword(s):

Spectroscopic Data ◽

Supplementary Information ◽

Primary Alcohols ◽

Benzylic Alcohols ◽

Benzylic Alcohol ◽

Catalytic Condensation ◽

Rapid Formation ◽

One Step ◽

Insight Into ◽

Mechanism Of The Reaction

We developed a direct catalytic condensation of benzylic alcohols and primary alcohols to synthesize unsymmetrical ethers in one step, catalyzed by scandium triflate and p-dimethylaminopyridine (DMAP). Preliminary experiments give some insight into the mechanism of the reaction, though they suggest that the process is quite complex. We suspect that the rapid formation of a dimer from a secondary benzylic alcohol via a carbocation intermediate precedes unsymmetrical ether formation. Full experimental details and spectroscopic data are provided as supplementary information.

Destination Image of DMO and UGC on Instagram: A Machine-Learning Approach

Information and Communication Technologies in Tourism 2022 ◽

10.1007/978-3-030-94751-4_31 ◽

2022 ◽

pp. 343-355

Author(s):

Roman Egger ◽

Oguzcan Gumus ◽

Elza Kaiumova ◽

Richard Mükisch ◽

Veronika Surkic

Keyword(s):

Machine Learning ◽

Social Media ◽

Destination Image ◽

User Generated Content ◽

Learning Approach ◽

Projected Image ◽

Machine Learning Approach ◽

Destination Brand ◽

Practical Implications ◽

Insight Into

Abstract. Social media plays a key role in shaping the image of a destination. Although recent research has investigated factors influencing online users’ perception towards destination image, limited studies encompass and compare social media content shared by tourists and destination management organisations (DMOs) at the same time. This paper aims to determine whether the projected image of DMOs corresponds with the destination image perceived by tourists. By taking the Austrian Alpine resort Saalbach-Hinterglemm as a case, a netnographic approach was applied to analyse the visual and textual posts of DMO and user-generated content (UGC) on Instagram using machine learning. The findings reveal themes that are not covered in the posts published by marketers but do appear in UGC. This study adds to the existing literature by providing a deeper insight into destination image formation and uses a qualitative approach to assess destination brand image. It further highlights practical implications for the industry regarding DMOs’ social media marketing strategy.

A Cross-Cultural, Exploratory Study of Students' Reluctance to Attend Office Hours

Learning and Teaching in Higher Education Gulf Perspectives ◽

10.18538/lthe.v6.n1.02 ◽

2009 ◽

Vol 6 (1) ◽

pp. 18-29

Author(s):

Yassir Semmar

Keyword(s):

Multivariate Analysis ◽

Exploratory Study ◽

Emotional State ◽

College Major ◽

Cross Cultural ◽

Communication Style ◽

Predictor Variables ◽

Credit Hours ◽

Course Characteristics ◽

Insight Into

The purpose of this study is to gain a better insight into the reasons that make Qatar University students reluctant to attend professors’ office hours. Factor analysis was first conducted to reveal the components underlying this reluctance; Multivariate Analysis of Variance (MANOVA) was then employed to analyze the effects of gender, GPA, credit hours completed, year of enrollment, and college/major on those factors. Results indicated that professors' competence and demeanor, course characteristics, students' social skills, attitudes/motivation, time conflict/communication style, students' apprehension, and their physical/emotional state were all related to their reluctance to attend office hours. Moreover, the predictor variables of gender, GPA, and credit hours completed had significant effects on several of those seven reluctance factors.

How relationships between views, understandings, and implementations of inquiry-based teaching in biology contribute to science teacher identity

Nordic Studies in Science Education ◽

10.5617/nordina.7854 ◽

2021 ◽

Vol 17 (3) ◽

Author(s):

Anne Pellikka ◽

Sonja Lutovac ◽

Raimo Kaasila

Keyword(s):

Teacher Education ◽

Identity Development ◽

Science Teacher ◽

Teacher Identity ◽

Biology Education ◽

Primary Teachers ◽

Complex Relationships ◽

Research Period ◽

Insight Into ◽

Science Teacher Identity

This study examines the relationships between preservice primary teachers’ (PSTs) views, understandings, and implementations of inquiry-based teaching (IBT) in primary biology education. In earlier studies, these relationships have been researched separately. Exploring them simultaneously allows a greater insight into the process of teacher change and science teacher identity development. Drawing on the narrative method, data included learning diaries, lesson plans, and interviews during a two-year research period. Our findings reveal the complex relationships between these three aspects of IBT. For example, embracing views of IBT were sometimes accompanied by a significant understanding of IBT and other times by a weak understanding, whereas hesitant views of IBT also went together with significant understanding. We discuss these relationships in the light of their impact on science teacher identity and provide suggestions for teacher education.
