Bayesian network feature finder (BANFF): an R package for gene network feature selection: Table 1.

2016 ◽  
pp. btw522 ◽  
Author(s):  
Zhou Lan ◽  
Yize Zhao ◽  
Jian Kang ◽  
Tianwei Yu
2021 ◽  
Vol 15 (4) ◽  
pp. 1-46
Author(s):  
Kui Yu ◽  
Lin Liu ◽  
Jiuyong Li

In this article, we aim to develop a unified view of causal and non-causal feature selection methods. The unified view will fill in the gap in the research of the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective. That is to find the Markov blanket of a class attribute, the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify the assumptions by mapping them to the restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximations employed by the methods in their search, which then result in the approximations in the feature sets found by the methods with respect to the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive the error bounds of both types of methods. Finally, we present practical understanding of the relation between causal and non-causal methods using extensive experiments with synthetic data and various types of real-world data.
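The Markov blanket named in the abstract above can be illustrated with a minimal sketch. In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and the other parents of its children. The small DAG and the function name `markov_blanket` below are hypothetical and are not taken from the article.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps every node to the set of its parents (hypothetical input format).
    """
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Spouses are the other parents of those children.
    spouses = {p for child in children for p in parents[child]} - {target}
    return set(parents.get(target, ())) | children | spouses


# Hypothetical network: A -> C, B -> C, C -> D, E -> D, D -> F
dag = {
    "A": set(),
    "B": set(),
    "E": set(),
    "C": {"A", "B"},
    "D": {"C", "E"},
    "F": {"D"},
}

print(sorted(markov_blanket(dag, "C")))
```

Here `F` is left out because, given `D`, it carries no further information about `C` under the network's independence assumptions.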


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10849
Author(s):  
Maximilian Knoll ◽  
Jennifer Furkel ◽  
Juergen Debus ◽  
Amir Abdollahi

Background: Model building is a crucial part of omics-based biomedical research to transfer classifications and obtain insights into underlying mechanisms. Feature selection is often based on minimizing error between model predictions and given classification (maximizing accuracy). Human ratings/classifications, however, might be error prone, with discordance rates between experts of 5–15%. We therefore evaluate if a feature pre-filtering step might improve identification of features associated with true underlying groups. Methods: Data were simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with the ground truth comprising 2–10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as classification. For feature preselection, standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation on switched dependent (classification) and independent (features) variables. Preselected features were used to train logistic regression/linear models (backward selection, AIC). Predictions were compared against the ground truth (ROC, multiclass-ROC). As a use case, multiple feature selection/classification methods were benchmarked against the novel heuristic to identify prognostically different G-CIMP negative glioblastoma tumors from the TCGA-GBM 450k methylation array data cohort, starting from a fuzzy UMAP-based rough and erroneous separation. Results: V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3–10 groups). Lower fractions of models were successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54–1.00) for V1 (Benjamini-Hochberg) and 0.70 (0.28–1.00) for V2; 13% (n = 616) of V2 models showed AUCs ≤ 50% for 25 samples and 100 features. For larger numbers of features and samples, median AUCs were 0.75 (range 0.59–1.00) for V1 and 0.54 (range 0.32–0.75) for V2. In the TCGA-GBM data, modelBuildR allowed the best prognostic separation of patients, with the highest median overall survival difference (7.51 months), followed by a difference of 6.04 months for a random-forest-based method. Conclusions: The proposed heuristic is beneficial for the retrieval of features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify (comparative) evaluation/application of the proposed heuristic (http://github.com/mknoll/modelBuildR).
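The abstract above names Benjamini-Hochberg as a multiplicity adjustment applied after univariate testing. The following is a minimal sketch of that adjustment alone, written in plain Python. It is not code from modelBuildR, and the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-feature p-values from univariate tests.
pvals = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(pvals)

# Features kept at a false discovery rate threshold of 0.05.
selected = [i for i, q in enumerate(adj) if q <= 0.05]
print(adj, selected)
```

In a pre-filtering step of this kind, only the features in `selected` would be passed on to model fitting. The threshold of 0.05 is an assumption, not a value stated in the abstract.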


2017 ◽  
Vol 34 (9) ◽  
pp. 1571-1573 ◽  
Author(s):  
Xiao-Fei Zhang ◽  
Le Ou-Yang ◽  
Shuo Yang ◽  
Xiaohua Hu ◽  
Hong Yan

Author(s):  
Edmund Jones ◽  
Vanessa Didelez

In one procedure for finding the maximal prime decomposition of a Bayesian network or undirected graphical model, the first step is to create a minimal triangulation of the network, and a common and straightforward way to do this is to create a triangulation that is not necessarily minimal and then thin this triangulation by removing excess edges. We show that the algorithm for thinning proposed in several previous publications is incorrect. A different version of this algorithm is available in the R package gRbase, but its correctness has not previously been proved. We prove that this version is correct and provide a simpler version, also with a proof. We compare the speed of the two corrected algorithms in three ways and find that asymptotically their speeds are the same, neither algorithm is consistently faster than the other, and in a computer experiment the algorithm used by gRbase is faster when the original graph is large, dense, and undirected, but usually slightly slower when it is directed.
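To make the idea of thinning concrete, here is a deliberately naive sketch. It is not the algorithm from the article and not the one in gRbase. It relies on the classical result that a triangulation is minimal exactly when removing any single added (fill) edge leaves a non-chordal graph, so it keeps deleting fill edges while the graph stays chordal. All function names and the example graph are hypothetical, and the approach is far slower than a purpose-built thinning algorithm.

```python
from itertools import combinations


def is_chordal(adj):
    """Check chordality by repeatedly eliminating a simplicial vertex."""
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        # A vertex is simplicial when its neighbours are pairwise adjacent.
        simplicial = next(
            (v for v, ns in g.items()
             if all(b in g[a] for a, b in combinations(ns, 2))),
            None,
        )
        if simplicial is None:
            return False
        for n in g[simplicial]:
            g[n].discard(simplicial)
        del g[simplicial]
    return True


def thin(adj, fill_edges):
    """Remove fill edges one at a time while the graph remains chordal."""
    g = {v: set(ns) for v, ns in adj.items()}
    remaining = set(fill_edges)
    changed = True
    while changed:
        changed = False
        for edge in sorted(remaining, key=sorted):
            a, b = tuple(edge)
            g[a].discard(b)
            g[b].discard(a)
            if is_chordal(g):
                remaining.discard(edge)
                changed = True
            else:
                # Removal broke chordality, so put the edge back.
                g[a].add(b)
                g[b].add(a)
    return g, remaining


# Hypothetical example: the 4-cycle a-b-c-d triangulated with both chords.
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
fill = {frozenset(("a", "c")), frozenset(("b", "d"))}
graph = {v: set() for v in "abcd"}
for u, v in cycle + [tuple(e) for e in fill]:
    graph[u].add(v)
    graph[v].add(u)

thinned, kept = thin(graph, fill)
print(sorted(sorted(e) for e in kept))
```

Only one chord is needed to triangulate a 4-cycle, so one of the two fill edges is removed and the other is kept.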


2010 ◽  
Vol 73 (4-6) ◽  
pp. 613-621 ◽  
Author(s):  
C.P. Lim ◽  
S.L. Wang ◽  
K.S. Tan ◽  
J. Navarro ◽  
L.C. Jain

Author(s):  
Daniele Mercatelli ◽  
Gonzalo Lopez-Garcia ◽  
Federico M. Giorgi

Abstract. Motivation: Gene Network Inference and Master Regulator Analysis (MRA) have been widely adopted to define specific transcriptional perturbations from gene expression signatures. Several tools exist to perform such analyses, but most require a computer cluster or large amounts of RAM to be executed. Results: We developed corto, a fast and lightweight R package to infer gene networks and perform MRA from gene expression data, with optional corrections for Copy Number Variations (CNVs) and able to run on signatures generated from RNA-Seq or ATAC-Seq data. We extensively benchmarked it to infer context-specific gene networks in 39 human tumor and 27 normal tissue datasets. Availability: Cross-platform and multi-threaded R package on CRAN (stable version) https://cran.rproject.org/package=corto and GitHub (development release) https://github.com/federicogiorgi/[email protected]


2018 ◽  
Vol 35 (16) ◽  
pp. 2865-2867 ◽  
Author(s):  
Tallulah S Andrews ◽  
Martin Hemberg

Abstract. Motivation: Most genomes contain thousands of genes, but for most functional responses, only a subset of those genes are relevant. To facilitate many single-cell RNASeq (scRNASeq) analyses, the set of genes is often reduced through feature selection, i.e. by removing genes only subject to technical noise. Results: We present M3Drop, an R package that implements popular existing feature selection methods and two novel methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. We show these new methods outperform existing methods on simulated and real datasets. Availability and implementation: M3Drop is freely available on GitHub as an R package and is compatible with other popular scRNASeq tools: https://github.com/tallulandrews/M3Drop. Supplementary information: Supplementary data are available at Bioinformatics online.

