Effective and efficient feature selection for large-scale data using Bayes’ theorem

2009 ◽  
Vol 6 (1) ◽  
pp. 62-71 ◽  
Author(s):  
Subramanian Appavu Alias Balamurugan ◽  
Ramasamy Rajaram

2005 ◽  
Vol 9 (3) ◽  
pp. 237-251 ◽  
Author(s):  
Wei-Chou Chen ◽  
Ming-Chun Yang ◽  
Shian-Shyong Tseng

2014 ◽  
Vol 45 (1) ◽  
pp. 1-34 ◽  
Author(s):  
Ahmed K. Farahat ◽  
Ahmed Elgohary ◽  
Ali Ghodsi ◽  
Mohamed S. Kamel

2017 ◽  
Vol 49 (3) ◽  
pp. 151-159 ◽  
Author(s):  
Zhe Xue ◽  
Jia-Xu Chen ◽  
Yue Zhao ◽  
Barbara Medvar ◽  
Mark A. Knepper

A major challenge in physiology is to exploit the many large-scale data sets available from “-omic” studies to seek answers to key physiological questions. In previous studies, Bayes’ theorem has been used for this purpose. This approach requires a means to map continuously distributed experimental data to probabilities (likelihood values) so that posterior probabilities can be derived from the combination of prior probabilities and new data. Here, we introduce the use of minimum Bayes’ factors for this purpose and illustrate the approach by addressing a physiological question: “Which deubiquitylating enzymes (DUBs) encoded by mammalian genomes are most likely to regulate plasma membrane transport processes in renal cortical collecting duct principal cells?” To do this, we have created a comprehensive online database of 110 DUBs present in the mammalian genome (https://hpcwebapps.cit.nih.gov/ESBL/Database/DUBs/). We used Bayes’ theorem to integrate available information from large-scale data sets derived from proteomic and transcriptomic studies of renal collecting duct cells to rank the 110 known DUBs with regard to their likelihood of interacting with and regulating transport processes. The top-ranked DUBs were OTUB1, USP14, PSMD7, PSMD14, USP7, USP9X, OTUD4, USP10, and UCHL5. Among these, USP7, USP9X, OTUD4, and USP10 are known to be involved in endosomal trafficking and have potential roles in endosomal recycling of plasma membrane proteins in the mammalian cortical collecting duct.
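The updating step described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the formula the authors used, so the sketch assumes Goodman's minimum Bayes factor for a z-score, exp(-z²/2), and assumes that the data sets are independent so their factors multiply. The function names, candidate names, and z-scores are hypothetical.

```python
import math

def log_min_bayes_factor(z):
    """Log of the minimum Bayes factor exp(-z^2 / 2) for a z-score (assumed form)."""
    return -0.5 * z * z

def posterior_probability(prior, z_scores):
    """Update a prior probability with one z-score per data set.

    Assumes the data sets are independent, so their Bayes factors multiply.
    The minimum Bayes factor favors the null hypothesis, so dividing the
    prior odds by it gives an upper bound on the posterior odds.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for z in z_scores:
        log_odds -= log_min_bayes_factor(z)
    return 1.0 / (1.0 + math.exp(-log_odds))

def rank_candidates(prior, z_scores_by_name):
    """Rank candidates by posterior probability, highest first."""
    scored = [(name, posterior_probability(prior, zs))
              for name, zs in z_scores_by_name.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical candidates with made-up z-scores from two data sets.
example = {"DUB_A": [2.0, 1.0], "DUB_B": [0.5, 0.0], "DUB_C": [1.5, 1.5]}
for name, prob in rank_candidates(0.1, example):
    print(f"{name}: {prob:.3f}")
```

Because a minimum Bayes factor is the smallest factor compatible with the data, the resulting posteriors are best read as optimistic bounds that are useful for ranking, not as calibrated probabilities.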


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Xi Shi ◽  
Gorana Nikolic ◽  
Gorka Epelde ◽  
Mónica Arrúe ◽  
Joseba Bidaurrazaga Van-Dierdonck ◽  
...  

Abstract
Background The increasing prevalence of childhood obesity makes it essential to study its risk factors using a population-representative sample that covers a broad range of health topics, in order to support better preventive policies and interventions. This study aims to develop an ensemble feature selection framework for large-scale data that identifies risk factors of childhood obesity with good interpretability and clinical relevance.
Methods We analyzed data collected from 426,813 children under 18 during 2000–2019. A BMI above the 90th percentile for children of the same age and gender was defined as overweight. An ensemble feature selection framework, the Bagging-based Feature Selection framework integrating MapReduce (BFSMR), was proposed to identify risk factors. The framework comprises 5 models (a filter with mutual information, SVM-RFE, Lasso, Ridge, and Random Forest) drawn from filter, wrapper, and embedded feature selection methods. Each feature selection model identified 10 variables based on variable importance. Considering accuracy, F-score, and model characteristics, the models were classified into 3 levels with different weights: Lasso/Ridge, Filter/SVM-RFE, and Random Forest. A voting strategy was applied to aggregate the selected features, taking both feature weights and model weights into account. We compared our voting strategy with two others for selecting top-ranked features across 6 dimensions of interpretability.
Results Our method performed best at selecting features with good interpretability and clinical relevance. The top 10 features selected by BFSMR are age, sex, birth year, breastfeeding type, smoking habit and diet-related knowledge of both children and mothers, exercise, and mother’s systolic blood pressure.
Conclusion Our framework provides a way to identify a diverse and interpretable feature set from large-scale data without model bias, which can help identify risk factors of childhood obesity, and potentially of other diseases, for future interventions or policies.
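The voting step in this abstract, which combines per-model feature rankings using both feature weights and model weights, might look roughly like the following Python sketch. The abstract does not state the actual weights or the scoring rule, so everything here is an assumption: the linear rank-based feature weight, the numeric model weights, and the function name `weighted_vote` are all hypothetical, and the feature names are placeholders.

```python
from collections import defaultdict

def weighted_vote(selections, model_weights, top_k):
    """Aggregate ranked feature lists from several models into one top-k list.

    selections    -- dict: model name -> features ordered best first
    model_weights -- dict: model name -> weight of that model (assumed values)
    top_k         -- number of features to keep
    """
    scores = defaultdict(float)
    for model, ranked_features in selections.items():
        n = len(ranked_features)
        for rank, feature in enumerate(ranked_features):
            # Assumed feature weight: 1.0 for the top feature, 1/n for the last.
            feature_weight = (n - rank) / n
            scores[feature] += model_weights[model] * feature_weight
    # Sort by descending score; break ties alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ordered[:top_k]]

# Hypothetical rankings from two models with placeholder feature names.
selections = {
    "lasso": ["a", "b", "c"],
    "random_forest": ["c", "a", "d"],
}
model_weights = {"lasso": 3.0, "random_forest": 2.0}
print(weighted_vote(selections, model_weights, top_k=3))
```

Summing model-weighted rank scores lets a feature that several models rank moderately well outscore one that a single low-weight model ranks first, which is one plausible way to reduce dependence on any single model.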


2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Yue Hu ◽  
Ge Peng ◽  
Zehua Wang ◽  
Yanrong Cui ◽  
Hang Qin

As data volumes grow rapidly, the k nearest neighbors (KNN) algorithm becomes a particularly expensive operation for both classification and regression problems. To predict the values of new data points, it calculates the feature similarity between each object in the test dataset and each object in the training dataset. Because of this computational cost, a single computer cannot handle large-scale datasets. In this paper, we propose an adaptive vKNN algorithm, which builds on the Voronoi diagram under the MapReduce parallel framework and makes full use of the advantages of parallel computing in processing large-scale data. In the partition selection process, we design a new predictive strategy that finds the optimal relevant partition for each sample point. This allows us to filter out irrelevant data effectively, reduce the KNN join computation, and improve operating efficiency. Finally, we conduct extensive experiments on a cluster using a large number of 54-dimensional datasets. The experimental results show that the proposed method is effective and scalable while preserving accuracy.
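The idea of partitioning the training data by a Voronoi diagram and skipping partitions that cannot contain a nearer neighbor can be sketched on a single machine as follows. This is not the authors' vKNN algorithm: the MapReduce layer and their predictive partition-selection strategy are omitted, and the pruning rule used here is the standard triangle-inequality bound for pivot-based partitions. All function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math

def build_partitions(points, pivots):
    """Assign each point to its nearest pivot, i.e. to its Voronoi cell."""
    cells = [[] for _ in pivots]
    radii = [0.0] * len(pivots)  # farthest member distance per cell
    for p in points:
        dists = [math.dist(p, c) for c in pivots]
        i = min(range(len(pivots)), key=dists.__getitem__)
        cells[i].append(p)
        radii[i] = max(radii[i], dists[i])
    return cells, radii

def knn(query, k, pivots, cells, radii):
    """Exact k nearest neighbors, skipping cells that cannot improve the result."""
    pivot_dists = [math.dist(query, c) for c in pivots]
    order = sorted(range(len(pivots)), key=pivot_dists.__getitem__)
    best = []  # max-heap of (-distance, point) holding the k best so far
    for i in order:
        # Triangle inequality: every point in cell i is at least this far away.
        lower_bound = pivot_dists[i] - radii[i]
        if len(best) == k and lower_bound > -best[0][0]:
            continue  # the whole cell is irrelevant for this query
        for p in cells[i]:
            d = math.dist(query, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return sorted((-neg_d, p) for neg_d, p in best)

# Deterministic toy data in 2D; the first five points double as pivots.
points = [((i * 37 % 101) / 101, (i * 53 % 103) / 103) for i in range(200)]
pivots = points[:5]
cells, radii = build_partitions(points, pivots)
print(knn((0.3, 0.7), 3, pivots, cells, radii))
```

In a distributed setting each cell would typically live on a different worker, so the pruning test decides which workers a query needs to reach at all.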

