An Optimized Semi-Supervised Learning Approach for High Dimensional Datasets

Author(s):  
Nesma Settouti ◽  
Mostafa El Habib Daho ◽  
Mohammed El Amine Bechar ◽  
Mohammed Amine Chikh

The semi-supervised learning is one of the most interesting fields for research developments in the machine learning domain beyond the scope of supervised learning from data. Medical diagnostic process works mostly in supervised mode, but in reality, we are in the presence of a large amount of unlabeled samples and a small set of labeled examples characterized by thousands of features. This problem is known under the term “the curse of dimensionality”. In this study, we propose, as solution, a new approach in semi-supervised learning that we would call Optim Co-forest. The Optim Co-forest algorithm combines the re-sampling data approach (Bagging Breiman, 1996) with two selection strategies. The first one involves selecting random subset of parameters to construct the ensemble of classifiers following the principle of Co-forest (Li & Zhou, 2007). The second strategy is an extension of the importance measure of Random Forest (RF; Breiman, 2001). Experiments on high dimensional datasets confirm the power of the adopted selection strategies in the scalability of our method.

2019 ◽  
Vol 21 (2) ◽  
pp. 421-428 ◽  
Author(s):  
Alex A Freitas

Abstract An important problem in bioinformatics consists of identifying the most important features (or predictors), among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning–based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The large number of studies in this area has, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson’s paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditional on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson’s paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson’s paradox involving top-ranked predictors are much more common for one of the feature ranking methods.


2018 ◽  
Vol 2018 (15) ◽  
pp. 132-1-1323
Author(s):  
Shijie Zhang ◽  
Zhengtian Song ◽  
G. M. Dilshan P. Godaliyadda ◽  
Dong Hye Ye ◽  
Atanu Sengupta ◽  
...  

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 779
Author(s):  
Ruriko Yoshida

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss its application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector, which contains K vectors in the space.


2016 ◽  
Vol 2016 (1) ◽  
pp. 4-19 ◽  
Author(s):  
Andreas Kurtz ◽  
Hugo Gascon ◽  
Tobias Becker ◽  
Konrad Rieck ◽  
Felix Freiling

Abstract Recently, Apple removed access to various device hardware identifiers that were frequently misused by iOS third-party apps to track users. We are, therefore, now studying the extent to which users of smartphones can still be uniquely identified simply through their personalized device configurations. Using Apple’s iOS as an example, we show how a device fingerprint can be computed using 29 different configuration features. These features can be queried from arbitrary thirdparty apps via the official SDK. Experimental evaluations based on almost 13,000 fingerprints from approximately 8,000 different real-world devices show that (1) all fingerprints are unique and distinguishable; and (2) utilizing a supervised learning approach allows returning users or their devices to be recognized with a total accuracy of 97% over time


Sign in / Sign up

Export Citation Format

Share Document