Cleaning Data Sets with Diagnostic Errors in the High-Dimensional Feature Spaces

Author(s):  
I.A. Borisova ◽  
O.A. Kutnenko

The paper proposes a new approach to data censoring that corrects diagnostic errors in data sets whose samples are described in high-dimensional feature spaces. This case warrants separate treatment because in high-dimensional spaces most outlier detection and data filtering methods, both statistical and metric, stop working. At the same time, for medical diagnostics tasks, given the complexity of the objects and phenomena studied, a large number of descriptive characteristics is the norm rather than the exception. To solve this problem, an approach is proposed that focuses on local similarity between objects belonging to the same class and uses the function of rival similarity (FRiS function) as the similarity measure. To clean the data of misclassified objects efficiently, this approach selects the most informative and relevant low-dimensional feature subspace, in which the separability of the classes after correction is maximal. Class separability here means the similarity of objects of one class to each other and their dissimilarity to objects of the other class. Cleaning the data of class errors can consist both in correcting labels and in removing outlying objects from the data set. The described method was implemented as the FRiS-LCFS algorithm (FRiS Local Censoring with Feature Selection) and tested on model and real biomedical problems, including the diagnosis of prostate cancer from DNA microarray data. The developed algorithm proved competitive with standard methods for filtering data in high-dimensional spaces.
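For illustration, a minimal sketch of censoring with the function of rival similarity is given below. It assumes a small labelled sample and the standard FRiS form (r_rival - r_own)/(r_rival + r_own); the toy data and the leave-one-out neighbour search are illustrative choices, not the FRiS-LCFS algorithm itself.

```python
# Minimal sketch of censoring with the function of rival similarity (FRiS).
# Assumes a small labelled sample X, y; the toy data and the leave-one-out
# neighbour search are illustrative choices, not the authors' exact algorithm.
import numpy as np
from scipy.spatial.distance import cdist

def fris_scores(X, y):
    """FRiS(x) = (r_rival - r_own) / (r_rival + r_own) for every object x,
    where r_own / r_rival are distances to the nearest object of the same /
    the competing class (the object itself is excluded)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # exclude the object itself
    scores = np.empty(len(X))
    for i in range(len(X)):
        r_own = D[i, y == y[i]].min()
        r_rival = D[i, y != y[i]].min()
        scores[i] = (r_rival - r_own) / (r_rival + r_own)
    return scores

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [0.05, 0.1]])
y = np.array([0, 0, 1, 1, 1])            # the last label is deliberately wrong
print(fris_scores(X, y))                  # the mislabelled point gets the most negative score
```

Objects with clearly negative FRiS values look more similar to the rival class and are the natural candidates for relabelling or removal.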

2014 ◽  
Vol 519-520 ◽  
pp. 661-666
Author(s):  
Qing Zhu ◽  
Jie Zhang

Abstract. This paper proposes an incomplete-GEI gait recognition method based on Random Forests. Numerous methods exist for gait recognition, but they all lead to high-dimensional feature spaces. To address this problem, we use the Random Forest algorithm to rank features by importance. To search the subspaces efficiently, we apply a backward feature elimination search strategy, which demonstrates that static areas of a GEI also contain useful information. We then project the selected features onto a low-dimensional feature subspace via the newly proposed two-dimensional locality preserving projections (2DLPP) method, further improving the discriminative power of the extracted features. Experimental results on the CASIA gait database demonstrate the effectiveness of the proposed method.
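The feature-ranking step can be pictured with the short sketch below: a Random Forest scores feature importance and a backward-elimination loop discards the weakest features. The synthetic data, the 10% drop rate and the target subspace size of 50 are assumptions, and the 2DLPP projection from the paper is not reproduced.

```python
# Illustrative sketch of the feature-ranking step only: rank features with a
# Random Forest and greedily drop the least important ones (backward elimination).
# The 2DLPP projection used in the paper is not reproduced here; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))            # stand-in for flattened GEI features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

selected = np.arange(X.shape[1])
while len(selected) > 50:                  # target subspace size is an assumption
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X[:, selected], y)
    order = np.argsort(rf.feature_importances_)
    keep = order[len(order) // 10:]        # drop the least important 10% each pass
    selected = selected[keep]

print(len(selected), "features kept, e.g. indices", sorted(selected[:10]))
```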


2021 ◽  
Author(s):  
Blaž Škrlj ◽  
Sašo Džeroski ◽  
Nada Lavrač ◽  
Matej Petković

Abstract Feature ranking has been widely adopted in machine learning applications such as high-throughput biology and social sciences. The approaches of the popular Relief family of algorithms assign importances to features by iteratively accounting for nearest relevant and irrelevant instances. Despite their high utility, these algorithms can be computationally expensive and not well suited for high-dimensional sparse input spaces. In contrast, recent embedding-based methods learn compact, low-dimensional representations, potentially facilitating downstream learning capabilities of conventional learners. This paper explores how the Relief branch of algorithms can be adapted to benefit from (Riemannian) manifold-based embeddings of instance and target spaces, where a given embedding’s dimensionality is intrinsic to the dimensionality of the considered data set. The developed ReliefE algorithm is faster and can result in better feature rankings, as shown by our evaluation on 20 real-life data sets for multi-class and multi-label classification tasks. The utility of ReliefE for high-dimensional data sets is ensured by its implementation, which uses sparse matrix algebraic operations. Finally, the relation of ReliefE to other ranking algorithms is studied via the Fuzzy Jaccard Index.
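A toy sketch of the underlying idea, embedding-assisted Relief, follows: neighbours are located in a low-dimensional embedding while feature weights are still updated in the original space. TruncatedSVD stands in for the manifold-based embedding used by ReliefE, so this is only a rough analogue, not the ReliefE implementation.

```python
# Toy sketch of embedding-assisted Relief: find each instance's nearest hit/miss
# in a low-dimensional embedding, but update feature weights in the original
# space. TruncatedSVD stands in for the manifold embedding used in the paper;
# this is not the ReliefE implementation.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from scipy.spatial.distance import cdist

def relief_with_embedding(X, y, n_components=8):
    Z = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(X)
    D = cdist(Z, Z)
    np.fill_diagonal(D, np.inf)
    w = np.zeros(X.shape[1])
    for i in range(len(X)):
        hit = np.argmin(np.where(y == y[i], D[i], np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(y != y[i], D[i], np.inf))   # nearest other-class instance
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / len(X)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 100))
y = (X[:, 0] > 0).astype(int)              # only feature 0 is informative
print(np.argsort(relief_with_embedding(X, y))[-5:])   # feature 0 is expected to rank near the top
```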


2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Anomaly detection has recently attracted growing attention from data mining researchers, as its reputation has risen steadily across practical domains such as product marketing, fraud detection, medical diagnosis and fault detection. High-dimensional data subjected to outlier detection poses exceptional challenges for data mining experts because of the curse of dimensionality and the growing resemblance of distant and adjoining points. Traditional algorithms and techniques perform outlier detection on the full feature space. Such customary methodologies concentrate largely on low-dimensional data and are therefore ineffective at discovering anomalies in data sets comprised of a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes very difficult and tiresome when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property of such data: the distance between observations approaches zero as the number of dimensions tends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research towards resolving inherent problems of high-dimensional data, where outliers reside within clusters of different densities. A high-dimensional data set from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
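The general recipe, combining a per-point deviation score with a density-based detector, can be sketched as follows. The median-based deviation, the use of LOF and the multiplicative fusion are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch of the general recipe described above: compute a simple
# per-point deviation score and combine it with a density-based detector
# (LOF here). The exact way the paper embeds deviations into the density-based
# technique is not reproduced; this only illustrates the combination.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(200, 50)),
               rng.normal(6, 1, size=(5, 50))])              # 5 planted outliers

deviation = np.abs(X - np.median(X, axis=0)).mean(axis=1)    # crude deviation score
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
density_score = -lof.negative_outlier_factor_                # larger = more outlying

combined = deviation * density_score                          # one simple fusion rule
print(np.argsort(combined)[-5:])   # the planted points (indices 200..204) should surface
```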


Author(s):  
Andrew J. Connolly ◽  
Jacob T. VanderPlas ◽  
Alexander Gray ◽  
...  

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the question of the complexity as well as the size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques related to the earlier discussions of Gaussian distributions, density estimation, and information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter, and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques that address some of the weaknesses of PCA.
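A minimal PCA example in the spirit of this chapter is shown below; synthetic data with three underlying degrees of freedom stand in for the astronomical data sets discussed in the text.

```python
# Minimal PCA sketch: project a high-dimensional data set onto the few principal
# components that carry most of the variance. Synthetic data stand in for the
# astronomical data sets used in the chapter.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
latent = rng.normal(size=(1000, 3))                 # 3 true degrees of freedom
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(1000, 50))

pca = PCA(n_components=10).fit(X)
print(np.round(pca.explained_variance_ratio_, 3))   # variance concentrates in 3 PCs
X_reduced = pca.transform(X)[:, :3]                 # compact 3-D representation
```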



Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 717
Author(s):  
Garba Abdulrauf Sharifai ◽  
Zurinahni Zainol

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High-dimensional and imbalanced data sets, such as biomedical data sets, have posed severe challenges in many real-world applications. Numerous researchers have investigated either imbalanced classes or high-dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the high-dimensional and imbalanced class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique used to overcome this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary grasshopper optimisation algorithm (BGOA) is used to cast the feature selection process as an optimisation problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high-dimensional and imbalanced data sets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
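The filtering stage can be illustrated with the sketch below: several univariate filters are combined into one ensemble ranking, and features strongly correlated with already-kept ones are discarded. The choice of filters, the rank aggregation and the thresholds are assumptions, and the binary grasshopper optimisation step of rCBR-BGOA is not reproduced.

```python
# Sketch of the filtering idea only: combine several univariate filters into one
# ensemble ranking, then discard features that are highly correlated with
# already-kept ones. The binary grasshopper optimisation step of rCBR-BGOA is
# not reproduced here; thresholds are illustrative.
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

def ensemble_filter(X, y, n_keep=20, corr_threshold=0.9):
    ranks = []
    for score in (lambda X, y: f_classif(X, y)[0],
                  lambda X, y: mutual_info_classif(X, y, random_state=0)):
        s = score(X, y)
        ranks.append(np.argsort(np.argsort(-s)))     # rank per filter (0 = best)
    mean_rank = np.mean(ranks, axis=0)

    kept = []
    for j in np.argsort(mean_rank):                  # best-ranked features first
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_threshold
               for k in kept):                       # redundancy check
            kept.append(j)
        if len(kept) == n_keep:
            break
    return kept

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 200))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
print(ensemble_filter(X, y)[:5])
```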


2011 ◽  
Vol 1 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Roland Winkler ◽  
Frank Klawonn ◽  
Rudolf Kruse

High dimensions have a devastating effect on the FCM algorithm and similar algorithms. One effect is that the prototypes run into the centre of gravity of the entire data set. The objective function must have a local minimum in the centre of gravity that causes this behaviour of FCM. This paper examines the problem and answers the following questions: How many dimensions are necessary to cause ill behaviour of FCM? How does the number of prototypes influence the behaviour? Why does the objective function have a local minimum in the centre of gravity? How must FCM be initialised to avoid the local minimum in the centre of gravity? To understand the behaviour of the FCM algorithm and answer these questions, the authors examine the values of the objective function and develop three test environments consisting of artificially generated data sets to provide a controlled setting. The paper concludes that FCM can only be applied successfully in high dimensions if the prototypes are initialised very close to the cluster centres.
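A small numeric illustration of the collapse effect is given below: two clearly separated clusters occupy the first two coordinates, and as pure-noise dimensions are added, a plain textbook FCM (fuzzifier m = 2, prototypes initialised at random data points) drives both prototypes towards the centre of gravity. The data set and parameters are illustrative choices, not the authors' test environments.

```python
# Textbook fuzzy c-means (m = 2) on two separated clusters plus noise dimensions:
# with few noise dimensions the prototypes stay far from the centre of gravity,
# with many noise dimensions they collapse onto it. Parameters are illustrative.
import numpy as np

def fcm(X, c=2, m=2.0, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)].copy()    # prototypes = random data points
    for _ in range(iters):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U = 1.0 / d2 ** (1.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)                 # membership degrees
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]            # prototype update
    return V

rng = np.random.default_rng(1)
for n_noise in (10, 300):
    signal = np.vstack([rng.normal(-3, 1, size=(200, 2)),
                        rng.normal(+3, 1, size=(200, 2))])
    X = np.hstack([signal, rng.normal(0, 1, size=(400, n_noise))])
    V = fcm(X)
    dist_to_cog = np.linalg.norm(V - X.mean(axis=0), axis=1)
    print(n_noise, "noise dims -> prototype distances to centre of gravity:",
          np.round(dist_to_cog, 2))
# With few noise dimensions the prototypes stay far from the centre of gravity
# (near the cluster structure); with many noise dimensions they collapse onto it.
```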


Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 126 ◽  
Author(s):  
Feiyang Chen ◽  
Ying Jiang ◽  
Xiangrui Zeng ◽  
Jing Zhang ◽  
Xin Gao ◽  
...  

Salient segmentation is a critical step in biomedical image analysis, aiming to cut out the regions that are most interesting to humans. Recently, supervised methods have achieved promising results in biomedical areas, but they depend on annotated training data sets, which require labor and proficiency in the related background knowledge. In contrast, unsupervised learning makes data-driven decisions by obtaining insights directly from the data themselves. In this paper, we propose a completely unsupervised self-aware network based on pre-training and attentional backpropagation for biomedical salient segmentation, named PUB-SalNet. Firstly, we aggregate a new biomedical data set, called SalSeg-CECT, from several simulated Cellular Electron Cryo-Tomography (CECT) data sets featuring rich salient objects, different SNR settings, and various resolutions. Based on the SalSeg-CECT data set, we then pre-train a model specially designed for biomedical tasks as a backbone module to initialize the network parameters. Next, we present a U-SalNet network that learns to selectively attend to salient objects. It includes two types of attention modules to facilitate learning saliency through global contrast and local similarity. Lastly, we jointly refine the salient regions together with feature representations from U-SalNet, with the parameters updated by self-aware attentional backpropagation. We apply PUB-SalNet to the analysis of 2D simulated and real images and achieve state-of-the-art performance on simulated biomedical data sets. Furthermore, the proposed PUB-SalNet can easily be extended to 3D images. Experimental results on the 2D and 3D data sets also demonstrate the generalization ability and robustness of our method.
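As a very loose illustration of the "global contrast" notion of saliency mentioned above, the toy sketch below scores pixels by how far their intensity deviates from the image-wide mean. This is a classic heuristic only; it is not the U-SalNet attention module or any part of PUB-SalNet.

```python
# Toy global-contrast saliency: pixels whose intensity deviates most from the
# image-wide mean are treated as salient. A classic heuristic, not U-SalNet.
import numpy as np

def global_contrast_saliency(img):
    """Saliency of each pixel = absolute deviation from the global mean intensity,
    rescaled to [0, 1]."""
    s = np.abs(img - img.mean())
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

rng = np.random.default_rng(6)
img = rng.normal(0.2, 0.05, size=(64, 64))    # dim background
img[20:30, 20:30] = 0.9                        # one bright "particle"
sal = global_contrast_saliency(img)
print(sal[25, 25] > sal[5, 5])                 # the bright region scores higher: True
```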


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
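A hedged sketch of the two-nearest-neighbour idea behind local intrinsic dimension estimates follows: the ratio of the distances to a point's first and second neighbours carries the ID, and a simple maximum-likelihood variant is applied inside each point's neighbourhood. The estimator form and the neighbourhood size are assumptions; this is not the authors' full segmentation procedure.

```python
# Crude local intrinsic-dimension estimate from the ratio mu = r2/r1 of the
# distances to each point's first and second nearest neighbours, applied inside
# a k-neighbourhood. Not the authors' full segmentation procedure.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_id(X, k=50):
    """For each point, apply the TWO-NN-style MLE d = n / sum(log(r2 / r1))
    to its k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)                 # column 0 is the point itself
    mu = dist[:, 2] / dist[:, 1]                 # r2 / r1 per point
    log_mu = np.log(mu)
    ids = np.empty(len(X))
    for i in range(len(X)):
        neigh = idx[i, 1:]                       # the point's neighbourhood
        ids[i] = len(neigh) / log_mu[neigh].sum()
    return ids

rng = np.random.default_rng(7)
low = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 20))   # ~2-D manifold in 20-D
high = rng.normal(size=(500, 20))                            # genuinely 20-D region
X = np.vstack([low, high + 10.0])                            # two regions, one data set
ids = local_id(X)
print(np.median(ids[:500]).round(1), np.median(ids[500:]).round(1))
# the first region gets a much lower local ID than the second
```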


Author(s):  
Bharat Gupta ◽  
Durga Toshniwal

In high-dimensional data, many outliers are embedded in low-dimensional subspaces; these are known as projected outliers, but most existing outlier detection techniques are unable to find them because they detect abnormal patterns in the full data space. Outlier detection in high-dimensional data therefore becomes an important research problem. In this paper we propose an approach for outlier detection in high-dimensional data. We modify the existing SPOT approach by adding three new concepts, namely adaptation of the Sparse Sub-Space Template (SST), different combinations of PCS parameters, and a set of non-outlying cells for the test data set.
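For orientation, the sketch below illustrates the projected-outlier idea that this line of work builds on: low-dimensional projections (here all two-dimensional feature pairs) are gridded into cells, and points that repeatedly fall into sparsely populated cells are flagged. The grid resolution and sparsity threshold are assumptions; the SST and PCS machinery of the modified SPOT approach is not reproduced.

```python
# Hedged sketch of the projected-outlier idea: examine 2-D projections, grid each
# projection into cells, and flag points that repeatedly fall into sparse cells.
# The SST / PCS machinery of the proposed approach is not reproduced here.
import numpy as np
from itertools import combinations

def projected_outlier_scores(X, bins=5, sparse_count=2):
    n, d = X.shape
    scores = np.zeros(n)
    for i, j in combinations(range(d), 2):        # every 2-D subspace
        # discretise both features into equi-width cells
        ci = np.digitize(X[:, i], np.linspace(X[:, i].min(), X[:, i].max(), bins + 1)[1:-1])
        cj = np.digitize(X[:, j], np.linspace(X[:, j].min(), X[:, j].max(), bins + 1)[1:-1])
        cell = ci * bins + cj
        counts = np.bincount(cell, minlength=bins * bins)
        scores += counts[cell] <= sparse_count    # +1 when the point's cell is sparse
    return scores

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 8))
X[0, 3:5] = 6.0                                   # outlying only in subspace (3, 4)
print(np.argmax(projected_outlier_scores(X)))      # point 0 scores highest
```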

