Fuzzy C-Means in High Dimensional Spaces

2011 ◽  
Vol 1 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Roland Winkler ◽  
Frank Klawonn ◽  
Rudolf Kruse

High dimensions have a devastating effect on the FCM algorithm and similar algorithms. One effect is that the prototypes run into the centre of gravity of the entire data set. This behaviour indicates that the objective function must have a local minimum at the centre of gravity. In this paper, the authors examine this problem and answer the following questions: How many dimensions are necessary to cause this ill behaviour of FCM? How does the number of prototypes influence the behaviour? Why does the objective function have a local minimum at the centre of gravity? How must FCM be initialised to avoid this local minimum? To understand the behaviour of the FCM algorithm and answer the above questions, the authors examine the values of the objective function and develop three test environments consisting of artificially generated data sets, which provide a controlled setting. The paper concludes that FCM can only be applied successfully in high dimensions if the prototypes are initialised very close to the cluster centres.
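For intuition about the behaviour described above, the following minimal sketch (not the authors' code) evaluates the standard FCM objective function so that its value with all prototypes placed at the centre of gravity can be compared against its value at the true cluster centres; the generated data, the fuzzifier m = 2, and all variable names are illustrative assumptions.

```python
import numpy as np

def fcm_objective(data, prototypes, m=2.0, eps=1e-12):
    """FCM objective J = sum_ij u_ij^m * ||x_i - c_j||^2, with the memberships
    u_ij set to their closed-form optimum for the given prototypes."""
    # squared Euclidean distances between every data point and every prototype
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1) + eps
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); with squared distances the
    # exponent becomes 1/(m-1)
    u = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))).sum(axis=2)
    return float((u ** m * d2).sum())

# Hypothetical comparison: objective value with all prototypes at the centre of
# gravity versus at the true cluster centres, for growing dimensionality.
rng = np.random.default_rng(0)
for dim in (2, 10, 100):
    centres = 5.0 * rng.normal(size=(5, dim))
    data = np.concatenate([c + rng.normal(size=(50, dim)) for c in centres])
    cog = np.tile(data.mean(axis=0), (5, 1))
    print(dim, fcm_objective(data, cog), fcm_objective(data, centres))
```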


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5199
Author(s):  
Wanli Zhang ◽  
Yanming Di

The accumulation of RNA sequencing (RNA-Seq) gene expression data in recent years has resulted in large and complex data sets of high dimension. Exploratory analysis, including data mining and visualization, reveals hidden patterns and potential outliers in such data, but is often challenged by the high dimensional nature of the data. The scatterplot matrix is a commonly used tool for visualizing multivariate data and allows us to view multiple bivariate relationships simultaneously. However, the scatterplot matrix becomes less effective for high dimensional data because the number of bivariate displays grows quadratically with the data dimensionality. In this study, we introduce a selection criterion for each bivariate scatterplot and design and implement an algorithm that automatically scans and ranks all possible scatterplots, with the goal of identifying the plots in which the separation between two pre-defined groups is maximized. By applying our method to a multi-experiment Arabidopsis RNA-Seq data set, we were able to successfully pinpoint the visualization angles where genes from two biological pathways are the most separated, as well as identify potential outliers.
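The paper's exact selection criterion is not reproduced in the abstract; the sketch below therefore uses a simple stand-in score (distance between group centroids divided by the pooled within-group spread) to illustrate the general scan-and-rank idea over all bivariate scatterplots. The function and variable names are hypothetical.

```python
import itertools
import numpy as np

def rank_scatterplots(X, groups, top=10):
    """Score every bivariate scatterplot (pair of columns of X) by how well it
    separates two pre-defined groups, and return the best-scoring pairs.
    The score (centroid gap over pooled within-group spread) is a stand-in,
    not the criterion from the paper."""
    X = np.asarray(X, dtype=float)
    g = np.asarray(groups)
    labels = np.unique(g)                      # assumes exactly two groups
    a, b = X[g == labels[0]], X[g == labels[1]]
    scores = []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        pa, pb = a[:, [i, j]], b[:, [i, j]]
        gap = np.linalg.norm(pa.mean(axis=0) - pb.mean(axis=0))
        spread = np.sqrt(pa.var(axis=0).sum() + pb.var(axis=0).sum())
        scores.append((gap / (spread + 1e-12), i, j))
    return sorted(scores, reverse=True)[:top]  # (score, column i, column j)
```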


MATEMATIKA ◽  
2020 ◽  
Vol 36 (1) ◽  
pp. 43-49
Author(s):  
T Dwi Ary Widhianingsih ◽  
Heri Kuswanto ◽  
Dedy Dwi Prastyo

Logistic regression is one of the most commonly used classification methods. It has some advantages, specifically related to hypothesis testing and its objective function. However, it also has some disadvantages in the case of high-dimensional data, such as multicollinearity, over-fitting, and a high computational burden. Ensemble-based classification methods have been proposed to overcome these problems. The logistic regression ensemble (LORENS) method is expected to improve the classification performance of basic logistic regression. In this paper, we apply it to a drug discovery problem whose objective is to obtain candidate compounds that protect normal, non-cancerous cells, a problem with a data set of high dimensionality. The experimental results show that the method performs well, with an accuracy of 69% and an AUC of 0.7306.
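As a rough illustration of the ensemble idea behind LORENS, the sketch below fits one logistic regression per random, mutually exclusive block of features and averages the predicted probabilities; the number of blocks, the combination rule, and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lorens_fit_predict(X_train, y_train, X_test, n_blocks=10, seed=0):
    """Sketch of a logistic regression ensemble: split the features into
    mutually exclusive random blocks, fit one logistic regression per block,
    average the predicted probabilities, and threshold at 0.5.
    Inputs are assumed to be NumPy arrays with binary labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X_train.shape[1])
    blocks = np.array_split(idx, n_blocks)
    probs = np.zeros(X_test.shape[0])
    for block in blocks:
        clf = LogisticRegression(max_iter=1000).fit(X_train[:, block], y_train)
        probs += clf.predict_proba(X_test[:, block])[:, 1]
    probs /= len(blocks)
    return (probs >= 0.5).astype(int), probs
```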


Author(s):  
Andrew J. Connolly ◽  
Jacob T. VanderPlas ◽  
Alexander Gray ◽  
...  

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the complexity as well as the size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques related to concepts introduced earlier in the discussion of Gaussian distributions, density estimation, and information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques that address some of the weaknesses of PCA.
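Since PCA is the centrepiece of the chapter, a minimal NumPy sketch of PCA via the singular value decomposition of the centred data matrix may help; it is not the book's own code, and the function name and return values are illustrative.

```python
import numpy as np

def pca(X, n_components=2):
    """Principal component analysis via SVD of the centred data matrix.
    Returns the projected data and the fraction of variance explained by
    each retained component."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()         # variance fractions
    return Xc @ Vt[:n_components].T, explained[:n_components]
```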


Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 717
Author(s):  
Garba Abdulrauf Sharifai ◽  
Zurinahni Zainol

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes even more demanding when the samples are limited but the number of features is massive (high dimensionality). High dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced classes or high dimensional data sets and have proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the high dimensional and imbalanced class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA). rCBR-BGOA employs an ensemble of multi-filters coupled with the correlation-based redundancy method to select optimal feature subsets. A binary grasshopper optimisation algorithm (BGOA) is used to formulate the feature selection process as an optimisation problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced data sets in terms of the G-mean and the Area Under the Curve (AUC) performance metrics.
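The full rCBR-BGOA pipeline is not reproduced here; the sketch below only illustrates how a candidate binary feature mask might be scored with the G-mean metric reported in the paper, using an arbitrary classifier and cross-validation scheme as stand-ins.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def gmean_fitness(mask, X, y):
    """Score a candidate binary feature mask by the geometric mean of the
    per-class recalls (G-mean) under cross-validation. The kNN classifier and
    5-fold scheme are illustrative choices, not the paper's setup; y is
    assumed to be binary."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    pred = cross_val_predict(KNeighborsClassifier(), X[:, cols], y, cv=5)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    sens = tp / (tp + fn + 1e-12)   # recall of the positive class
    spec = tn / (tn + fp + 1e-12)   # recall of the negative class
    return np.sqrt(sens * spec)
```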


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
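The paper's segmentation method builds on nearest-neighbour estimates of the intrinsic dimension; as a rough illustration, the sketch below computes a single global TWO-NN-style ID estimate from the ratio of first- and second-nearest-neighbour distances. It is not the authors' model, which additionally segments points with heterogeneous local IDs.

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    """Global intrinsic-dimension estimate from the ratios mu = r2/r1 of each
    point's distances to its second and first nearest neighbours (maximum
    likelihood under the TWO-NN model). Assumes no duplicate points, so that
    the first-neighbour distance is strictly positive."""
    dists, _ = cKDTree(X).query(X, k=3)   # columns: self, 1st NN, 2nd NN
    mu = dists[:, 2] / dists[:, 1]
    return len(mu) / np.log(mu).sum()
```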


Author(s):  
I.A. Borisova ◽  
O.A. Kutnenko

The paper proposes a new approach to data censoring that allows correcting diagnostic errors in data sets when the samples are described in high-dimensional feature spaces. Treating this case as a separate task is justified by the fact that in high-dimensional spaces most methods of outlier detection and data filtering, both statistical and metric, stop working. At the same time, for medical diagnostics tasks, given the complexity of the objects and phenomena studied, a large number of descriptive characteristics is the norm rather than the exception. To solve this problem, an approach has been proposed that focuses on local similarity between objects belonging to the same class and uses the function of rival similarity (FRiS function) as the similarity measure. In this approach, to clean the data of misclassified objects efficiently, the most informative and relevant low-dimensional feature subspace is selected, in which the separability of the classes after correction is maximal. Class separability here means the similarity of objects of one class to each other and their dissimilarity to objects of the other class. Cleaning the data of class errors can consist both of correcting labels and of removing outlier objects from the data set. The described method was implemented as the FRiS-LCFS algorithm (FRiS Local Censoring with Feature Selection) and tested on synthetic and real biomedical problems, including the problem of diagnosing prostate cancer from DNA microarray data. The developed algorithm proved competitive with standard methods for filtering data in high-dimensional spaces.
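As a small illustration of the similarity measure the method is built on, the sketch below computes the FRiS value of an object with respect to a candidate and a rival, assuming Euclidean distances; the censoring and feature-selection logic of FRiS-LCFS itself is not reproduced, and all names are illustrative.

```python
import numpy as np

def fris(z, a, b):
    """Function of rival similarity (FRiS): similarity of object z to candidate
    a in competition with rival b, using Euclidean distances. Values lie in
    [-1, 1] and are positive when z is closer to a than to its rival."""
    da = np.linalg.norm(z - a)
    db = np.linalg.norm(z - b)
    return (db - da) / (db + da + 1e-12)
```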


Geophysics ◽  
1998 ◽  
Vol 63 (1) ◽  
pp. 213-222 ◽  
Author(s):  
L. Neil Frazer ◽  
Xinhua Sun

Inversion is an organized search for parameter values that maximize or minimize an objective function, referred to here as a processor. This note derives three new seismic processors that require neither prior deconvolution nor knowledge of the source-receiver wavelet. The most powerful of these is the fourwise processor, as it is applicable to data sets from multiple shots and receivers even when each shot has a different unknown signature and each receiver has a different unknown impulse response. Somewhat less powerful than the fourwise processor is the pairwise processor, which is applicable to a data set consisting of two or more traces with the same unknown wavelet but possibly different gains. When only one seismogram exists, the partition processor can be used. The partition processor is also applicable when there is only one shot (receiver) and each receiver (shot) has a different signature. In fourwise and pairwise inversions the unknown wavelets may be arbitrarily long in time and need not be minimum phase. In partition inversion the wavelet is assumed to be shorter in time than the data trace itself but is not otherwise restricted. None of the methods requires assumptions about the Green's function.
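As an illustration of the opening definition only (not of the fourwise, pairwise, or partition processors derived in the paper), the toy sketch below treats inversion as an organized search for the parameters of a simple decaying-exponential model that maximize a processor, here the negative misfit to synthetic data; the model and all names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic "observed" data from a two-parameter model plus noise.
t = np.linspace(0.0, 1.0, 200)
observed = 3.0 * np.exp(-5.0 * t) + 0.01 * np.random.default_rng(1).normal(size=t.size)

def processor(params):
    """Objective to be maximized: negative squared misfit between the
    observed data and the model prediction."""
    amplitude, decay = params
    predicted = amplitude * np.exp(-decay * t)
    return -np.sum((observed - predicted) ** 2)

# Maximize the processor by minimizing its negative over the parameters.
result = minimize(lambda p: -processor(p), x0=[1.0, 1.0])
print(result.x)   # recovered (amplitude, decay)
```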


Author(s):  
Tushar ◽  
Shibendu Shekhar Roy ◽  
Dilip Kumar Pratihar

Clustering is a potential tool of data mining. A clustering method analyzes the pattern of a data set and groups the data into several clusters based on the similarity among the data points. Clusters may be either crisp or fuzzy in nature. The present chapter deals with clustering of some data sets using the Fuzzy C-Means (FCM) algorithm and the Entropy-based Fuzzy Clustering (EFC) algorithm. In the FCM algorithm, the nature and quality of the clusters depend on the pre-defined number of clusters, the level of cluster fuzziness, and a threshold value utilized for determining the number of outliers (if any). On the other hand, the quality of the clusters obtained by the EFC algorithm depends on a constant used to establish the relationship between the distance and the similarity of two data points, a threshold value of similarity, and another threshold value used for determining the number of outliers. The clusters should ideally be distinct and, at the same time, compact in nature. Moreover, the number of outliers should be as small as possible. Thus, the above problem may be posed as an optimization problem, which is solved here using a Genetic Algorithm (GA). The best set of multi-dimensional clusters is then mapped into 2-D for visualization using a Self-Organizing Map (SOM).
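As a rough illustration of the entropy computation at the heart of the EFC algorithm, the sketch below assumes the commonly used form S = exp(-alpha * d) with alpha tied to the mean pairwise distance; the similarity and outlier thresholds, the GA optimization, and the SOM mapping described in the chapter are omitted, and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def efc_entropy(X):
    """Entropy values in the style of entropy-based fuzzy clustering: pairwise
    similarity S = exp(-alpha * d), with alpha chosen so the mean distance maps
    to similarity 0.5, and entropy E_i summed over all other points. The point
    with minimum entropy is the most 'central' cluster-centre candidate."""
    d = squareform(pdist(X))
    alpha = -np.log(0.5) / d[d > 0].mean()
    s = np.exp(-alpha * d)
    np.fill_diagonal(s, 0.0)                    # exclude self-similarity
    s = np.clip(s, 1e-12, 1.0 - 1e-12)
    e = -(s * np.log2(s) + (1.0 - s) * np.log2(1.0 - s)).sum(axis=1)
    return e                                    # np.argmin(e) -> first candidate centre
```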


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of the analysis of high dimensional data sets. The raw input data set may have many dimensions, and the analysis may be time consuming and lead to wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensionality of the input data and move towards accurate prediction at lower cost. In this paper, different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis (KPCA), and artificial neural networks, are studied.
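A side-by-side sketch of several of the surveyed techniques, using scikit-learn on an arbitrary example data set; the digits data and the choice of two output dimensions are illustrative, and the artificial-neural-network approach (typically an autoencoder) is omitted.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                             # linear, variance-maximizing
X_svd = TruncatedSVD(n_components=2).fit_transform(X)                    # SVD without mean-centring
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised, uses labels
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)        # non-linear kernel variant

for name, Z in [("PCA", X_pca), ("SVD", X_svd), ("LDA", X_lda), ("KPCA", X_kpca)]:
    print(name, Z.shape)
```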

