A Fast Clustering Algorithm for Large-scale and High Dimensional Data

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.

Download Full-text

A meta-heuristic density-based subspace clustering algorithm for high-dimensional data

Soft Computing ◽

10.1007/s00500-021-05973-1 ◽

2021 ◽

Author(s):

Parul Agarwal ◽

Shikha Mehta ◽

Ajith Abraham

Keyword(s):

Clustering Algorithm ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional

Download Full-text

Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

Scientific Programming ◽

10.1155/2015/180214 ◽

2015 ◽

Vol 2015 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Sai Kiranmayee Samudrala ◽

Jaroslaw Zola ◽

Srinivas Aluru ◽

Baskar Ganapathysubramanian

Keyword(s):

Dimensionality Reduction ◽

Organic Solar Cells ◽

Large Scale ◽

Parallel Implementation ◽

High Dimensional Data ◽

Real Life ◽

Processing Parameters ◽

High Dimensional ◽

Morphology Evolution ◽

Reduction Techniques

Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.

Download Full-text

Scalable hierarchical clustering by composition rank vector encoding and tree structure

10.1101/2020.04.12.038026 ◽

2020 ◽

Author(s):

Xiao Lai ◽

Pu Tian

Keyword(s):

Machine Learning ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Machine Learning Algorithms ◽

Tree Structure ◽

Supervised Machine Learning ◽

High Dimensional ◽

Rank Vector ◽

Nonlinear Correlations

AbstractSupervised machine learning, especially deep learning based on a wide variety of neural network architectures, have contributed tremendously to fields such as marketing, computer vision and natural language processing. However, development of un-supervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering of high dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility with clustering of protein structural domains. No record comparison, which is an expensive and essential common step to all present clustering algorithms, is involved. Consequently, it achieves linear time and space computational complexity hierarchical clustering, thus applicable to arbitrarily large datasets. The key factor in this algorithm is definition of composition, which is dependent upon physical nature of target data and therefore need to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high dimensional data with strong nonlinear correlations. We hope this algorithm to inspire a rich research field of encoding based clustering well beyond composition rank vector trees.

Download Full-text

Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review

Ingénierie des systèmes d information ◽

10.18280/isi.260107 ◽

2021 ◽

Vol 26 (1) ◽

pp. 67-77

Author(s):

Siva Sankari Subbiah ◽

Jayakumar Chinnappan

Keyword(s):

Feature Selection ◽

Big Data ◽

Large Scale ◽

High Dimensional Data ◽

Research Work ◽

Basic Feature ◽

High Dimensional ◽

Selection Methods ◽

Fast Development ◽

Improved Accuracy

Now a day, all the organizations collecting huge volume of data without knowing its usefulness. The fast development of Internet helps the organizations to capture data in many different formats through Internet of Things (IoT), social media and from other disparate sources. The dimension of the dataset increases day by day at an extraordinary rate resulting in large scale dataset with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing the high dimensional data with reduced complexity and improved accuracy. In the modern big data world the feature selection has a significance in reducing the dimensionality and overfitting of the learning process. Many feature selection methods have been proposed by researchers for obtaining more relevant features especially from the big datasets that helps to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, challenges of feature selection and provides the summary of the related research work done by various researchers. As a result, the big data analysis with the feature selection improves the accuracy of the learning.

Download Full-text