Sparse Learning for Large-Scale and High-Dimensional Data: A Randomized Convex-Concave Optimization Approach

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.

Download Full-text

Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

Scientific Programming ◽

10.1155/2015/180214 ◽

2015 ◽

Vol 2015 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Sai Kiranmayee Samudrala ◽

Jaroslaw Zola ◽

Srinivas Aluru ◽

Baskar Ganapathysubramanian

Keyword(s):

Dimensionality Reduction ◽

Organic Solar Cells ◽

Large Scale ◽

Parallel Implementation ◽

High Dimensional Data ◽

Real Life ◽

Processing Parameters ◽

High Dimensional ◽

Morphology Evolution ◽

Reduction Techniques

Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.

Download Full-text

Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review

Ingénierie des systèmes d information ◽

10.18280/isi.260107 ◽

2021 ◽

Vol 26 (1) ◽

pp. 67-77

Author(s):

Siva Sankari Subbiah ◽

Jayakumar Chinnappan

Keyword(s):

Feature Selection ◽

Big Data ◽

Large Scale ◽

High Dimensional Data ◽

Research Work ◽

Basic Feature ◽

High Dimensional ◽

Selection Methods ◽

Fast Development ◽

Improved Accuracy

Now a day, all the organizations collecting huge volume of data without knowing its usefulness. The fast development of Internet helps the organizations to capture data in many different formats through Internet of Things (IoT), social media and from other disparate sources. The dimension of the dataset increases day by day at an extraordinary rate resulting in large scale dataset with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing the high dimensional data with reduced complexity and improved accuracy. In the modern big data world the feature selection has a significance in reducing the dimensionality and overfitting of the learning process. Many feature selection methods have been proposed by researchers for obtaining more relevant features especially from the big datasets that helps to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, challenges of feature selection and provides the summary of the related research work done by various researchers. As a result, the big data analysis with the feature selection improves the accuracy of the learning.

Download Full-text

Large-scale distributed and scalable SOM-based architecture for high-dimensional data reduction

AI for Emerging Verticals: Human-robot computing, sensing and networking ◽

10.1049/pbpc034e_ch16 ◽

2020 ◽

pp. 315-336

Keyword(s):

Data Reduction ◽

Large Scale ◽

High Dimensional Data ◽

High Dimensional

Download Full-text

Visualizing Large-scale and High-dimensional Data

Proceedings of the 25th International Conference on World Wide Web - WWW '16 ◽

10.1145/2872427.2883041 ◽

2016 ◽

Cited By ~ 89

Author(s):

Jian Tang ◽

Jingzhou Liu ◽

Ming Zhang ◽

Qiaozhu Mei

Keyword(s):

Large Scale ◽

High Dimensional Data ◽

High Dimensional

Download Full-text

Natural scene representation for visualizing high-dimensional data in a large-scale time-critical environment

10.1117/12.270316 ◽

1997 ◽

Author(s):

Gary K. Ng ◽

Carole A. Goble ◽

Adrian J. West

Keyword(s):

Large Scale ◽

High Dimensional Data ◽

High Dimensional ◽

Natural Scene ◽

Scene Representation ◽

Time Critical

Download Full-text

Unsupervised Learning for Large Scale Data: The ATHLOS Project

10.1101/2021.04.01.21254751 ◽

2021 ◽

Author(s):

Petros Barmpas ◽

Sotiris Tasoulis ◽

Aristidis G. Vrahatis ◽

Panagiotis Anagnostou ◽

Spiros Georgakopoulos ◽

...

Keyword(s):

Unsupervised Learning ◽

Real World ◽

Large Scale ◽

High Dimensional Data ◽

Experimental Studies ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Types ◽

Unified Framework

1AbstractRecent technological advancements in various domains, such as the biomedical and health, offer a plethora of big data for analysis. Part of this data pool is the experimental studies that record various and several features for each instance. It creates datasets having very high dimensionality with mixed data types, with both numerical and categorical variables. On the other hand, unsupervised learning has shown to be able to assist in high-dimensional data, allowing the discovery of unknown patterns through clustering, visualization, dimensionality reduction, and in some cases, their combination. This work highlights unsupervised learning methodologies for large-scale, high-dimensional data, providing the potential of a unified framework that combines the knowledge retrieved from clustering and visualization. The main purpose is to uncover hidden patterns in a high-dimensional mixed dataset, which we achieve through our application in a complex, real-world dataset. The experimental analysis indicates the existence of notable information exposing the usefulness of the utilized methodological framework for similar high-dimensional and mixed, real-world applications.

Download Full-text

Feature selection using autoencoders with Bayesian methods to high-dimensional data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211348 ◽

2021 ◽

pp. 1-10

Author(s):

Lei Shu ◽

Kun Huang ◽

Wenhao Jiang ◽

Wenming Wu ◽

Hongling Liu

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Bayesian Methods ◽

Large Scale ◽

High Dimensional Data ◽

Hybrid Approach ◽

High Dimensional ◽

Real World Data ◽

Learning Tasks ◽

Low Dimensional

It is easy to lead to poor generalization in machine learning tasks using real-world data directly, since such data is usually high-dimensional dimensionality and limited. Through learning the low dimensional representations of high-dimensional data, feature selection can retain useful features for machine learning tasks. Using these useful features effectively trains machine learning models. Hence, it is a challenge for feature selection from high-dimensional data. To address this issue, in this paper, a hybrid approach consisted of an autoencoder and Bayesian methods is proposed for a novel feature selection. Firstly, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer. This of doing is to increase the precision during selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, compared with the mainstream approaches for feature selection, the proposed method outperforms them. We find that the way consisted of autoencoders and probabilistic correction methods is more meaningful than that of stacking architectures or adding constraints to autoencoders as regards feature selection. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, however, sparse autoencoders are beneficial for a smaller number of feature selection. We indicate that the value of the proposed method provides a theoretical reference to analyze the optimality of feature selection.

Download Full-text