Feature selection using autoencoders with Bayesian methods to high-dimensional data

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional and limited. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features that are useful for machine learning tasks, and models can be trained effectively on these features. Feature selection from high-dimensional data therefore remains a challenge. To address this issue, this paper proposes a novel feature selection method: a hybrid approach consisting of an autoencoder and Bayesian methods. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer, which increases precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is compared with mainstream feature selection approaches and outperforms them. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are better suited to large-scale feature selection, whereas sparse autoencoders are beneficial when selecting a smaller number of features. The proposed method provides a theoretical reference for analyzing the optimality of feature selection.


Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Applied Sciences ◽

10.3390/app11020472 ◽

2021 ◽

Vol 11 (2) ◽

pp. 472

Author(s):

Hyeongmin Cho ◽

Sangkyun Lee

Keyword(s):

Machine Learning ◽

Data Quality ◽

Large Scale ◽

High Dimensional Data ◽

Quality Measures ◽

Training Data ◽

Measure Data ◽

High Dimensional ◽

Small Scale ◽

Class Separability

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
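The idea of estimating class separability with random projections and bootstrapping can be illustrated with a minimal Python sketch. This is not the measure defined in the paper: the classical one-dimensional Fisher ratio is swapped in as the separability score, and the function names, the number of projections, and the number of bootstrap rounds are all assumptions made for illustration.

```python
import random
import statistics


def fisher_ratio_1d(values, labels):
    """Between-class scatter divided by within-class scatter for 1-D values."""
    overall = statistics.fmean(values)
    between = 0.0
    within = 0.0
    for c in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == c]
        mean_c = statistics.fmean(group)
        between += len(group) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in group)
    return between / within if within > 0 else float("inf")


def random_direction(dim, rng):
    """Draw a random unit vector by normalizing Gaussian coordinates."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]


def projected_separability(X, y, n_projections=20, n_bootstrap=10, seed=0):
    """Average the 1-D Fisher ratio over bootstrap resamples and random projections.

    Hypothetical illustration only; the paper defines its own measures.
    """
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    scores = []
    for _ in range(n_bootstrap):
        # Bootstrap: resample the rows with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # a resample with a single class carries no separability signal
        for _ in range(n_projections):
            # Random projection: reduce each row to one coordinate.
            w = random_direction(dim, rng)
            projected = [sum(a * b for a, b in zip(row, w)) for row in Xb]
            scores.append(fisher_ratio_1d(projected, yb))
    if not scores:
        raise ValueError("no bootstrap resample contained at least two classes")
    return statistics.fmean(scores)
```

Because every projection is one-dimensional, each score costs time linear in the number of samples, which is the kind of saving that makes projection-based estimates attractive for large, high-dimensional datasets. A dataset whose classes lie far apart should receive a higher score than one whose classes overlap.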


A REVIEW OF FEATURE EXTRACTION METHODS ON MACHINE LEARNING

Journal of Information System and Technology Management ◽

10.35631/jistm.622005 ◽

2021 ◽

Vol 6 (22) ◽

pp. 51-59

Author(s):

Mustazzihim Suhaidi ◽

Rabiah Abdul Kadir ◽

Sabrina Tiun

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Feature Selection ◽

Input Data ◽

Feature Vector ◽

Learning Algorithm ◽

Extraction Methods ◽

Machine Learning Algorithm ◽

Learning Tasks ◽

Low Dimensional

Extracting features from input data is vital for successful classification and machine learning tasks. Classification is the process of assigning an object to one of a set of predefined categories. Many different feature selection and feature extraction methods exist and are widely used. Feature extraction is a transformation of large input data into a low-dimensional feature vector, which serves as the input to a classification or other machine learning algorithm. The task of feature extraction poses major challenges, which are discussed in this paper; chief among them is learning and extracting knowledge from text datasets in order to make correct decisions. The objective of this paper is to give an overview of the methods used in feature extraction for various applications, with a dataset containing a collection of texts taken from social media.
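As a concrete illustration of turning raw text into a low-dimensional feature vector, the following Python sketch uses feature hashing over a bag of words. The abstract does not say which methods the review covers, so this is only one common technique chosen for illustration; the function name, the tokenization rule, and the vector size of 16 are assumptions.

```python
import re
import zlib


def hashed_bag_of_words(text, n_features=16):
    """Map a text to a fixed-length vector of normalized token counts.

    Each token is assigned to a bucket by a stable hash (CRC32), so the
    output length does not depend on the vocabulary size.
    """
    vec = [0.0] * n_features
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        bucket = zlib.crc32(token.encode("utf-8")) % n_features
        vec[bucket] += 1.0
    total = sum(vec)
    # Normalize to relative frequencies; an empty text stays all zeros.
    return [v / total for v in vec] if total else vec
```

Calling `hashed_bag_of_words("Great phone, great battery")` returns a 16-element list whose entries sum to 1. Distinct tokens can collide in the same bucket, which is the price paid for the fixed, small dimensionality.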


Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review

Ingénierie des systèmes d'information ◽

10.18280/isi.260107 ◽

2021 ◽

Vol 26 (1) ◽

pp. 67-77

Author(s):

Siva Sankari Subbiah ◽

Jayakumar Chinnappan

Keyword(s):

Feature Selection ◽

Big Data ◽

Large Scale ◽

High Dimensional Data ◽

Research Work ◽

Basic Feature ◽

High Dimensional ◽

Selection Methods ◽

Fast Development ◽

Improved Accuracy

Nowadays, organizations collect huge volumes of data without knowing its usefulness. The fast development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimension of datasets increases day by day at an extraordinary rate, resulting in large-scale datasets with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and overfitting of the learning process. Researchers have proposed many feature selection methods for obtaining more relevant features, especially from big datasets, which helps to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes the related research work. As a result, big data analysis with feature selection improves the accuracy of learning.


Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

International Journal of Engineering ◽

10.5829/ije.2020.33.02b.05 ◽

2020 ◽

Vol 33 (2) ◽

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Hybrid Approach ◽

Small Sample ◽

High Dimensional ◽

Selection For


An Effective Multi-Label Feature Selection Model Towards Eliminating Noisy Features

Applied Sciences ◽

10.3390/app10228093 ◽

2020 ◽

Vol 10 (22) ◽

pp. 8093

Author(s):

Jun Wang ◽

Yuanyuan Xu ◽

Hengpeng Xu ◽

Zhe Sun ◽

Zhenglu Yang ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Performance ◽

Space Structures ◽

Learning Tasks ◽

Feature Spaces ◽

Selection Approach ◽

Label Correlations ◽

Feature Selection Approach ◽

Low Dimensional

Considerable effort has consistently been devoted to feature selection as a means of dimension reduction for various machine learning tasks. Existing feature selection models focus on selecting the most discriminative features for learning targets. However, this strategy is weak in handling two kinds of features, namely irrelevant and redundant ones, which are collectively referred to as noisy features. These features may hamper the construction of optimal low-dimensional subspaces and compromise the learning performance of downstream tasks. In this study, we propose a novel multi-label feature selection approach that embeds label correlations (dubbed ELC) to address these issues. In particular, we extract label correlations to obtain reliable label space structures and employ them to steer feature selection. In this way, the label and feature spaces can be expected to be consistent, and noisy features can be effectively eliminated. An extensive experimental evaluation on public benchmarks validated the superiority of ELC.


BSSReduce an $O(\left|U\right|)$ Incremental Feature Selection Approach for Large-Scale and High-Dimensional Data

IEEE Transactions on Fuzzy Systems ◽

10.1109/tfuzz.2018.2825308 ◽

2018 ◽

Vol 26 (6) ◽

pp. 3356-3367 ◽

Cited By ~ 13

Author(s):

Ke Gong ◽

Yong Wang ◽

Maozeng Xu ◽

Zhi Xiao

Keyword(s):

Feature Selection ◽

Large Scale ◽

High Dimensional Data ◽

High Dimensional ◽

Incremental Feature Selection ◽

Selection Approach ◽

Feature Selection Approach


Feature Selection Techniques in High Dimensional Data With Machine Learning and Deep Learning

Advances in Data Mining and Database Management - Handbook of Research on Automated Feature Engineering and Advanced Applications in Data Science ◽

10.4018/978-1-7998-6659-6.ch002 ◽

2021 ◽

pp. 17-37

Author(s):

Bhanu Chander

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Deep Learning ◽

High Dimensional Data ◽

Complete Information ◽

High Dimensional ◽

Future Research ◽

Future Research Directions ◽

Class Labels ◽

Feature Selection Techniques

The analysis of high-dimensional data is one of the major challenges for researchers and engineers in the domains of deep learning (DL), machine learning (ML), and data mining. Feature selection (FS) provides an efficient way to address these difficulties by eliminating irrelevant and redundant data, which can reduce computation time, improve learning accuracy, and facilitate a better understanding of the learning model or data. To eliminate an irrelevant feature, an FS criterion is required that can measure the relevance of each feature to the output class/labels. Filter methods use variable ranking techniques as the principal criterion for selecting variables by ordering. Ranking methods are used because of their simplicity, and good results are reported for practical applications. The goal of this chapter is to provide complete information on FS approaches, their applications, and future research directions.
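A filter method of the kind described above can be sketched in a few lines of Python: score each feature against the labels, order the features by score, and keep the top ones. The chapter does not prescribe a particular ranking criterion, so the absolute Pearson correlation used here is an assumption, as are the function names.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_features(X, y):
    """Return (feature_index, score) pairs ordered from most to least relevant."""
    scored = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scored.append((j, abs(pearson(column, y))))
    # Sort by descending score; ties are broken by the lower feature index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


def select_top_k(X, y, k):
    """Keep the k highest-ranked features, preserving their original column order."""
    keep = sorted(j for j, _ in rank_features(X, y)[:k])
    return keep, [[row[j] for j in keep] for row in X]
```

Because each feature is scored independently of the others, a ranking like this is cheap to compute, but it cannot detect redundancy between two features that are both strongly related to the labels.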


A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2009.00859 ◽

2009 ◽

Vol 35 (7) ◽

pp. 859-866

Author(s):

Ming LIU ◽

Xiao-Long WANG ◽

Yuan-Chao LIU

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

High Dimensional Data ◽

High Dimensional


Classification of Brainwaves for Sleep Stages by High-Dimensional FFT Features from EEG Signals

Applied Sciences ◽

10.3390/app10051797 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1797 ◽

Cited By ~ 2

Author(s):

Mera Kartika Delimayanti ◽

Bedy Purnama ◽

Ngoc Giang Nguyen ◽

Mohammad Reza Faisal ◽

Kunti Robiatul Mahmudah ◽

...

Keyword(s):

Machine Learning ◽

Sleep Stage ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Sleep Stages ◽

Eeg Signals ◽

Stage Classification ◽

Sleep Stage Classification ◽

Low Dimensional

Manual classification of sleep stages is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. Previous works have applied low-dimensional fast Fourier transform (FFT) features together with many machine learning algorithms. In this paper, we demonstrate the use of features extracted from EEG signals via FFT to improve the performance of automated sleep stage classification through machine learning methods. Unlike previous works using FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2–6 classes. Using the expanded version of the Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the-art methods. This result indicates that high-dimensional FFT features in combination with simple feature selection are effective for improving automated sleep stage classification.
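To make the notion of frequency-domain features concrete, the sketch below computes one magnitude per frequency bin for a single signal segment. It uses a naive discrete Fourier transform written with the standard library, which yields the same coefficients an FFT would compute, only more slowly. The function names, the segment length, and the 100 Hz sampling rate in the usage note are assumptions; the abstract does not describe the paper's exact feature pipeline.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the non-negative-frequency DFT coefficients, scaled by 1/n.

    Naive O(n^2) transform; an FFT returns the same values faster.
    """
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff) / n)
    return magnitudes


def spectral_features(signal, sampling_rate):
    """Pair each magnitude with its frequency in Hz: one feature per bin."""
    n = len(signal)
    return [
        (k * sampling_rate / n, magnitude)
        for k, magnitude in enumerate(dft_magnitudes(signal))
    ]
```

For a one-second, 10 Hz sine wave sampled at a hypothetical 100 Hz, `spectral_features` returns 51 (frequency, magnitude) pairs whose largest magnitude sits in the 10 Hz bin. Longer segments give more bins, which is how the number of such features can grow into the thousands.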


BagMeLiF: stable boosting-based hybrid-ensemble feature selection algorithm for high-dimensional data

2020 International Conference on Control, Robotics and Intelligent System ◽

10.1145/3437802.3437835 ◽

2020 ◽

Author(s):

Nikita Pilnenskiy ◽

Ivan Smetannikov

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

High Dimensional ◽

Selection Algorithm ◽

Feature Selection Algorithm
