Two Steps Genetic Programming for Big Data - Perspective of Distributed and High-Dimensional Data

Author(s): Jih-Jeng Huang
2021, Vol 26 (1), pp. 67-77

Author(s): Siva Sankari Subbiah, Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of these datasets increases day by day at an extraordinary rate, resulting in large-scale, high-dimensional datasets. This paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and overfitting of the learning process. Researchers have proposed many feature selection methods for obtaining more relevant features, especially from big datasets, which help provide accurate learning results without performance degradation. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes related research work by various researchers. As a result, big data analysis with feature selection improves the accuracy of learning.
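The filter-style feature selection surveyed in this abstract can be sketched with a standard library call; the synthetic dataset, score function, and `k` below are illustrative assumptions, not taken from the paper:

```python
# Minimal filter-based feature selection sketch (illustrative, not the
# paper's method): rank features by a univariate score and keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic high-dimensional data: 100 samples, 1000 features,
# only 10 of which are informative.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=10, random_state=0)

# Score each feature independently (ANOVA F-test) and keep the best 20.
selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (100, 1000) -> (100, 20)
```

Filter methods like this scale well to big data because each feature is scored independently, which is also why they parallelize naturally on Hadoop or Spark.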


2021
Author(s): Binh Tran, Bing Xue, Mengjie Zhang

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study investigating the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases where overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed features can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.
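The tree-based GP representation described above can be illustrated with a tiny expression tree; the node structure and function set below are illustrative assumptions, not the authors' implementation. Each tree constructs one high-level feature, and the raw features appearing at its leaves are the implicitly selected ones:

```python
# Illustrative sketch of a tree-based GP individual that constructs a
# high-level feature; leaves index raw features, internal nodes apply
# functions, so evaluating the tree also selects features implicitly.
import operator

class Node:
    def __init__(self, op=None, children=(), feature=None):
        self.op, self.children, self.feature = op, children, feature

    def evaluate(self, x):
        """Compute the constructed feature's value on one sample x."""
        if self.op is None:          # leaf: read a raw feature
            return x[self.feature]
        return self.op(*(c.evaluate(x) for c in self.children))

    def used_features(self):
        """Leaves reached by the tree = implicitly selected features."""
        if self.op is None:
            return {self.feature}
        return set().union(*(c.used_features() for c in self.children))

# Example individual: (x[0] + x[3]) * x[7]
tree = Node(operator.mul, (
    Node(operator.add, (Node(feature=0), Node(feature=3))),
    Node(feature=7),
))

x = [1.0, 5.0, 9.0, 2.0, 4.0, 8.0, 6.0, 3.0]
print(tree.evaluate(x))              # (1.0 + 2.0) * 3.0 = 9.0
print(sorted(tree.used_features()))  # [0, 3, 7]
```

In a full GP system such trees would be evolved by crossover and mutation against a fitness measure (e.g., classification accuracy of a learner trained on the constructed features).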




Author(s):  
Jayashree K. ◽  
Swaminathan B.

The huge volumes of data produced by applications ranging from social networks to scientific computing are termed big data. Cloud computing, as a delivery model for IT services, enhances business productivity by reducing cost, and it aims to provide solutions for managing big data such as high-dimensional datasets. Thus, this chapter discusses the background of big data and cloud computing. It also discusses the various applications of big data in detail. Related work, the research challenges of big data in cloud computing, and future directions are addressed in this chapter.


2020
Author(s):  
Alexander Jung

We propose networked exponential families for non-parametric machine learning from massive network-structured datasets ("big data over networks"). High-dimensional data points are interpreted as the realizations of a random process distributed according to some exponential family. Networked exponential families make it possible to jointly leverage the information contained in high-dimensional data points and their network structure. For data points representing individuals, we obtain perfectly personalized models which enable high-precision medicine or more general recommendation systems. We learn the parameters of networked exponential families using the network Lasso, which implicitly pools (or clusters) the data points according to the intrinsic network structure and a local likelihood function. Our main theoretical result characterizes how the accuracy of the network Lasso depends on the network structure and the information geometry of the node-wise exponential families. The network Lasso can be implemented as highly scalable message passing over the data network. Such message passing is appealing for federated machine learning relying on edge computing. The proposed method is also privacy-preserving in the sense that no raw data but only parameter estimates are shared among different nodes.
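The network Lasso criterion referred to in this abstract has, in its generic form (following Hallac et al.'s formulation; the symbols below are illustrative and may differ from this paper's exact notation), a node-wise loss plus an edge-wise coupling penalty:

```latex
\min_{\{w_i\}_{i \in \mathcal{V}}} \;
  \sum_{i \in \mathcal{V}} -\log p(x_i ; w_i)
  \;+\; \lambda \sum_{(i,j) \in \mathcal{E}} A_{ij} \, \lVert w_i - w_j \rVert_2
```

Here $-\log p(x_i ; w_i)$ is the local negative log-likelihood of data point $x_i$ under the node-wise exponential family with parameters $w_i$, $A_{ij}$ are edge weights of the data network, and $\lambda$ controls how strongly the parameters of connected nodes are pooled; the non-squared norm on the edges is what induces clustering of node parameters.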

