Two Steps Genetic Programming for Big Data - Perspective of Distributed and High-Dimensional Data

Author(s): Jih-Jeng Huang
2021, Vol 26 (1), pp. 67-77

Author(s): Siva Sankari Subbiah, Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of these datasets increases day by day at an extraordinary rate, resulting in large-scale, high-dimensional datasets. This paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and overfitting of the learning process. Researchers have proposed many feature selection methods for obtaining more relevant features, especially from big datasets, which help provide accurate learning results without performance degradation. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes related research work by various researchers. As a result, big data analysis with feature selection improves the accuracy of learning.
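The filter-style feature selection surveyed in this abstract can be sketched with a standard library call; the synthetic dataset, score function, and `k` below are illustrative assumptions, not taken from the paper:

```python
# Minimal filter-based feature selection sketch (illustrative, not the
# paper's method): rank features by a univariate score and keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic high-dimensional data: 100 samples, 1000 features,
# only 10 of which are informative.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=10, random_state=0)

# Score each feature independently (ANOVA F-test) and keep the best 20.
selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (100, 1000) -> (100, 20)
```

Filter methods like this scale well to big data because each feature is scored independently, which is also why they parallelize naturally on Hadoop or Spark.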


2021
Author(s): Binh Tran, Bing Xue, Mengjie Zhang

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study investigating the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases where overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed features can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.
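The tree-based GP representation described above can be illustrated with a tiny expression tree; the node structure and function set below are illustrative assumptions, not the authors' implementation. Each tree constructs one high-level feature, and the raw features appearing at its leaves are the implicitly selected ones:

```python
# Illustrative sketch of a tree-based GP individual that constructs a
# high-level feature; leaves index raw features, internal nodes apply
# functions, so evaluating the tree also selects features implicitly.
import operator

class Node:
    def __init__(self, op=None, children=(), feature=None):
        self.op, self.children, self.feature = op, children, feature

    def evaluate(self, x):
        """Compute the constructed feature's value on one sample x."""
        if self.op is None:          # leaf: read a raw feature
            return x[self.feature]
        return self.op(*(c.evaluate(x) for c in self.children))

    def used_features(self):
        """Leaves reached by the tree = implicitly selected features."""
        if self.op is None:
            return {self.feature}
        return set().union(*(c.used_features() for c in self.children))

# Example individual: (x[0] + x[3]) * x[7]
tree = Node(operator.mul, (
    Node(operator.add, (Node(feature=0), Node(feature=3))),
    Node(feature=7),
))

x = [1.0, 5.0, 9.0, 2.0, 4.0, 8.0, 6.0, 3.0]
print(tree.evaluate(x))              # (1.0 + 2.0) * 3.0 = 9.0
print(sorted(tree.used_features()))  # [0, 3, 7]
```

In a full GP system such trees would be evolved by crossover and mutation against a fitness measure (e.g., classification accuracy of a learner trained on the constructed features).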




Author(s):  
Jayashree K. ◽  
Swaminathan B.

The huge volumes of data produced by applications ranging from social networks to scientific computing are termed big data. Cloud computing, as a delivery model for IT services, enhances business productivity by reducing cost, and it aims to provide solutions for managing big data such as high-dimensional datasets. Thus, this chapter discusses the background of big data and cloud computing. It also discusses the various applications of big data in detail. Related work, the research challenges of big data in cloud computing, and future directions are addressed in this chapter.


2020
Author(s):  
Alexander Jung

We propose networked exponential families for non-parametric machine learning from massive network-structured datasets ("big data over networks"). High-dimensional data points are interpreted as the realizations of a random process distributed according to some exponential family. Networked exponential families make it possible to jointly leverage the information contained in high-dimensional data points and their network structure. For data points representing individuals, we obtain perfectly personalized models which enable high-precision medicine or more general recommendation systems. We learn the parameters of networked exponential families using the network Lasso, which implicitly pools (or clusters) the data points according to the intrinsic network structure and a local likelihood function. Our main theoretical result characterizes how the accuracy of the network Lasso depends on the network structure and the information geometry of the node-wise exponential families. The network Lasso can be implemented as highly scalable message passing over the data network. Such message passing is appealing for federated machine learning relying on edge computing. The proposed method is also privacy-preserving in the sense that no raw data but only parameter estimates are shared among different nodes.
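The network Lasso criterion referred to in this abstract has, in its generic form (following Hallac et al.'s formulation; the symbols below are illustrative and may differ from this paper's exact notation), a node-wise loss plus an edge-wise coupling penalty:

```latex
\min_{\{w_i\}_{i \in \mathcal{V}}} \;
  \sum_{i \in \mathcal{V}} -\log p(x_i ; w_i)
  \;+\; \lambda \sum_{(i,j) \in \mathcal{E}} A_{ij} \, \lVert w_i - w_j \rVert_2
```

Here $-\log p(x_i ; w_i)$ is the local negative log-likelihood of data point $x_i$ under the node-wise exponential family with parameters $w_i$, $A_{ij}$ are edge weights of the data network, and $\lambda$ controls how strongly the parameters of connected nodes are pooled; the non-squared norm on the edges is what induces clustering of node parameters.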

