Feature selection algorithms for very high dimensional data and mixed data

2008 ◽  
Author(s):  
Wen Yin Tang
2021 ◽  
Vol 26 (1) ◽  
pp. 67-77
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. Datasets grow in dimension day by day at an extraordinary rate, resulting in large-scale datasets with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing dimensionality and overfitting in the learning process. Researchers have proposed many feature selection methods for obtaining more relevant features, especially from big datasets, which helps provide accurate learning results without degrading performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes related work by various researchers. The paper concludes that big data analysis with feature selection improves the accuracy of learning.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Jing Zhang ◽  
Guang Lu ◽  
Jiaquan Li ◽  
Chuanwen Li

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks because microarray data combine high dimensionality with small sample sizes. Feature selection is necessary when constructing the model in order to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough set is proposed. Pathway knowledge is used to select feature subsets, and a rough set based on intersection neighborhood is then used to select the important features in each subset, since it can select features without redundancy and deals with numerical features directly. To improve both the diversity among base classifiers and the efficiency of classification, only part of the base classifiers is selected. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster is selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.
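
The classifier-selection step in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formula for the combination distance, so the sketch only computes its two ingredients (per-classifier accuracy and pairwise Cohen's kappa) and then keeps the most accurate classifier per cluster. The function names, the toy predictions, and the cluster assignment `cluster_of` (standing in for the k-means output) are all assumptions made for illustration.

```python
from collections import Counter


def accuracy(pred, truth):
    """Fraction of samples a classifier labels correctly."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cohen_kappa(pred_a, pred_b):
    """Cohen's kappa between two classifiers' predictions.

    1.0 means identical predictions; lower values mean the two
    classifiers disagree more, i.e. they are more diverse.
    """
    n = len(pred_a)
    observed = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    count_a, count_b = Counter(pred_a), Counter(pred_b)
    # Agreement expected by chance from each classifier's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both classifiers always predict the same single class
        return 1.0
    return (observed - expected) / (1.0 - expected)


def best_per_cluster(cluster_of, accuracies):
    """Return the index of the most accurate classifier in each cluster."""
    best = {}
    for idx, cluster in enumerate(cluster_of):
        if cluster not in best or accuracies[idx] > accuracies[best[cluster]]:
            best[cluster] = idx
    return sorted(best.values())


# Toy validation labels and predictions of four hypothetical base classifiers.
truth = [0, 1, 1, 0, 1, 0]
preds = [
    [0, 1, 1, 0, 1, 0],  # classifier 0
    [0, 1, 1, 0, 0, 0],  # classifier 1
    [1, 1, 0, 0, 1, 0],  # classifier 2
    [1, 0, 0, 1, 1, 0],  # classifier 3
]

# Assumed cluster assignment; in the paper this would come from k-means
# run with the proposed combination distance, which is not reproduced here.
cluster_of = [0, 0, 1, 1]

accuracies = [accuracy(p, truth) for p in preds]
selected = best_per_cluster(cluster_of, accuracies)
print("kappa(0, 1) =", round(cohen_kappa(preds[0], preds[1]), 3))
print("selected classifiers:", selected)
```

Keeping one classifier per cluster means the final ensemble draws its members from groups that differ from one another, while the per-cluster accuracy criterion avoids retaining weak members.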

