Effect of Feature Selection on Gene Expression Datasets Classification Accuracy

Author(s):  
Hicham Omara ◽  
Mohamed Lazaar ◽  
Youness Tabii

Feature selection attracts researchers who deal with machine learning and data mining. It consists of selecting the variables that have the greatest impact on the classification of a dataset and discarding the rest. This dimensionality reduction allows classifiers to be faster and more accurate. This paper examines the effect of feature selection on the accuracy of classifiers widely used in the literature. These classifiers are compared on three real datasets that are pre-processed with feature selection methods. An improvement of more than 9% in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.
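As a hedged illustration of the kind of experiment this abstract describes (not the authors' code; the dataset, filter, and classifier below are placeholder choices), a scikit-learn sketch can compare a classifier's cross-validated accuracy with and without a univariate feature-selection filter:

```python
# Sketch: measure how a feature-selection filter changes classification
# accuracy. Dataset and parameter choices are illustrative, not the paper's.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Baseline: all features, 10-fold cross-validated accuracy.
base = cross_val_score(KNeighborsClassifier(), X, y, cv=10).mean()

# With selection: keep the 10 highest-scoring features (ANOVA F-test).
selected = make_pipeline(SelectKBest(f_classif, k=10), KNeighborsClassifier())
sel = cross_val_score(selected, X, y, cv=10).mean()

print(f"accuracy without selection: {base:.3f}, with selection: {sel:.3f}")
```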

Author(s):  
Manoranjan Dash ◽  
Vivekanand Gopalkrishnan

Feature selection and tuple selection help the classifier to focus, achieving similar (or even better) accuracy compared to classification without them. Although feature selection and tuple selection have each been studied in research areas such as machine learning and data mining, they have rarely been studied together. The contribution of this chapter is a novel distance measure for selecting the most representative features and tuples. Experiments are conducted on microarray gene expression datasets as well as UCI machine learning and KDD datasets. Results show that the proposed method outperforms existing methods quite significantly.
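The chapter's distance measure is not given in the abstract, so the following is only a loose stand-in sketch of joint feature-and-tuple selection: a generic filter score ranks features, and per-class centroid distance serves as a crude proxy for tuple representativeness. Every choice here is an assumption, not the authors' measure.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)

# Feature selection: keep the k best features by a simple filter score.
X_fs = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Tuple selection: per class, keep the rows nearest the class centroid
# (a crude proxy for "most representative tuples", not the chapter's measure).
keep = []
for c in np.unique(y):
    idx = np.where(y == c)[0]
    centroid = X_fs[idx].mean(axis=0)
    dist = np.linalg.norm(X_fs[idx] - centroid, axis=1)
    keep.extend(idx[np.argsort(dist)[: int(0.8 * len(idx))]])

X_sel, y_sel = X_fs[np.array(keep)], y[np.array(keep)]
print(X.shape, "->", X_sel.shape)
```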


Author(s):  
Mohsin Iqbal ◽  
Saif Ur Rehman ◽  
Saira Gillani ◽  
Sohail Asghar

The key objective of this chapter is to study classification accuracy when feature selection is used with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers. Among filter methods, Information Gain (IG), Gain Ratio (GR), and Relief-F, and among wrapper methods, Bagging and Naive Bayes (NB), enabled the classifiers to achieve the highest average gains in classification accuracy while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy. This is especially important in risk-sensitive applications of machine learning, where one aim is to reduce the cost of collecting, processing, and storing unnecessary data.
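As a rough sketch of this kind of comparison (not the chapter's setup): scikit-learn's mutual information can stand in for Information Gain as a filter, and a wrapper can be built by letting the classifier itself score candidate subsets; Gain Ratio and Relief-F are not in scikit-learn and are omitted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Filter: mutual information as a stand-in for Information Gain.
ig_pipe = make_pipeline(SelectKBest(mutual_info_classif, k=10), GaussianNB())

# Wrapper: greedy forward selection scored by the classifier itself,
# then a bagged Naive Bayes on the chosen subset.
wrap_pipe = make_pipeline(
    SequentialFeatureSelector(GaussianNB(), n_features_to_select=10, cv=5),
    BaggingClassifier(GaussianNB()),
)

for name, pipe in [("filter (MI + NB)", ig_pipe),
                   ("wrapper (SFS + Bagging)", wrap_pipe)]:
    print(name, cross_val_score(pipe, X, y, cv=5).mean())
```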


Author(s):  
Raghavendra S ◽  
Santosh Kumar J

Data mining is the process of viewing data from different angles and compiling it into useful information. Recent improvements in data mining and machine learning have empowered research in the biomedical field to improve general health care. Since wrong classification may lead to poor prediction, better classification is needed to improve the prediction rate on medical datasets. When data mining is applied to medical datasets, classification and prediction are the important and difficult challenges. In this work we evaluate the PIMA Indian Diabetes dataset from the UCI repository using the Random Forest machine learning algorithm, together with feature selection methods such as forward selection and backward elimination based on an entropy evaluation method, using percentage split as the test option. The experiment was conducted on the R Studio platform, and we achieved a classification accuracy of 84.1%. The results suggest that Random Forest predicts diabetes better than other techniques with fewer attributes, so the least important tests for identifying diabetes can be avoided.
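A hedged Python approximation of the described workflow (the original used R, and scikit-learn's SequentialFeatureSelector scores subsets by cross-validation rather than the entropy criterion described; treating OpenML's "diabetes" dataset as the PIMA data is also an assumption):

```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# OpenML's "diabetes" task is commonly the PIMA Indian Diabetes data
# (an assumption here; verify the source before relying on it).
X, y = fetch_openml("diabetes", version=1, return_X_y=True, as_frame=False)

# Percentage split as the test option (66/34 is a common default).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
# Forward selection; sklearn scores subsets by CV accuracy rather than
# the entropy-based criterion used in the paper.
model = make_pipeline(
    SequentialFeatureSelector(rf, n_features_to_select=4,
                              direction="forward", cv=5),
    rf,
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```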


Author(s):  
Arvind Kumar Tiwari

Feature selection is an important topic in data mining, especially for high-dimensional datasets. It is a process commonly used in machine learning, wherein a subset of the features available in the data is selected for the learning algorithm. The best subset contains the smallest number of dimensions that contribute most to accuracy. Feature selection methods can be divided into three main classes: filter methods, wrapper methods, and embedded methods. This chapter presents an empirical comparison of feature selection methods and their algorithms. In view of the substantial number of existing feature selection algorithms, criteria are needed to decide adequately which algorithm to use in a given situation. This chapter reviews several fundamental algorithms from the literature and assesses their performance in a controlled scenario.
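As a minimal illustration of the three classes using assumed scikit-learn equivalents (not the chapter's algorithms): a filter scores features independently of any learner, a wrapper searches subsets driven by a model, and an embedded method selects during training itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of any learner (chi-squared).
X_filter = SelectKBest(chi2, k=10).fit_transform(X, y)

# Wrapper: recursive feature elimination driven by a model's weights.
X_wrap = RFE(LogisticRegression(max_iter=5000),
             n_features_to_select=10).fit_transform(X, y)

# Embedded: selection happens inside training via L1 regularization.
l1 = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000).fit(X, y)
X_embed = SelectFromModel(l1, prefit=True).transform(X)

print(X_filter.shape, X_wrap.shape, X_embed.shape)
```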



PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254976
Author(s):  
Keyvan Karami ◽  
Mahboubeh Akbari ◽  
Mohammad-Taher Moradi ◽  
Bijan Soleymani ◽  
Hossein Fallahi

This paper identifies prognostic factors for survival in patients with acute myeloid leukemia (AML) using machine learning techniques. We have integrated machine learning with feature selection methods and compared their performance to identify the most suitable factors for assessing the survival of AML patients. Six data mining algorithms, namely Decision Tree, Random Forest, Logistic Regression, Naive Bayes, W-Bayes Net, and Gradient Boosted Tree (GBT), are employed for the detection model and implemented using the common data mining tool RapidMiner and open-source R packages. To improve the predictive ability of the model, a set of features was selected by employing multiple feature selection methods. Classification accuracy was obtained using 10-fold cross-validation for the various combinations of feature selection methods and machine learning algorithms. The performance of the models was assessed by several measurement indexes, including accuracy, kappa, sensitivity, specificity, positive predictive value, negative predictive value, and area under the ROC curve (AUC). Our results showed that GBT, with an accuracy of 85.17% and an AUC of 0.930, combined with feature selection via the Relief algorithm, has the best performance in predicting the survival rate of AML patients.
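A hedged sketch of such an evaluation loop, with scikit-learn's GradientBoostingClassifier standing in for RapidMiner's GBT and a univariate filter standing in for Relief (Relief-family selectors live in third-party packages such as skrebate, not in scikit-learn):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Feature selection + GBT, scored with 10-fold cross-validation on
# several of the indexes mentioned (accuracy, AUC, sensitivity/recall).
pipe = make_pipeline(
    SelectKBest(f_classif, k=10),  # stand-in for the Relief filter
    GradientBoostingClassifier(random_state=0),
)
scores = cross_validate(pipe, X, y, cv=10,
                        scoring=["accuracy", "roc_auc", "recall"])
for key in ("test_accuracy", "test_roc_auc", "test_recall"):
    print(key, scores[key].mean())
```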


2017 ◽  
Vol 10 (2) ◽  
pp. 282-290
Author(s):  
Samir Singha ◽  
Syed Hassan

The performance of data mining and machine learning tasks can be significantly degraded by noisy, irrelevant, and high-dimensional data containing a large number of features. A large amount of real-world data contains noise or missing values, and many irrelevant features may be collected into storage repositories. These redundant and irrelevant feature values distort the classification principle, increase computational overhead, and decrease the prediction ability of the classifier. The high dimensionality of such datasets poses a major bottleneck in data mining, statistics, and machine learning. Among several methods of dimensionality reduction, attribute or feature selection is often used. Since the k-NN algorithm is sensitive to irrelevant attributes, its performance degrades significantly when a dataset contains missing values or noisy data. This weakness of k-NN can, however, be mitigated by combining it with feature selection techniques. In this research we combine Correlation-based Feature Selection (CFS) with the k-Nearest Neighbour (k-NN) classification algorithm to obtain better classification results when the dataset contains missing values or noisy data. The reduced attribute set also decreases the time required for classification. The research shows that when dimensionality reduction is done using CFS and classification with k-NN, datasets with little or no noise may see a negative impact on classification accuracy compared with k-NN alone. When additional noise is introduced, the performance of k-NN alone degrades significantly, whereas classifying the noisy datasets with CFS and k-NN together improves classification accuracy.
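A rough sketch of the idea (a simplified greedy variant, not the paper's exact CFS implementation; Weka's CFS uses symmetrical uncertainty on discretized features rather than the Pearson correlations used below): rank candidate subsets by class correlation penalized by inter-feature correlation, then classify with k-NN.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# CFS-style merit: merit(S) = k*r_cf / sqrt(k + k(k-1)*r_ff), where r_cf is
# the mean feature-class correlation and r_ff the mean feature-feature
# correlation over the subset S (absolute Pearson, a simplification).
r_cf = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
r_ff = np.abs(np.corrcoef(X, rowvar=False))

selected, remaining, best_merit = [], list(range(X.shape[1])), 0.0
while remaining:
    merits = []
    for j in remaining:
        s = selected + [j]
        k = len(s)
        rcf = r_cf[s].mean()
        rff = r_ff[np.ix_(s, s)][np.triu_indices(k, 1)].mean() if k > 1 else 0.0
        merits.append(k * rcf / np.sqrt(k + k * (k - 1) * rff))
    if max(merits) <= best_merit:
        break  # greedy search stops when merit no longer improves
    best_merit = max(merits)
    j_best = remaining[int(np.argmax(merits))]
    selected.append(j_best)
    remaining.remove(j_best)

knn = KNeighborsClassifier()
print("all features:", cross_val_score(knn, X, y, cv=10).mean())
print("CFS subset  :", cross_val_score(knn, X[:, selected], y, cv=10).mean())
```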


2019 ◽  
Vol 21 (9) ◽  
pp. 631-645 ◽  
Author(s):  
Saeed Ahmed ◽  
Muhammad Kabir ◽  
Zakir Ali ◽  
Muhammad Arif ◽  
Farman Ali ◽  
...  

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosing this deadly disease at an early stage is a relatively new clinical application of microarray data. In DNA microarray technology, gene expression data have high dimensionality with small sample sizes. The development of efficient and robust feature selection methods that identify a small set of genes to achieve better classification performance is therefore indispensable. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates Correlation-based Feature Selection (CFS) and a Multi-Objective Evolutionary Algorithm (MOEA) to select the most informative genes. The hybrid model, with a Radial Basis Function Neural Network (RBFNN) classifier, has been evaluated on 11 benchmark gene expression datasets using a 10-fold cross-validation test. Results: The experimental results are compared with seven conventional feature selection methods and other methods in the literature, showing that our approach has clear merit in terms of classification accuracy and the number of genes selected. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy on six of the eleven datasets with a minimally sized predictive gene subset.
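Neither the MOEA nor an RBFNN ships with scikit-learn, so the following is only a loosely analogous two-stage sketch: a univariate filter (simplifying CFS) prunes the candidate genes, and a random subset search stands in for the evolutionary stage, feeding an RBF-kernel SVM as a stand-in for the RBFNN. All names and parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Stage 1 (CFS-like filter, simplified to a univariate score): prune
# the features to a manageable candidate pool.
pool = SelectKBest(f_classif, k=15).fit(X, y).get_support(indices=True)

# Stage 2 (MOEA stand-in): random subset search balancing accuracy
# against subset size; a real MOEA would evolve a Pareto front.
clf = SVC(kernel="rbf")  # RBF-kernel SVM as a stand-in for an RBFNN
best = (0.0, pool)
for _ in range(50):
    size = rng.integers(3, len(pool))
    subset = rng.choice(pool, size=size, replace=False)
    acc = cross_val_score(clf, X[:, subset], y, cv=5).mean()
    # Prefer higher accuracy; break ties toward smaller subsets.
    if acc > best[0] or (acc == best[0] and len(subset) < len(best[1])):
        best = (acc, subset)

print(f"best accuracy {best[0]:.3f} with {len(best[1])} genes")
```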

