Feature Selection Algorithm for Hyperlipidemia Classification

This paper reports a comparative study of feature selection algorithms on a hyperlipidemia data set. Three feature selection methods were evaluated: document frequency (DF), information gain (IG), and the χ² statistic (CHI). The classification systems represent each document as a vector and compute term weights with tfidfie (term frequency, inverted document frequency, and inverted entropy). To compare the effectiveness of the feature selection methods, we used three classifiers: Naïve Bayes (NB), k-Nearest Neighbor (kNN), and Support Vector Machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for large text classification tasks.
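
As an illustration of the three criteria named in this abstract, the following minimal sketch scores a single term on a toy corpus. The function name `term_scores`, the set-of-tokens document representation, and the choice of taking the maximum χ² over classes are assumptions made for the example; the paper's own implementation is not described in the abstract.

```python
import math

def term_scores(docs, labels, term):
    """Return (DF, IG, CHI) of `term` for a labelled corpus.

    docs   : list of sets of tokens (hypothetical toy representation)
    labels : list of class labels, one per document
    """
    n = len(docs)
    classes = sorted(set(labels))
    present = [term in d for d in docs]
    df = sum(present)  # document frequency: number of documents containing the term

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    # Information gain: H(C) - H(C | term present or absent)
    h_c = entropy([labels.count(c) for c in classes])
    h_cond = 0.0
    for flag in (True, False):
        subset = [l for l, p in zip(labels, present) if p == flag]
        if subset:
            h_cond += len(subset) / n * entropy([subset.count(c) for c in classes])
    ig = h_c - h_cond

    # Chi-square: 2x2 contingency statistic per class, maximum over classes
    # (taking the maximum is one common convention, assumed here)
    chi = 0.0
    for c in classes:
        a = sum(1 for l, p in zip(labels, present) if p and l == c)
        b = sum(1 for l, p in zip(labels, present) if p and l != c)
        cc = sum(1 for l, p in zip(labels, present) if not p and l == c)
        d = n - a - b - cc
        denom = (a + cc) * (b + d) * (a + b) * (cc + d)
        if denom:
            chi = max(chi, n * (a * d - cc * b) ** 2 / denom)
    return df, ig, chi
```

In this sketch a term that appears in every document of one class and in none of the other receives the highest IG and CHI, while a term spread evenly over both classes scores zero on both even though its DF is the same.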

A novel ensemble modeling for intrusion detection system

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i2.pp1963-1971 ◽

2020 ◽

Vol 10 (2) ◽

pp. 1963

Author(s):

Pullagura Indira Priyadarsini ◽

G. Anuradha

Keyword(s):

Feature Selection ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Nearest Neighbor ◽

Detection System ◽

Distance Functions ◽

Classification Model ◽

Support Vector ◽

K Nearest Neighbor ◽

Data Set

The vast increase in data carried by internet services has made computer systems more vulnerable and harder to protect from malicious attacks. Intrusion detection systems (IDSs) must therefore become more potent at monitoring intrusions. To this end, an effective IDS architecture is built that employs a simple classification model and achieves low false alarm rates and high accuracy. Notably, IDSs must handle enormous amounts of data traffic containing redundant and irrelevant features, which degrade their performance. Good feature selection approaches reduce unrelated and redundant features and attain better classification accuracy in IDSs. This paper proposes a novel ensemble model for IDSs based on two algorithms: Fuzzy Ensemble Feature Selection (FEFS) and Fusion of Multiple Classifier (FMC). FEFS unifies five feature scores, which are obtained using feature-class distance functions and aggregated using the fuzzy union operation. FMC, in turn, is the fusion of three classifiers and works on the basis of an ensemble decisive function. Experiments on the KDD Cup 99 data set show that the proposed system outperforms well-known methods such as Support Vector Machines (SVMs), K-Nearest Neighbor (KNN), and Artificial Neural Networks (ANNs). Our examinations clearly confirmed the value of the ensemble methodology for modeling IDSs, and hence our system is robust and efficient.
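
The two aggregation steps mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: scores are taken to be normalised to [0, 1], the fuzzy union is taken to be the standard element-wise maximum, and a plain majority vote stands in for the "ensemble decisive function", which the abstract does not define. All function names are hypothetical.

```python
from collections import Counter

def fuzzy_union(score_lists):
    """Combine per-feature scores (each assumed normalised to [0, 1])
    with the standard fuzzy union, i.e. the element-wise maximum."""
    return [max(scores) for scores in zip(*score_lists)]

def select_top(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def majority_vote(predictions):
    """Fuse the labels predicted by several classifiers for one sample
    (assumed stand-in for the paper's decisive function)."""
    return Counter(predictions).most_common(1)[0][0]

# Two hypothetical score vectors over three features
combined = fuzzy_union([[0.1, 0.9, 0.3], [0.4, 0.2, 0.3]])
print(combined)                                          # [0.4, 0.9, 0.3]
print(select_top(combined, 2))                           # [1, 0]
print(majority_vote(["attack", "normal", "attack"]))     # attack
```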

Genetic Algorithm Ensemble Filter Methods on Kidney Disease Classification

International Journal of Innovative Computing ◽

10.11113/ijic.v11n2.345 ◽

2021 ◽

Vol 11 (2) ◽

pp. 73-80

Author(s):

Sharin Hazlin Huspi ◽

Chong Ke Ting

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Nearest Neighbor ◽

Information Gain ◽

Computational Cost ◽

Disease Classification ◽

Support Vector ◽

K Nearest Neighbor ◽

Fisher Score ◽

Filter Methods

Kidney failure affects the human body and can lead to a series of serious illnesses and even death. Machine learning plays an important role in disease classification, offering high accuracy and shorter processing time than clinical lab tests. The Chronic Kidney Disease (CKD) clinical dataset has 24 attributes, which is considered too many. To improve classification performance, filter feature selection methods are used to reduce the feature dimensions, and an ensemble algorithm then takes the union of the features selected by each filter. The filter feature selection methods implemented in this research are Information Gain (IG), Chi-Square, ReliefF, and Fisher Score. A Genetic Algorithm (GA) is used to select the best subset from the ensemble result of the filter feature selection. In this research, Random Forest (RF), XGBoost, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Naïve Bayes (NB) classification techniques were used to diagnose CKD. The selected feature subsets are different and specialised for each classifier. Removing irrelevant features through filter feature selection reduces the burden and computational cost of the genetic algorithm, which can then perform better and select the best subset, improving classifier performance with fewer attributes. The proposed genetic algorithm with the union of filter feature selections improves the performance of the classification algorithms: RF, XGBoost, KNN, and SVM achieve 100% accuracy and NB achieves 99.17%. The proposed method successfully improves classifier performance while using fewer features than previous work.
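
The pipeline described here (union of filter selections, then a genetic search over that union) might look roughly like the sketch below. Everything in it is an assumption for illustration: the function names, the truncation selection, one-point crossover and single bit-flip mutation, and the `fitness` callback, which in the paper's setting would presumably be the accuracy of a classifier trained on the candidate subset.

```python
import random

def union_of_filters(rankings, k):
    """Union of the top-k features proposed by each filter ranking."""
    selected = set()
    for ranking in rankings:
        selected.update(ranking[:k])
    return sorted(selected)

def genetic_search(candidates, fitness, generations=30, pop_size=20, seed=0):
    """Search subsets of `candidates` (encoded as bit masks) that maximise
    `fitness`. Requires at least two candidates."""
    rng = random.Random(seed)
    n = len(candidates)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    def decode(mask):
        return [f for f, bit in zip(candidates, mask) if bit]

    for _ in range(generations):
        population.sort(key=lambda m: fitness(decode(m)), reverse=True)
        parents = population[: pop_size // 2]        # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)              # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1             # single bit-flip mutation
            children.append(child)
        population = parents + children
    best = max(population, key=lambda m: fitness(decode(m)))
    return decode(best)

# Hypothetical rankings from three filters over features 0..5
candidates = union_of_filters([[3, 1, 2, 0], [1, 4, 3, 2], [0, 1, 5, 2]], k=2)
print(candidates)  # [0, 1, 3, 4]
```

Because the filters shrink the candidate pool before the genetic search starts, each bit mask is shorter and the search space is smaller, which is the cost reduction the abstract refers to.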

An Incremental Isomap Method for Hyperspectral Dimensionality Reduction and Classification

Photogrammetric Engineering & Remote Sensing ◽

10.14358/pers.87.7.445 ◽

2021 ◽

Vol 87 (6) ◽

pp. 445-455

Author(s):

Yi Ma ◽

Zezhong Zheng ◽

Yutang Ma ◽

Mingcang Zhu ◽

Ran Huang ◽

...

Keyword(s):

Manifold Learning ◽

Nearest Neighbor ◽

Hyperspectral Image ◽

Hyperspectral Data ◽

Training Data ◽

Support Vector ◽

Data Sets ◽

K Nearest Neighbor ◽

Data Set ◽

Data Points

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second best performance with support vector machine.
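
To make the O(N²) memory argument concrete, here is a small back-of-the-envelope calculation. The point counts and the 8-byte entry size are hypothetical and are not taken from the paper.

```python
def similarity_matrix_bytes(n_points, bytes_per_entry=8):
    """Memory needed to store a dense n_points x n_points matrix."""
    return n_points * n_points * bytes_per_entry

full = similarity_matrix_bytes(100_000)      # hypothetical: every pixel is a data point
landmarks = similarity_matrix_bytes(2_000)   # hypothetical landmark subset
print(f"full: {full / 1e9:.1f} GB, landmarks: {landmarks / 1e6:.1f} MB")
# prints "full: 80.0 GB, landmarks: 32.0 MB"
```

Reducing the number of points by a factor of 50 reduces the matrix by a factor of 2,500, which is why sampling landmarks makes the eigen-analysis tractable.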

A Comparison of the Analysis of Methods for Feature Extraction and Classification by Wavelet Transform in SSVEP BCIs

10.21203/rs.3.rs-82008/v1 ◽

2020 ◽

Author(s):

Hoda Heidari ◽

Zahra Einalou ◽

Mehrdad Dadgostar ◽

Hamidreza Hosseinzadeh

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Wavelet Transform ◽

Decision Tree ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Iir Filters ◽

Wide Range ◽

New Feature

Most of the studies in the field of Brain-Computer Interface (BCI) based on electroencephalography have a wide range of applications. Extracting the Steady State Visual Evoked Potential (SSVEP) is regarded as one of the most useful tools in BCI systems. This study compares methods across the whole signal processing stream: feature extraction with different statistical and spectral methods (Shannon entropy, skewness, kurtosis, mean, variance; bank of filters, narrow-band IIR filters, and wavelet transform magnitude), feature selection by various methods (decision tree, principal component analysis (PCA), t-test, Wilcoxon, receiver operating characteristic (ROC)), and classification with k-nearest neighbor (k-NN), perceptron, support vector machines (SVM), Bayesian, and multilayer perceptron (MLP). Combining these methods gave an overview of the accuracy of classical methods. In addition, the present study relied on a rather new feature selection approach based on decision tree and PCA, which is used for BCI-SSVEP systems. Finally, the obtained accuracies were calculated based on the four recorded frequencies representing four directions: right, left, up, and down.
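
The statistical features listed in this abstract can be computed from a signal window as in the sketch below. The histogram-based entropy estimate, the bin count, the population (rather than sample) variance, and the function name are assumptions; the abstract does not say how the authors computed them.

```python
import math
from collections import Counter

def statistical_features(signal, bins=8):
    """Mean, variance, skewness, kurtosis and Shannon entropy of one window."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n          # population variance
    std = math.sqrt(var)
    skew = sum(((x - mean) / std) ** 3 for x in signal) / n if std else 0.0
    kurt = sum(((x - mean) / std) ** 4 for x in signal) / n if std else 0.0

    # Shannon entropy (in bits) of the amplitude histogram
    lo, hi = min(signal), max(signal)
    width = (hi - lo) / bins or 1.0                          # avoid zero width
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in signal)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())

    return {"mean": mean, "variance": var, "skewness": skew,
            "kurtosis": kurt, "entropy": entropy}
```

Each window of the EEG signal would yield one such feature vector, which can then be passed to the feature selection and classification stages.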

Recognition of Common Non-Normal Walking Actions Based on Relief-F Feature Selection and Relief-Bagging-SVM

Sensors ◽

10.3390/s20051447 ◽

2020 ◽

Vol 20 (5) ◽

pp. 1447

Author(s):

Pan Huang ◽

Yanping Li ◽

Xiaoyi Lv ◽

Wen Chen ◽

Shuxian Liu

Keyword(s):

Feature Selection ◽

Action Recognition ◽

Nearest Neighbor ◽

Health Indicators ◽

Support Vector ◽

Normal Walking ◽

K Nearest Neighbor ◽

Recognition Algorithms ◽

Medical Health ◽

Improved Algorithm

Action recognition algorithms are widely used in the fields of medical health and pedestrian dead reckoning (PDR). The classification and recognition of non-normal walking actions and normal walking actions are very important for improving the accuracy of medical health indicators and PDR steps. Existing motion recognition algorithms focus on the recognition of normal walking actions, and the recognition of non-normal walking actions common to daily life is incomplete or inaccurate, resulting in a low overall recognition accuracy. This paper proposes a microelectromechanical system (MEMS) action recognition method based on Relief-F feature selection and relief-bagging-support vector machine (SVM). Feature selection using the Relief-F algorithm reduces the dimensions by 16 and reduces the optimization time by an average of 9.55 s. Experiments show that the improved algorithm for identifying non-normal walking actions has an accuracy of 96.63%; compared with Decision Tree (DT), it increased by 11.63%; compared with k-nearest neighbor (KNN), it increased by 26.62%; and compared with random forest (RF), it increased by 11.63%. The average Area Under Curve (AUC) of the improved algorithm improved by 0.1143 compared to KNN, by 0.0235 compared to DT, and by 0.04 compared to RF.
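
For readers unfamiliar with Relief-style feature weighting, the following sketch shows the basic two-class Relief update with one nearest hit and one nearest miss per sample. It is a simplification, not the Relief-F variant used in the paper (which averages over several neighbours and handles multiple classes), and it assumes features scaled to [0, 1] and at least two samples per class.

```python
def relief_weights(samples, labels):
    """Basic two-class Relief: a feature gains weight when it differs on the
    nearest sample of the other class (miss) and loses weight when it
    differs on the nearest sample of the same class (hit)."""
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))  # Manhattan distance

    for i, (x, y) in enumerate(zip(samples, labels)):
        others = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        hit = min((s for s, l in others if l == y), key=lambda s: distance(x, s))
        miss = min((s for s, l in others if l != y), key=lambda s: distance(x, s))
        for f in range(n_features):
            weights[f] += (abs(x[f] - miss[f]) - abs(x[f] - hit[f])) / len(samples)
    return weights

# Hypothetical data: feature 0 separates the classes, feature 1 does not
samples = [(0.0, 0.3), (0.1, 0.9), (0.9, 0.2), (1.0, 0.8)]
labels = [0, 0, 1, 1]
print(relief_weights(samples, labels))
```

Features with low or negative weight are candidates for removal, which is how a Relief-type filter reduces the feature dimension before classification.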

Identification of Leukemia Subtypes from Microscopic Images Using Convolutional Neural Network

Diagnostics ◽

10.3390/diagnostics9030104 ◽

2019 ◽

Vol 9 (3) ◽

pp. 104 ◽

Cited By ~ 11

Author(s):

Ahmed ◽

Yigit ◽

Isik ◽

Alpkocak

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Nearest Neighbor ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data ◽

Support Vector ◽

K Nearest Neighbor ◽

Data Set ◽

Leukemia Data

Leukemia is a fatal cancer and has two main types: acute and chronic. Each type has two subtypes: lymphoid and myeloid. Hence, in total, there are four subtypes of leukemia. This study proposes a new approach for diagnosing all subtypes of leukemia from microscopic blood cell images using convolutional neural networks (CNN), which require a large training data set. Therefore, we also investigated the effects of data augmentation for synthetically increasing the number of training samples. We used two publicly available leukemia data sources: ALL-IDB and ASH Image Bank. Next, we applied seven different image transformation techniques as data augmentation. We designed a CNN architecture capable of recognizing all subtypes of leukemia. In addition, we explored other well-known machine learning algorithms such as naive Bayes, support vector machine, k-nearest neighbor, and decision tree. To evaluate our approach, we set up a set of experiments and used 5-fold cross-validation. The results showed that our CNN model achieves 88.25% and 81.74% accuracy in leukemia versus healthy classification and multiclass classification of all subtypes, respectively. Finally, we also showed that the CNN model performs better than the other well-known machine learning algorithms.
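
The 5-fold cross-validation protocol mentioned here can be sketched as an index splitter. The shuffling seed, the interleaved fold assignment, and the function name are assumptions for illustration; the abstract does not state how the folds were built.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

for train, test in k_fold_indices(23, k=5):
    print(len(train), len(test))
```

Every sample is used for testing exactly once, and the reported accuracy is typically the mean over the k folds. When augmentation is used, applying it only to the training indices of each fold keeps transformed copies of a test image out of training.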

TOWARDS AN AUTOMATIC DIAGNOSIS SYSTEM FOR LUMBAR DISC HERNIATION: THE SIGNIFICANCE OF LOCAL SUBSET FEATURE SELECTION

Biomedical Engineering Applications Basis and Communications ◽

10.4015/s1016237218500448 ◽

2018 ◽

Vol 30 (06) ◽

pp. 1850044 ◽

Cited By ~ 1

Author(s):

Elias Ebrahimzadeh ◽

Farahnaz Fayaz ◽

Mehran Nikravan ◽

Fereshteh Ahmadi ◽

Mohammadjavad Rahimi Dolatabad

Keyword(s):

Feature Selection ◽

Lumbar Disc Herniation ◽

Disc Herniation ◽

Nearest Neighbor ◽

Lumbar Disc ◽

Support Vector ◽

K Nearest Neighbor ◽

Daily Lives ◽

Automatic Diagnosis ◽

Cad System

Herniation in the lumbar area is one of the most common diseases and results in lower back pain (LBP), causing discomfort and inconvenience in patients’ daily lives. A computer aided diagnosis (CAD) system can be of immense benefit, as it generates diagnostic results within a short time while increasing the precision of diagnosis and eliminating human errors. We propose a new method for the automatic diagnosis of lumbar disc herniation based on clinical MRI data. We use T2-W sagittal and myelograph images. The presented method has been applied to 30 clinical cases, each containing 7 discs (210 lumbar discs), for herniation diagnosis. We employ the Otsu thresholding method to extract the spinal cord from MR images of the lumbar disc. A third order polynomial is then fitted to the extracted spinal cords, and by the end of the preprocessing stage, all the T2-W sagittal images have been prepared for specifying the disc boundary and labeling. Having extracted an ROI for each disc, we proceed to use intensity and shape features for classification. The extracted features are selected by Local Subset Feature Selection. The results demonstrated 91.90%, 92.38% and 95.23% accuracy for artificial neural network, K-nearest neighbor and support vector machine (SVM) classifiers respectively, indicating the superiority of the proposed method to those mentioned in similar studies.
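
The Otsu thresholding step used for spinal cord extraction can be illustrated with a minimal pure-Python version operating on a flat list of integer grey levels. This is a generic textbook formulation, not the authors' code, and the function name and 256-level default are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance;
    pixels <= t form one class, pixels > t the other."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below t
    sum0 = 0    # sum of grey levels at or below t
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2   # proportional to between-class variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

t = otsu_threshold([10, 12, 200, 202])
print(t)  # 12
```

Pixels brighter than the returned threshold would form the foreground mask from which a structure such as the spinal cord can be extracted.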

Feature Selection and K-nearest Neighbor for Diagnosis Cow Disease

International journal of science, engineering, and information technology ◽

10.21107/ijseit.v5i02.10218 ◽

2021 ◽

Vol 5 (02) ◽

pp. 249-253

Author(s):

Yeni Kustiyahningsih

Keyword(s):

Feature Selection ◽

Nearest Neighbor ◽

Disease Classification ◽

Training Data ◽

Test Results ◽

K Nearest Neighbor ◽

Data Set ◽

Cattle Disease ◽

Cattle Diseases ◽

Cattle Breeders

A large cattle population increases the potential for cow disease to develop. Lack of knowledge about the various kinds of cattle diseases and how to handle them is one of the causes of decreasing cow productivity. The aim of this research is to classify cattle disease quickly and accurately to assist cattle breeders in accelerating the detection and handling of cattle disease. This study uses the K-Nearest Neighbour (KNN) classification method with F-Score feature selection. The KNN method is used for disease classification based on the distance between training data and test data, while F-Score feature selection is used to reduce the attribute dimensions in order to obtain the relevant attributes. The data set consists of 350 records on cattle disease in Madura, with 21 features and 7 classes. The data were split using K-fold cross-validation with k = 5. Based on the test results, the best accuracy was obtained with 18 features and KNN (k = 3), which resulted in an accuracy of 94.28571%, a recall of 0.942857 and a precision of 0.942857.
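
The distance-based classification described in this abstract can be sketched as a k-nearest-neighbour vote with k = 3. The Euclidean distance, the toy data, and the class names are assumptions for the example; the abstract does not specify the distance function.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points
    (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda xy: math.dist(xy[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical two-feature training set with two disease classes
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_x, train_y, (0.5, 0.5)))  # A
print(knn_predict(train_x, train_y, (5.5, 5.5)))  # B
```

With feature selection in front of it, the same function would simply receive vectors restricted to the retained feature indices.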

What factors determine reviewer credibility?

Kybernetes ◽

10.1108/k-08-2019-0537 ◽

2019 ◽

Vol 49 (10) ◽

pp. 2547-2567 ◽

Cited By ~ 1

Author(s):

Himanshu Sharma ◽

Anu G. Aggarwal

Keyword(s):

Nearest Neighbor ◽

Source Credibility ◽

Personal Information ◽

Support Vector ◽

K Nearest Neighbor ◽

Data Set ◽

Content Type ◽

Linear Discriminant ◽

Travel And Tourism ◽

Hotel Booking

Purpose: The experiential nature of travel and tourism services has increased the importance of electronic word-of-mouth (EWOM) among potential customers. EWOM has a significant influence on customers’ hotel booking intention, as they tend to trust EWOM more than the messages spread by marketers. Amid the abundant reviews available online, it becomes difficult for travelers to identify the most significant ones. This calls the credibility of reviewers into question, as various online businesses allow reviewers to post feedback under a nickname or email address rather than a real name, photo or other personal information. Therefore, this study aims to determine the factors leading to reviewer credibility.

Design/methodology/approach: The paper proposes an econometric model to determine the variables that affect reviewer credibility in the hospitality and tourism sector. The proposed model uses quantifiable variables of reviewers and reviews to estimate reviewer credibility, defined as the ratio of the number of helpful votes received by a reviewer to the total number of reviews written by that reviewer. This covers both aspects of source credibility, i.e. trustworthiness and expertness. The authors used a data set from TripAdvisor.com to validate the models.

Findings: Regression analysis significantly validated the econometric models proposed here. To check the predictive efficiency of the models, predictive modeling was performed using five commonly used classifiers: random forest (RF), linear discriminant analysis, k-nearest neighbor, decision tree and support vector machine. RF gave the best accuracy for the overall model.

Practical implications: The findings of this research suggest various implications for hoteliers and managers to help retain credible reviewers in the online travel community. This will help them achieve long-term relationships with clients and increase clients’ trust in the brand.

Originality/value: To the best of the authors’ knowledge, this study is the first to use an econometric modeling approach to find the determinants of reviewer credibility. Moreover, the study contrasts with earlier works by treating reviewer credibility as an endogenous variable rather than an exogenous one.

Effect of information gain on document classification using k-nearest neighbor

10.26594/register.v8i1.2397 ◽

2022 ◽

Vol 8 (1) ◽

pp. 50

Author(s):

Rifki Indra Perwira ◽

Bambang Yuwono ◽

Risya Ines Putri Siswoyo ◽

Febri Liantoni ◽

Hidayatulah Himawan

Keyword(s):

Feature Selection ◽

Test Data ◽

Nearest Neighbor ◽

Intelligent System ◽

Information Gain ◽

Training Data ◽

State Universities ◽

Features Selection ◽

K Nearest Neighbor ◽

Support Students

State universities have a library as a facility to support students’ education and science, which contains various books, journals, and final assignments. An intelligent system for classifying documents is needed to make things easier for library visitors in higher education, as a form of service to students. The documents in the library are generally the result of research. Various complaints about the imbalance of text data and categories based on irrelevant document titles, and about words with ambiguous meaning when searching for documents, are the main reasons a classification system is needed. This research uses k-Nearest Neighbor (k-NN) to categorize documents by study interest, with information gain feature selection to handle unbalanced data and cosine similarity to measure the distance between test and training data. In tests conducted with 276 training data, the highest result with information gain feature selection, using 80% training data and 20% test data, is an accuracy of 87.5% with k=5. The highest accuracy, 92.9%, is achieved without information gain feature selection, with 90% training data and 10% test data and k=5, 7, and 9. This paper concludes that the system is more accurate without information gain feature selection than with it, because every word in the document title is considered to have an essential role in forming the classification.
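
The cosine similarity used here to compare test and training documents can be sketched on raw term counts. Whitespace tokenisation, lower-casing, and the example titles are assumptions; the paper may use a different weighting (for example TF-IDF) before computing the similarity.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("data mining", "image processing"))  # 0.0
print(round(cosine_similarity("data mining", "data science"), 3))  # 0.5
```

A k-NN classifier built on this measure would rank the training titles by similarity to the query title and take a majority vote among the k most similar ones.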
