Evaluation of the Performance for Popular Three Classifiers on Spam Email without using FS methods

2021 ◽  
Vol 16 ◽  
pp. 121-132
Author(s):  
Ghada AL-Rawashdeh ◽  
Rabiei Bin Mamat ◽  
Jawad Hammad Rawashdeh

Email is one of the most economical and fastest means of communication; however, the rate of spam email has risen sharply in recent times as the number of email users has grown. Emails are mainly classified into spam and non-spam categories using data mining classification techniques. This paper describes and compares the performance of three classifiers: k-nearest neighbor (K-NN), Naive Bayesian (NB), and support vector machine (SVM). Seven spam email datasets were used to conduct experiments in the MATLAB environment without any feature selection method. The simulation results showed that the SVM classifier achieved better classification accuracy than K-NN and NB.

Author(s):  
B. Venkatesh ◽  
J. Anuradha

In microarray data, high classification accuracy is difficult to achieve because the data are high-dimensional, noisy, and contain irrelevant features; they also comprise many gene expression measurements but few samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks produced by the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods; the ranks are aggregated using Fuzzy Gaussian membership function ordering. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used to select the optimal features, and an RBF kernel-based Support Vector Machine (SVM) classifier is used as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets. Performance metrics such as Accuracy, Recall, Precision, and F1-Score are used for evaluation. The experimental results show that the proposed method outperforms the other feature selection methods.


Author(s):  
Gang Liu ◽  
Chunlei Yang ◽  
Sen Liu ◽  
Chunbao Xiao ◽  
Bin Song

A feature selection method based on mutual information and support vector machine (SVM) is proposed to eliminate redundant features and improve classification accuracy. First, the local correlation between features and the overall correlation are calculated using mutual information. The correlation reflects the information inclusion relationship between features, so the features are evaluated and redundant features are eliminated by analyzing the correlation. Subsequently, the concept of mean impact value (MIV) is defined, and the degree of influence of input variables on output variables for the SVM network is calculated based on MIV. The importance weights of the features described by MIV are sorted in descending order. Finally, the SVM classifier is used to implement feature selection according to the classification accuracy of feature combinations, taking the MIV order of the features as a reference. Simulation experiments are carried out on three standard UCI data sets, and the results show that this method not only effectively reduces the feature dimension and achieves high classification accuracy, but also ensures good robustness.
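The mutual-information quantity that this abstract relies on can be illustrated with a minimal sketch. It assumes discrete-valued features and toy data invented here; the function name `mutual_information` is hypothetical, and the paper's local/overall correlation and MIV steps are not reproduced.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        # p(x,y) / (p(x) p(y)) written with raw counts to limit rounding
        mi += p_joint * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy features (not from the paper)
f1 = [0, 0, 1, 1, 0, 0, 1, 1]
f2 = list(f1)                      # exact copy of f1: fully redundant
f3 = [0, 1, 0, 1, 0, 1, 0, 1]      # statistically independent of f1

mi_redundant = mutual_information(f1, f2)
mi_independent = mutual_information(f1, f3)
```

A high value between two features (here `f1` and `f2`) signals that one of them adds little information, which is the kind of redundancy the abstract describes eliminating.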


2010 ◽  
Vol 44-47 ◽  
pp. 1130-1134
Author(s):  
Sheng Li ◽  
Pei Lin Zhang ◽  
Bing Li

Feature selection is a key step in hydraulic system fault diagnosis. Some of the collected features are unrelated to the classification model, and some are highly correlated with other features. These features are harmful when establishing a classification model. To solve this problem, genetic algorithm-partial least squares (GA-PLS) is proposed for selecting representative and optimal features. The K-nearest neighbor (KNN) algorithm is used for diagnosing and classifying hydraulic system faults. To demonstrate the performance of GA-PLS, the original data of a model engineering hydraulic system are used, and the results of GA-PLS are compared with those obtained using all features and with GA. The experimental results show that the proposed feature selection method can diagnose and classify hydraulic system faults more efficiently while using fewer features.


Author(s):  
SHITALA PRASAD ◽  
GYANENDRA K. VERMA ◽  
BHUPESH KUMAR SINGH ◽  
PIYUSH KUMAR

This paper proposes a novel approach to feature extraction based on the segmentation and morphological alteration of handwritten multi-lingual characters. We explored multi-resolution and multi-directional transforms such as the wavelet, curvelet, and ridgelet transforms to extract classifying features from handwritten multi-lingual images. The pros and cons of each multi-resolution algorithm are discussed, and we conclude that curvelet-based feature extraction is the most promising for multi-lingual character recognition. We also applied morphological operations such as thinning and thickening, and then performed feature-level fusion to create a robust feature vector for classification. Classification is performed with K-nearest neighbor (K-NN) and support vector machine (SVM) classifiers, and their relative performance is reported. We experiment with our in-house dataset, compiled in our lab by more than 50 personnel.


2021 ◽  
Vol 15 ◽  
Author(s):  
Jingwen Feng ◽  
Bo Hu ◽  
Jingting Sun ◽  
Junpeng Zhang ◽  
Wen Wang ◽  
...  

Background: The daily use of social media could nurture a fragmented reading habit. However, little is known about whether fragmented reading (FR) affects cognition and what underlying electroencephalogram (EEG) alterations it may lead to.
Purpose: This study aimed to identify whether individuals have FR habits based on single-trial EEG spectral features using machine learning (ML), as well as to find out the potential cognitive impairment induced by FR.
Methods: Subjects were recruited through a questionnaire and divided into FR and noFR groups according to the time they spent on FR per day. Moreover, 64-channel EEG was acquired during a Continuous Performance Task (CPT) and segmented into 0.5–1.5 s post-stimulus epochs under cue and background conditions. The sample sizes were as follows: FR in cue condition, 692 trials; noFR in cue condition, 688 trials; FR in background condition, 561 trials; noFR in background condition, 585 trials. For these single trials, the relative power (RP) of six frequency bands [delta (1–3 Hz), theta (4–7 Hz), alpha (8–13 Hz), beta1 (14–20 Hz), beta2 (21–29 Hz), lower gamma (30–40 Hz)] was extracted as features. After feature selection, the most important feature sets were fed into three ML models, namely Support-Vector Machine (SVM), K-Nearest Neighbor (KNN), and Naive Bayes, to identify FR. The RP of the six frequency bands was also used as feature sets to conduct classification tasks.
Results: The classification accuracy reached up to 96.52% in the SVM model under cue conditions. Specifically, among the six frequency bands, the most important features were found in the alpha and gamma bands. Gamma achieved the highest classification accuracy (86.69% for cue, 86.45% for background). In both conditions, alpha RP in central sites was stronger in FR than in noFR (p < 0.001). Gamma RP in the frontal site was weaker in FR than in noFR in the background condition (p < 0.001), while alpha RP in parieto-occipital sites was stronger in FR than in noFR in the cue condition (p < 0.001).
Conclusion: Fragmented reading can be identified based on single-trial EEG evoked by CPT using ML, and the RP of alpha and gamma may reflect the impairment of attention and working memory by FR. FR might lead to cognitive impairment and is worth further exploration.
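The relative-power features named in the Methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the band edges come from the abstract, but the 128 Hz sampling rate, the naive DFT periodogram, the normalisation by the summed power of the six bands, and the names `power_spectrum` and `relative_power` are all choices made here.

```python
import cmath
import math

# Band edges in Hz, as listed in the abstract
BANDS = {
    "delta": (1, 3),
    "theta": (4, 7),
    "alpha": (8, 13),
    "beta1": (14, 20),
    "beta2": (21, 29),
    "lower_gamma": (30, 40),
}


def power_spectrum(signal, fs):
    """Map frequency (Hz) to squared DFT magnitude for a real signal (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = {}
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum[k * fs / n] = abs(coeff) ** 2
    return spectrum


def relative_power(signal, fs):
    """Power in each band divided by the summed power of all six bands (assumed normalisation)."""
    spectrum = power_spectrum(signal, fs)
    band_power = {
        name: sum(p for f, p in spectrum.items() if lo <= f <= hi)
        for name, (lo, hi) in BANDS.items()
    }
    total = sum(band_power.values())
    return {name: p / total for name, p in band_power.items()}


# Hypothetical 1 s single-channel epoch: a pure 10 Hz oscillation sampled at an assumed 128 Hz
fs = 128
epoch = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
rp = relative_power(epoch, fs)
```

For this synthetic epoch almost all power falls in the alpha band; with real EEG, one such six-value vector per channel and trial would form the feature set passed to feature selection.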


Author(s):  
M. Jupri ◽  
Riyanarto Sarno

Achieving optimal tax receipts requires effective and efficient tax supervision, which can be achieved by classifying taxpayer compliance with tax regulations. Considering this issue, this paper proposes the classification of taxpayer compliance using data mining algorithms, i.e. C4.5, Support Vector Machine, K-Nearest Neighbor, Naive Bayes, and Multilayer Perceptron (MLP), based on taxpayer compliance data. Taxpayer compliance can be classified into four classes: (1) formal and material compliant taxpayers, (2) formal compliant taxpayers, (3) material compliant taxpayers, and (4) formal and material non-compliant taxpayers. Furthermore, the results of the data mining algorithms are compared using Fuzzy AHP and TOPSIS to determine the best-performing classifier based on the criteria of Accuracy, F-Score, and Time required. The priority of taxpayers for more detailed supervision at each level of taxpayer compliance is ranked using Fuzzy AHP and TOPSIS based on criteria derived from the dataset variables. The results show that C4.5 is the best-performing classifier, with a preference value of 0.998, whereas the MLP algorithm has the lowest preference value of 0.131. Alternative taxpayer A233 is the top-priority taxpayer with a preference value of 0.433, whereas alternative taxpayer A051 is the lowest-priority taxpayer with a preference value of 0.036.
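The preference values mentioned in this abstract are the kind of closeness score that TOPSIS produces. The sketch below shows classical TOPSIS only, assuming crisp weights are already given; the Fuzzy AHP weighting step is not implemented, and the classifier names, scores, and weights are hypothetical values invented for illustration.

```python
import math


def topsis(matrix, weights, benefit):
    """Return the TOPSIS closeness score (0..1) of each alternative (row).

    matrix  : rows are alternatives, columns are criteria
    weights : one weight per criterion (assumed given, e.g. from an AHP step)
    benefit : True if larger is better for that criterion, False if smaller is better
    """
    n_crit = len(weights)
    # Vector-normalise each column, then apply the criterion weight
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]

    scores = []
    for row in weighted:
        d_pos = math.sqrt(sum((v - i) ** 2 for v, i in zip(row, ideal)))
        d_neg = math.sqrt(sum((v - a) ** 2 for v, a in zip(row, anti)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical results: accuracy, F-score, time in seconds (not from the paper)
names = ["clf_a", "clf_b", "clf_c"]
results = [
    [0.95, 0.94, 2.0],
    [0.80, 0.78, 9.0],
    [0.90, 0.88, 4.0],
]
weights = [0.5, 0.3, 0.2]          # assumed crisp weights
benefit = [True, True, False]      # time is a cost criterion

scores = topsis(results, weights, benefit)
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```

An alternative that is best on every criterion coincides with the ideal solution and scores 1, while one that is worst on every criterion scores 0; everything else falls in between.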


2022 ◽  
Vol 65 (1) ◽  
pp. 75-86
Author(s):  
Parth C. Upadhyay ◽  
John A. Lory ◽  
Guilherme N. DeSouza ◽  
Timotius A. P. Lagaunne ◽  
Christine M. Spinka

Highlights
A machine learning framework estimated residue cover in RGB images taken at three resolutions from 88 locations.
The best results primarily used texture features, the RFE-SVM feature selection method, and the SVM classifier.
Accounting for shadows and plants plus modifying and optimizing the texture features may improve performance.
An automated system developed using machine learning is a viable strategy to estimate residue cover from RGB images obtained with handheld or UAV platforms.
Abstract. Maintaining plant residue on the soil surface contributes to sustainable cultivation of arable land. Applying machine learning methods to RGB images of residue could overcome the subjectivity of manual methods. The objectives of this study were to use supervised machine learning while identifying the best feature selection method, the best classifier, and the most effective image feature types for classifying residue levels in RGB imagery. Imagery was collected from 88 locations in 40 row-crop fields in five Missouri counties between early May and late June in 2018 and 2019 using a tripod-mounted camera (0.014 cm pixel⁻¹ ground sampling distance, GSD) and an unmanned aerial vehicle (UAV, 0.05 and 0.14 GSD). At each field location, 50 contiguous 0.3 × 0.2 m region of interest (ROI) images were extracted from the imagery, resulting in a dataset of 4,400 ROI images at each GSD. Residue percentages for ground truth were estimated using a bullseye grid method (n = 100 points) based on the 0.014 GSD images. Representative color, texture, and shape features were extracted and evaluated using four feature selection methods and two classifiers. Recursive feature elimination using support vector machine (RFE-SVM) was the best feature selection method, and the SVM classifier performed best for classifying the amount of residue as a three-class problem. The best features for this application were associated with texture, with local binary pattern (LBP) features being the most prevalent for all three GSDs. Shape features were irrelevant. The three residue classes were correctly identified with 88%, 84%, and 81% 10-fold cross-validation scores for the 2018 training data and 81%, 69%, and 65% accuracy for the 2019 testing data in decreasing resolution order. Converting image-wise data (0.014 GSD) to location residue estimates using a Bayesian model showed good agreement with the location-based ground truth (r² = 0.90). This initial assessment documents the use of RGB images to match other methods of estimating residue, with potential to replace or be used as a quality control for line-transect assessments.
Keywords: Feature selection, Soil erosion, Support vector machine, Texture features, Unmanned aerial vehicle.


2021 ◽  
Author(s):  
Chunyuan Wang ◽  
Yatao Zhang ◽  
Xinge Jiang ◽  
Feifei Liu ◽  
Zhimin Zhang ◽  
...  

This paper proposes a feature selection method that combines multi-time-scale analysis with heart rate variability (HRV) analysis for middle and early diagnosis of congestive heart failure (CHF). In previous studies regarding the diagnosis of CHF, researchers have tended to increase the variety of HRV features by searching for new ones, or to use different machine learning algorithms to optimize the classification of CHF and normal sinus rhythm (NSR) subjects. In fact, the full utilization of traditional HRV features can also improve classification accuracy. The proposed method constructs a multi-time-scale feature matrix from traditional HRV features that exhibit good stability across multiple time scales and differences between time scales. The multi-scale features yield better performance than the traditional single-time-scale features when fed into a support vector machine (SVM) classifier, and the results of the SVM classifier exhibit a sensitivity, a specificity, and an accuracy of 99.52%, 100.00%, and 99.83%, respectively. These results indicate that the proposed feature selection method can effectively reduce redundant features and computational load when used for automatic diagnosis of CHF.
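For reference, the three reported metrics follow from the counts of a binary confusion matrix. The sketch below uses hypothetical labels (1 standing for CHF, 0 for NSR) invented for illustration; the helper name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity and accuracy from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "sensitivity": tp / (tp + fn),   # share of positives that were detected
        "specificity": tn / (tn + fp),   # share of negatives that were rejected
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels: 1 = CHF, 0 = NSR (not data from the paper)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

With these toy labels there are 3 true positives, 1 false negative, 5 true negatives, and 1 false positive.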


2020 ◽  
Vol 10 (3) ◽  
pp. 769-774
Author(s):  
Shiliang Shao ◽  
Ting Wang ◽  
Chunhe Song ◽  
Yun Su ◽  
Xingchi Chen ◽  
...  

In this paper, eight novel instantaneous indices of short-time heart rate variability (HRV) signals are proposed for the prediction of cardiovascular and cerebrovascular events. The indices are based on Bubble Entropy (BE) and Singular Value Decomposition (SVD). They are calculated as follows. First, the instantaneous amplitude (IA), instantaneous frequency (IF), and instantaneous phase (IP) of the HRV signals are estimated by the Hilbert transform. Second, the BE and singular value (SV) of the HRV, IA, IF, and IP are calculated, yielding eight novel indices: BEHRV, BEIA, BEIF, BEIP, SVHRV, SVIA, SVIF, and SVIP. Finally, to evaluate the performance of the eight indices for the prediction of cardiovascular and cerebrovascular events, a difference analysis of the indices is carried out using a t-test. According to the p values, seven of the eight indices (BEHRV, BEIA, BEIF, BEIP, SVIA, SVIF, and SVIP) are considered able to discriminate between the E group and the N group. K-nearest neighbor (KNN), support vector machine (SVM), and decision tree (DT) classifiers are applied to these seven indices. The results show that the seven indices differ significantly between the events and non-events groups, and that the SVM classifier has the highest classification accuracy (Acc) and specificity (Spe) for the prediction of cardiovascular and cerebrovascular events, at 88.31% and 90.19%, respectively.
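The first step, estimating IA, IP, and IF with the Hilbert transform, can be sketched in plain Python. This is an illustration under assumptions: an evenly sampled signal, a naive DFT-based analytic signal, and a synthetic 5 Hz cosine in place of real HRV data. The function names are hypothetical, and the Bubble Entropy and SVD steps are not shown.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via the DFT (naive O(n^2) implementation)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # Keep DC (and Nyquist for even n), double positive frequencies, drop negative ones
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        gain[k] = 2.0
    return [sum(gain[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def instantaneous_features(x, fs):
    """Return (amplitude, unwrapped phase in rad, frequency in Hz) of a real signal."""
    z = analytic_signal(x)
    amplitude = [abs(v) for v in z]
    phase = [cmath.phase(v) for v in z]
    # Unwrap the phase so successive samples never jump by more than pi
    unwrapped = [phase[0]]
    for p in phase[1:]:
        delta = p - unwrapped[-1]
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + delta)
    # Instantaneous frequency is the phase derivative divided by 2*pi
    frequency = [(b - a) * fs / (2 * math.pi)
                 for a, b in zip(unwrapped, unwrapped[1:])]
    return amplitude, unwrapped, frequency


# Synthetic test signal: a 5 Hz cosine sampled at an assumed 64 Hz for 1 s
fs = 64
signal = [math.cos(2 * math.pi * 5 * t / fs) for t in range(fs)]
ia, ip, inst_f = instantaneous_features(signal, fs)
```

For the pure cosine, the amplitude stays at 1 and the frequency at 5 Hz throughout; a real HRV series would first need resampling to an even time grid before this step applies.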


2012 ◽  
Vol 263-266 ◽  
pp. 1773-1777
Author(s):  
Hong Yu ◽  
Xiao Lei Huang ◽  
Zhi Ling Wei ◽  
Chen Xia Yang

Mining (classifying or clustering) retrieval results to support the relevance feedback mechanism of a search engine is an important way to improve retrieval effectiveness. Unlike plain text documents, XML documents are semi-structured data, so the classification of XML retrieval results should consider exploiting the structure features of XML documents, such as tag paths and edges. We propose using a Support Vector Machine (SVM) classifier to classify XML retrieval results by exploiting both their content and structure features. We implemented the classification method on XML retrieval results based on the IEEE SC corpus. Compared with k-nearest neighbor (KNN) classification on the same dataset in our application, SVM performs better. The experimental results also show that the use of structure features, especially tag paths and edges, can improve classification performance significantly.
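One plausible reading of the structure features can be sketched with the Python standard library. The abstract does not define tag paths or edges precisely, so the definitions here are assumptions: a tag path is the root-to-element sequence of tag names, and an edge is a parent/child tag pair. The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_features(xml_text):
    """Count tag paths (root-to-element) and edges (parent tag, child tag) in an XML document."""
    root = ET.fromstring(xml_text)
    paths = Counter()
    edges = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] += 1
        for child in node:
            edges[(node.tag, child.tag)] += 1
            walk(child, path)

    walk(root, "")
    return paths, edges


# Hypothetical document, not taken from the corpus used in the paper
doc = "<article><title>t</title><sec><p>a</p><p>b</p></sec></article>"
paths, edges = structure_features(doc)
```

The resulting counts could be appended to a bag-of-words content vector so that a classifier sees both content and structure.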

