Optimization of K Value in KNN Algorithm for Spam and Ham Email Classification

2020 ◽  
Vol 4 (2) ◽  
pp. 377-383
Author(s):  
Eko Laksono ◽  
Achmad Basuki ◽  
Fitra Bachtiar

There are many cases of email abuse with the potential to harm others. Such abuse is commonly known as spam, which contains advertisements, phishing scams, and even malware. This study aims to classify spam and ham emails using the KNN method as an effort to reduce the amount of spam. KNN can classify an email as spam or ham by checking it under different K values. Evaluation of the classification with a confusion matrix showed that KNN with K = 1 had the highest accuracy, 91.4%. The study further shows that optimizing the K value in KNN using frequency distribution clustering can produce an accuracy as high as 100%, while k-means clustering produces an accuracy of 99%. Based on these accuracy values, both frequency distribution clustering and k-means clustering can be used to find the optimal K value for KNN in classifying spam emails.
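The K-sweep the abstract describes can be sketched with a from-scratch KNN over toy bag-of-words features; the word counts and labels below are invented purely for illustration, not taken from the study's dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to all training emails
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = y_train[nearest]
    return int(np.round(votes.mean()))            # 1 = spam, 0 = ham

# Toy word-count features: [count("free"), count("win"), count("meeting")]
X_train = np.array([[3, 2, 0], [4, 1, 0], [0, 0, 3], [1, 0, 2], [5, 3, 1], [0, 1, 4]])
y_train = np.array([1, 1, 0, 0, 1, 0])
X_test  = np.array([[4, 2, 0], [0, 0, 2]])
y_test  = np.array([1, 0])

# Sweep K and report accuracy, mirroring the study's K-value comparison
for k in (1, 3, 5):
    preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_test])
    print(k, (preds == y_test).mean())
```

Odd values of K avoid ties in the binary vote; the study's confusion-matrix evaluation would replace the simple accuracy printed here.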

2021 ◽  
Vol 3 (1) ◽  
pp. 6-16
Author(s):  
Fitria Nurhayati ◽  
Arfiani Nur Khusna ◽  
Dimas Chaerul Ekty Saputra

Thesis preparation in the Department of Informatics at Universitas Ahmad Dahlan is divided into two areas of interest: Intelligent Systems and Software, and Data Engineering. Existing thesis-title data are used only as an archive and have never been processed or classified to determine the trend of thesis topics by student interest each year. The stages of this study include data collection, splitting the data into training and test sets, manual labeling of the training data, text preprocessing, and classification using Naive Bayes. The results show that thesis titles from 2013 to 2018 trend toward the field of Intelligent Systems and Software. Accuracy testing with a confusion matrix and K-Fold Cross Validation with k = 10 gives an accuracy of 94.60%, a precision of 97.30%, and a recall of 85.70%.
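A minimal sketch of multinomial Naive Bayes text classification of the kind used in the study, with Laplace smoothing; the thesis-title snippets and labels are invented, and the 10-fold cross-validation step is omitted for brevity:

```python
import numpy as np
from collections import Counter

def train_nb(docs, labels):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing over word counts."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d.split()})
    priors, word_logprobs = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = np.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + len(vocab)
        word_logprobs[c] = {w: np.log((counts[w] + 1) / total) for w in vocab}
    return classes, priors, word_logprobs

def predict_nb(model, doc):
    classes, priors, word_logprobs = model
    # Out-of-vocabulary words are simply skipped (contribute log 1 = 0)
    scores = {c: priors[c] + sum(word_logprobs[c].get(w, 0.0) for w in doc.split())
              for c in classes}
    return max(scores, key=scores.get)

# Invented thesis-title fragments labelled by area of interest
docs = ["neural network image detection", "expert system diagnosis",
        "database design web application", "information system inventory web",
        "fuzzy logic decision support", "web application sales database"]
labels = ["IS", "IS", "SDE", "SDE", "IS", "SDE"]

model = train_nb(docs, labels)
print(predict_nb(model, "neural network diagnosis"))
```

In the study, the same train-then-predict cycle would be repeated across the 10 folds to obtain the reported accuracy, precision, and recall.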


2021 ◽  
Vol 13 (16) ◽  
pp. 3176
Author(s):  
Beata Hejmanowska ◽  
Piotr Kramarczyk ◽  
Ewa Głowienka ◽  
Sławomir Mikrut

The study presents an analysis of the possible use of a limited number of Sentinel-2 and Sentinel-1 images to check whether the crop declarations that EU farmers submit to receive subsidies are true. The declarations used in the research were randomly divided into two independent sets (training and test). Based on the training set, supervised classification of both single images and their combinations was performed using the random forest algorithm in SNAP (ESA) and our own Python scripts. A comparative accuracy analysis was performed on the basis of two forms of the confusion matrix (the full confusion matrix commonly used in remote sensing and the binary confusion matrix used in machine learning) and various accuracy metrics (overall accuracy, accuracy, specificity, sensitivity, etc.). The highest overall accuracy (81%) was obtained in the simultaneous classification of multitemporal images (three Sentinel-2 and one Sentinel-1). An unexpectedly high accuracy (79%) was achieved in the classification of a single Sentinel-2 image from the end of May 2018. Noteworthy is the fact that the accuracy of the random forest method trained on the entire training set is 80%, while with the sampling method it is ca. 50%. Based on the analysis of the various accuracy metrics, it can be concluded that the metrics used in machine learning, for example specificity and accuracy, are always higher than the overall accuracy. These metrics should be used with caution because, unlike the overall accuracy, their calculation counts not only true positives but also true negatives as correct results, giving the impression of higher accuracy. Correct calculation of overall accuracy values is essential for comparative analyses. Reporting the mean per-class accuracy as the overall accuracy gives a false impression of high accuracy. In our case, the difference was 10–16% for the validation data and 25–45% for the test data.
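The gap the abstract warns about, between per-class binary metrics and the overall accuracy, can be reproduced on a small hypothetical confusion matrix (the numbers below are invented, not from the study):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = reference, cols = predicted)
cm = np.array([[50,  5,  5],
               [10, 20, 10],
               [ 5,  5, 10]])

total = cm.sum()
overall_accuracy = np.trace(cm) / total  # fraction of all samples classified correctly

# Per-class binary "accuracy", as often reported in machine learning
per_class_accuracy = []
for i in range(cm.shape[0]):
    tp = cm[i, i]
    fn = cm[i].sum() - tp
    fp = cm[:, i].sum() - tp
    tn = total - tp - fn - fp
    per_class_accuracy.append((tp + tn) / total)  # true negatives count as "correct" too

print(overall_accuracy)                # 80/120 ~ 0.667
print(np.mean(per_class_accuracy))     # inflated relative to overall accuracy
```

Because every true negative is counted as a success for each binary sub-problem, the mean per-class accuracy here exceeds the overall accuracy, which is the effect the authors quantify at 10–45%.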


2021 ◽  
Author(s):  
Rui Varandas ◽  
Bernardo Brás Gonçalves ◽  
Hugo Gamboa ◽  
Pedro Vieira

Abstract Background: Deep Learning (DL) models are able to produce accurate results in various areas. However, the medical field is especially sensitive, because every decision should be reliable and explained to the stakeholders. The high accuracy of DL models is thus a great advantage, but the fact that they function as a black box hinders their application to sensitive fields, given that they are not explainable per se. Hence, explainability methods have become important for explaining DL models in various problems. In this work, we trained different classifiers and generated explanations of their classification of electrocardiograms (ECG) by applying well-known methods. Finally, we extracted quantifiable information to evaluate the explanations of our classifiers. Methods: In this study, two datasets were built, consisting of image representations of ECGs labelled according to one specific heartbeat: 1. labelled by the last heartbeat and 2. labelled by the first heartbeat. DL models were trained on each dataset. Three different explainability methods were applied to the DL models to explain their classifications. These methods produce attribution maps in which the intensity of a pixel is proportional to its importance for the classification task. We therefore developed a metric to quantify the focus of the models on the region of interest (ROI) of the ECG representation. Results: The classification models achieved accuracy scores of around 93.66% and 91.72% on the test set. The explainability methods were successfully applied to these models. The quantification metric developed in this work demonstrated that, in most cases, the models did focus around the heartbeat of interest. The results ranged from around 8.8% in the worst case up to 32.4%, where random focus would correspond to a value of approximately 10%.
Conclusions: The classification models performed accurately on the two datasets; however, even though their focus on the ROI of the figures is higher than in the random case, the results allow the interpretation that other regions of the figures may also be important for classification. In the future, the importance of regions outside the ROI should be investigated, as well as whether specific waves of the ECG signal contribute to the classification.
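One plausible form of the ROI-focus metric described above is the share of total attribution mass falling inside the ROI, with the ROI's area fraction as the random baseline; this is an assumed reconstruction for illustration, not the paper's exact definition:

```python
import numpy as np

def roi_focus(attribution, roi_mask):
    """Fraction of the total (absolute) attribution that falls inside the ROI.

    For a spatially uniform random attribution map, this is approximately
    the ROI's share of the image area, which serves as the baseline.
    """
    a = np.abs(attribution)
    return a[roi_mask].sum() / a.sum()

rng = np.random.default_rng(0)
attr = rng.random((64, 64))            # stand-in for a real attribution map
mask = np.zeros((64, 64), dtype=bool)
mask[:, 28:34] = True                  # hypothetical ROI: ~9% of the image

baseline = mask.mean()                 # expected focus under random attribution
print(roi_focus(attr, mask), baseline)
```

A well-focused model would score well above `baseline`, matching the paper's comparison of 8.8–32.4% against a ~10% random-focus level.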


2020 ◽  
Vol 1 (1) ◽  
pp. 17
Author(s):  
Astia Weni Syaputri ◽  
Erno Irwandi ◽  
Mustakim Mustakim

Majors are important in determining student specialization; an error in placing a student will affect that student's subsequent education. In SMA Negeri 1 Kampar Timur there are two majors, namely Natural Sciences and Social Sciences. To determine these majors, the average of student grades from semester 3 to semester 5 is used as a reference, covering Islamic Religious Education, Indonesian, Citizenship Education, English, Natural Sciences, Social Sciences, and Mathematics. The Naive Bayes algorithm can be used to classify the majors at SMA Negeri 1 Kampar Timur. For this classification, training data and test data are used in a 70% and 30% split, respectively. The data were tested for accuracy using a confusion matrix, producing a fairly high accuracy of 96.19%. With this high accuracy, the Naive Bayes algorithm is well suited to determining student majors at SMA Negeri 1 Kampar Timur.
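For continuous grade averages like these, a Gaussian Naive Bayes with a 70%/30% holdout split is a natural fit; the sketch below uses synthetic grades with invented class means, not the school's data:

```python
import numpy as np

def fit_gnb(X, y):
    """Gaussian Naive Bayes: prior plus per-class mean/variance for each feature."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-6)
    return params

def predict_gnb(params, x):
    def log_posterior(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_posterior(*params[c]))

# Synthetic average grades: [science-subject average, social-subject average]
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([85, 75], 3, (20, 2)),   # Natural Sciences students
               rng.normal([74, 86], 3, (20, 2))])  # Social Sciences students
y = np.array(["IPA"] * 20 + ["IPS"] * 20)

# 70%/30% train/test split, as in the study
idx = rng.permutation(len(X))
split = int(0.7 * len(X))
train, test = idx[:split], idx[split:]
model = fit_gnb(X[train], y[train])
preds = np.array([predict_gnb(model, x) for x in X[test]])
print((preds == y[test]).mean())
```

The printed holdout accuracy plays the same role as the 96.19% confusion-matrix accuracy reported by the authors.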


2020 ◽  
Vol 64 (3) ◽  
pp. 30502-1-30502-15
Author(s):  
Kensuke Fukumoto ◽  
Norimichi Tsumura ◽  
Roy Berns

Abstract A method is proposed to estimate the concentration of pigments mixed in a painting, using the encoder-decoder model of neural networks. The model is trained to output a value that is the same as its input, and its middle output extracts a certain feature as compressed information about the input. In this instance, the input and output are spectral data of a painting, and the model is trained with pigment concentration as the middle output. A dataset containing the scattering and absorption coefficients of each of 19 pigments was used. The Kubelka-Munk theory was applied to the coefficients to obtain many patterns of synthetic spectral data, which were used for training. The proposed method was tested using spectral images of 33 paintings, showing that it estimates, with high accuracy, concentrations whose spectra closely match those of the target pigments.
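The Kubelka-Munk step used to synthesize training spectra can be sketched as follows: for an opaque mixture, K/S is the concentration-weighted sum of absorption coefficients over the weighted sum of scattering coefficients, and reflectance follows from K/S via the standard relation K/S = (1 − R)² / 2R. The coefficients below are invented single-wavelength values, not the paper's 19-pigment dataset:

```python
import numpy as np

def mixture_reflectance(conc, K, S):
    """Kubelka-Munk reflectance of an opaque pigment mixture at one wavelength.

    conc : pigment concentrations; K : absorption coefficients;
    S : scattering coefficients (one value per pigment).
    """
    ks = np.dot(conc, K) / np.dot(conc, S)      # K/S of the mixture
    return 1 + ks - np.sqrt(ks ** 2 + 2 * ks)   # invert K/S = (1 - R)^2 / (2 R)

# Invented coefficients for three pigments at a single wavelength
K = np.array([0.8, 0.1, 2.0])
S = np.array([1.0, 1.2, 0.5])
print(mixture_reflectance(np.array([0.5, 0.3, 0.2]), K, S))
```

Evaluating this over many wavelengths and many random concentration vectors yields the kind of synthetic spectra the encoder-decoder model is trained on.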


Diagnostics ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 233
Author(s):  
Dong-Woon Lee ◽  
Sung-Yong Kim ◽  
Seong-Nyum Jeong ◽  
Jae-Hong Lee

Fracture of a dental implant (DI) is a rare mechanical complication that is a critical cause of DI failure and explantation. The purpose of this study was to evaluate the reliability and validity of three different deep convolutional neural network (DCNN) architectures (VGGNet-19, GoogLeNet Inception-v3, and an automated DCNN) for the detection and classification of fractured DIs using panoramic and periapical radiographic images. A total of 21,398 DIs were reviewed at two dental hospitals, and 251 intact and 194 fractured DI radiographic images were identified and included as the dataset in this study. All three DCNN architectures achieved fractured-DI detection and classification accuracy of over 0.80 AUC. In particular, the automated DCNN architecture using periapical images showed the highest and most reliable detection (AUC = 0.984, 95% CI = 0.900–1.000) and classification (AUC = 0.869, 95% CI = 0.778–0.929) accuracy compared to the fine-tuned, pre-trained VGGNet-19 and GoogLeNet Inception-v3 architectures. The three DCNN architectures showed acceptable accuracy in the detection and classification of fractured DIs, with the best performance achieved by the automated DCNN architecture using only periapical images.


Sensors ◽  
2019 ◽  
Vol 19 (4) ◽  
pp. 916 ◽  
Author(s):  
Wen Cao ◽  
Chunmei Liu ◽  
Pengfei Jia

Aroma plays a significant role in the quality of citrus fruits and processed products. The detection and analysis of citrus volatiles can be performed with an electronic nose (E-nose); in this paper, an E-nose is employed to classify juice stored for different numbers of days. Feature extraction and classification are two important requirements for an E-nose. During training, a classifier can optimize its own parameters to achieve better classification accuracy, but it cannot choose its input data, which is produced by the feature extraction methods, so the classification result is not always ideal. Label-consistent KSVD (L-KSVD) is a technique that can extract features and classify the data at the same time, which can improve classification accuracy. We propose an enhanced L-KSVD, called E-LCKSVD, for the E-nose in this paper. In E-LCKSVD, we introduce a kernel function into the traditional L-KSVD and present a new initialization technique for its dictionary; finally, the weighted coefficients of the different parts of its objective function are studied, and enhanced quantum-behaved particle swarm optimization (EQPSO) is employed to optimize these coefficients. In the experimental section, we first show that the classification accuracy of KSVD and L-KSVD is improved with the help of the kernel function, demonstrating that their ability to handle nonlinear data is improved. Then, we compare the results of different dictionary initialization techniques and show that our proposed method is better. Finally, we find the optimal values of the weighted coefficients of the objective function of E-LCKSVD that give the E-nose the best performance.
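The coefficient tuning step can be illustrated with the basic quantum-behaved PSO update (a simplified sketch, not the enhanced EQPSO variant of the paper), applied to a toy two-coefficient problem with an invented quadratic objective:

```python
import numpy as np

def qpso(f, dim, n_particles=30, iters=200, beta=0.75, seed=0):
    """Minimal quantum-behaved PSO minimizing f over R^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_particles, dim))
    pbest = X.copy()
    pbest_val = np.array([f(x) for x in X])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        mbest = pbest.mean(axis=0)                        # mean of personal bests
        phi = rng.random((n_particles, dim))
        p = phi * pbest + (1 - phi) * gbest               # local attractor per particle
        u = rng.random((n_particles, dim))
        sign = np.where(rng.random((n_particles, dim)) < 0.5, -1.0, 1.0)
        X = p + sign * beta * np.abs(mbest - X) * np.log(1.0 / u)
        vals = np.array([f(x) for x in X])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = X[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Toy stand-in for tuning two objective-function weights: minimize a quadratic
best, val = qpso(lambda w: (w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2, dim=2)
print(best, val)
```

In the paper, the objective evaluated at each particle would instead be the E-LCKSVD classification performance under the candidate weighted coefficients.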


Author(s):  
Jonas Austerjost ◽  
Robert Söldner ◽  
Christoffer Edlund ◽  
Johan Trygg ◽  
David Pollard ◽  
...  

Machine vision is a powerful technology that has become increasingly popular and accurate during the last decade due to rapid advances in the field of machine learning. The majority of machine vision applications are currently found in consumer electronics, automotive applications, and quality control, yet the potential for bioprocessing applications is tremendous. For instance, detecting and controlling foam emergence is important for all upstream bioprocesses, but the lack of robust foam sensing often leads to batch failures from foam-outs or overaddition of antifoam agents. Here, we report a new low-cost, flexible, and reliable foam sensor concept for bioreactor applications. The concept applies convolutional neural networks (CNNs), a state-of-the-art machine learning system for image processing. The implemented method shows high accuracy for both binary foam detection (foam/no foam) and fine-grained classification of foam levels.


2014 ◽  
Vol 2014 ◽  
pp. 1-19
Author(s):  
Liliana Ibeth Barbosa-Santillán ◽  
Inmaculada Álvarez-de-Mon y-Rego

This paper presents an approach to create what we have called a Unified Sentiment Lexicon (USL). This approach aims at aligning, unifying, and expanding the set of sentiment lexicons available on the web in order to increase their robustness of coverage. One problem in the automatic unification of different sentiment-lexicon scores is that there are multiple lexical entries for which the classification as positive, negative, or neutral {P, N, Z} depends on the unit of measurement used in the annotation methodology of the source sentiment lexicon. Our USL approach computes the unified strength of polarity of each lexical entry based on the Pearson correlation coefficient, which measures how correlated lexical entries are with a value between 1 and −1, where 1 indicates that the lexical entries are perfectly correlated, 0 indicates no correlation, and −1 means they are perfectly inversely correlated; the UnifiedMetrics procedure is implemented for both CPU and GPU. Another problem is the high processing time required for computing all the lexical entries in the unification task. Thus, the USL approach computes a subset of lexical entries in each of the 1344 GPU cores and uses parallel processing to unify 155,802 lexical entries. The analysis conducted using the USL approach shows that the USL has 95,430 lexical entries, of which 35,201 are considered positive, 22,029 negative, and 38,200 neutral. Finally, the runtime was 10 minutes for 95,430 lexical entries, a threefold reduction in computing time for the UnifiedMetrics.
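The Pearson correlation at the core of the unification can be computed directly; the two "lexicons" below are invented five-entry score vectors on different scales, illustrating how correlation abstracts away the unit of measurement:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two score vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a -= a.mean()
    b -= b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# Hypothetical polarity scores for the same five entries in two source lexicons
lex_a = [0.9, -0.6, 0.1, 0.7, -0.8]   # scores in [-1, 1]
lex_b = [4.5, -3.1, 0.3, 3.6, -4.0]   # scores on a different scale
print(pearson(lex_a, lex_b))          # close to 1: the lexicons agree up to units
```

A value near 1 indicates the two annotation scales are consistent, so their scores can be unified; near −1 they would be consistently reversed.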

