E-Mail Spam Filtering

Author(s):  
Rohitkumar R Upadhyay

Abstract: E-mail is the most common method of communication because of its ease of access, the rapid exchange of messages, and its low cost of distribution. E-mail is one of the most secure media for online communication and for transferring data or messages over the Internet. With its ever-growing popularity, the quantity of unsolicited data has also increased rapidly. Spam causes traffic issues and bottlenecks that limit memory, bandwidth, power, and computing speed. To filter this data, different approaches exist that automatically detect and remove these unwanted messages. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes, and so on. This paper presents a survey of various existing email spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We then give the classification, evaluation, and comparison of some email spam filtering systems and summarize the accuracy rates of various existing approaches. Keywords: e-mail spam, unsolicited bulk email, spam filtering methods.
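To make one of the surveyed learning-based techniques concrete, here is a minimal sketch of a Naive Bayes spam classifier. It is not taken from the paper: the function names, the whitespace tokenizer, the Laplace smoothing, and the four toy messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(messages):
    """Count words per class. `messages` is a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the higher log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # Skip words never seen during training.
            # Laplace (add-one) smoothed word likelihood.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only.
TOY_MESSAGES = [
    ("win cash prize now", "spam"),
    ("cheap prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_naive_bayes(TOY_MESSAGES)
```

Calling `classify("claim cash prize", model)` scores the message under both classes and returns the more likely label. A real filter would need a far larger corpus and a more careful tokenizer.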

2007 ◽  
Vol 16 (04) ◽  
pp. 627-646 ◽  
Author(s):  
YAN ZHOU ◽  
MADHURI S. MULEKAR ◽  
PRAVEEN NERELLAPALLI

Unsolicited bulk e-mail, also known as spam, has been an increasing problem for the e-mail society. This paper presents a new spam filtering strategy that 1) uses a practical entropy coding technique, Huffman coding, to dynamically encode the feature space of the e-mail collected over time and, 2) applies an online algorithm to adaptively enhance the learned spam concept as new e-mail data becomes available. The contributions of this work include a highly efficient spam filtering algorithm in which the input space is radically reduced to a single-dimension input vector, and an adaptive learning technique that is robust to vocabulary change, concept drifting and skewed class distributions. We compare our technique with several existing off-line learning techniques including support vector machine, logistic regression, naïve Bayes, k-nearest neighbor, C4.5 decision tree, RBFNetwork, boosted decision tree and stacking. We demonstrate the effectiveness of our technique by presenting the experimental results on the e-mail data that is publicly available. A more in-depth statistical analysis on the experimental results is also presented and discussed.
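For readers unfamiliar with Huffman coding, the sketch below builds a prefix code from token frequencies. It only illustrates the entropy-coding step in general; how the authors map e-mail features to codes and feed them to the online learner is not described in the abstract, so the function name `huffman_codes` and the toy token list are assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(frequencies):
    """Return {symbol: bitstring} for a {symbol: count} mapping."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        # A single symbol still needs a one-bit code.
        return {symbol: "0" for symbol in frequencies}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps tuple comparison away from the dicts.
    heap = [
        (weight, i, {symbol: ""})
        for i, (symbol, weight) in enumerate(frequencies.items())
    ]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical token stream, for illustration only.
tokens = "free offer free money free offer meeting".split()
codes = huffman_codes(Counter(tokens))
```

Frequent tokens receive shorter codes, so in this toy stream `free` gets a one-bit code while the rare tokens get three bits each.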


Author(s):  
I Made Oka Widyantara ◽  
I Made Dwi Asana Putra ◽  
Ida Bagus Putu Adnyana

This paper explains the development of the Coastal Video Monitoring System (CoViMoS), whose main characteristics are low cost and easy implementation. These characteristics have been realized by using an IP camera for video image acquisition and by developing a software application whose main feature is the automatic detection of the shoreline and its changes. This capability is based on data-mining segmentation and classification techniques. The shoreline is detected by segmenting a video image of the beach into clusters of objects, namely land, sea, and sky, using the Self-Organizing Map (SOM) algorithm. Classification is done using the K-Nearest Neighbor (K-NN) algorithm, which assigns class labels to the objects generated in the segmentation process. The land class is then used as the reference object for detecting the coastline. Implementation of CoViMoS for monitoring at Cucukan Beach, Gianyar Regency, has shown that the developed system is able to detect the shoreline and its changes automatically.


Author(s):  
Wasan Shaker Awad ◽  
Wafa M. Rafiq

Email is the most popular choice of communication due to its low cost and easy accessibility, which makes email spam a major issue. Emails can be incorrectly marked by a spam filter, so legitimate emails can get lost in the spam folder, or spam emails can deluge users' inboxes. Therefore, various methods based on statistics and machine learning have been developed to classify emails accurately. In this chapter, the existing spam filtering methods were studied comprehensively, and a spam email classifier based on the genetic algorithm was proposed. The proposed algorithm was successful in achieving high accuracy by reducing the rate of false positives, while also maintaining an acceptable rate of false negatives. The proposed algorithm was tested on 2000 emails from two popular spam datasets, Enron and LingSpam, and the accuracy was found to be nearly 90%. The results showed that the genetic algorithm is an effective method for spam classification and that, with further enhancements, it will provide a more robust spam filter.


Author(s):  
Abdelkrim Latreche ◽  
Kadda Benyahia

Electronic mail has become one of the most popular and frequently used channels for personal and professional online communication. Despite its benefits, e-mail faces a major security problem: the daily reception of a large number of unsolicited electronic messages, known as “spam emails.” Today, most electronic mail systems have simple spam filtering mechanisms based on text spam filtering technologies. To circumvent these filters, spammers are introducing new techniques that embed spam messages in an image attached to the mail. In this article, the authors propose a new method for spam image filtering. The proposed system can distinguish legitimate images from spam images based on the texture characteristics of the image attached to an email. From each image, around 20 characteristics can be extracted from the gray level co-occurrence matrix (GLCM). Then, to classify the images as spam or ham, the authors use a new nature-inspired metaheuristic model for building classifiers, based on social worker bees and an enhanced nearest-centroid classification method.
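As a rough illustration of the texture step, the sketch below computes a normalized GLCM for one pixel offset and derives two common texture features, contrast and energy. The abstract does not list the roughly 20 characteristics the authors extract, so the choice of these two features, the function names, and the 4x4 toy image are assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the offset (dx, dy).

    `image` is a list of rows of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # Count the pair (level at pixel, level at its neighbor).
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Sum of p[i][j] * (i - j)^2: large for abrupt gray-level changes."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def energy(p):
    """Sum of squared entries: large for uniform textures."""
    return sum(value * value for row in p for value in row)


# Hypothetical 4x4 image with 4 gray levels, for illustration only.
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(toy_image, levels=4)
```

The resulting feature values, such as `contrast(p)` and `energy(p)`, would form part of the feature vector passed to whatever classifier is used downstream.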


2016 ◽  
Vol 12 (1) ◽  
pp. 96-102 ◽  
Author(s):  
Hayder AL-Behadili

Data-intensive science is a critical science paradigm that intersects with all other sciences. Data mining (DM) is a powerful and useful technology with a wide range of potential users; it focuses on important, meaningful patterns and discovers new knowledge from a collected dataset. Any predictive task in DM uses some attributes to classify an unknown class. Classification algorithms are a class of prominent mathematical techniques in DM. Constructing a model is the core aspect of such algorithms. However, their performance depends heavily on how the algorithm behaves when manipulating data. Focusing on binarization as a preprocessing approach, this paper analyzes and evaluates different classification algorithms, based on their accuracy in the classification task, when constructing a model. The Modified National Institute of Standards and Technology (MNIST) handwritten digits dataset provided by Yann LeCun was used in the evaluation. The paper focuses on machine learning approaches for handwritten digit detection. Machine learning provides classification methods such as K-Nearest Neighbor (KNN), Decision Tree (DT), and Neural Networks (NN). Results showed that the knowledge-based method, i.e. the NN algorithm, is more accurate in determining the digits, as it reduces the error rate. This evaluation provides essential insights for computer scientists and practitioners choosing a suitable DM technique that fits their data.


Author(s):  
Reza Safdari ◽  
Peyman Rezaei-Hachesu ◽  
Marjan GhaziSaeedi ◽  
Taha Samad-Soltani ◽  
Maryam Zolnoori

Medical data mining intends to solve real-world problems in the diagnosis and treatment of diseases. This process applies various techniques and algorithms which have different levels of accuracy and precision. The purpose of this article is to apply data mining techniques to the diagnosis of asthma. Sensitivity, specificity and accuracy of K-nearest neighbor, Support Vector Machine, naive Bayes, Artificial Neural Network, classification tree, CN2 algorithms, and related similar studies were evaluated. ROC curves were plotted to show the performance of the authors' approach. Support vector machine (SVM) algorithms achieved the highest accuracy at 98.59% with a sensitivity of 98.59% and a specificity of 98.61% for class 1. Other algorithms had a range of accuracy greater than 87%. The results show that the authors can accurately diagnose asthma approximately 98% of the time based on demographics and clinical data. The study also has a higher sensitivity when compared to expert and knowledge-based systems.
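The three reported measures follow directly from confusion-matrix counts. The sketch below shows the arithmetic on made-up labels; the function name `binary_metrics` and the toy data are assumptions and have no connection to the study's asthma dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        # Fraction of actual positives that were detected.
        "sensitivity": tp / (tp + fn),
        # Fraction of actual negatives that were correctly rejected.
        "specificity": tn / (tn + fp),
        # Fraction of all cases classified correctly.
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels (1 = disease present), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

With these toy labels there are 3 true positives, 1 false negative, 5 true negatives, and 1 false positive, so sensitivity and specificity can differ even when accuracy looks high.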


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2769
Author(s):  
Jingjing Wang ◽  
Joongoo Park

Received signal strength indication (RSSI) obtained by Medium Access Control (MAC) layer is widely used in range-based and fingerprint location systems due to its low cost and low complexity. However, RSS is affected by noise signals and multi-path, and its positioning performance is not stable. In recent years, many commercial WiFi devices support the acquisition of physical layer channel state information (CSI). CSI is an index that can characterize the signal characteristics with more fine granularity than RSS. Compared with RSS, CSI can avoid the effects of multi-path and noise by analyzing the characteristics of multi-channel sub-carriers. To improve the indoor location accuracy and algorithm efficiency, this paper proposes a hybrid fingerprint location technology based on RSS and CSI. In the off-line phase, to overcome the problems of low positioning accuracy and fingerprint drift caused by signal instability, a methodology based on the Kalman filter and a Gaussian function is proposed to preprocess the RSSI value and CSI amplitude value, and the improved CSI phase is incorporated after the linear transformation. The mutation and noisy data are then effectively eliminated, and the accurate and smoother outputs of the RSSI and CSI values can be achieved. Then, the accurate hybrid fingerprint database is established after dimensionality reduction of the obtained high-dimensional data values. The weighted k-nearest neighbor (WKNN) algorithm is applied to reduce the complexity of the algorithm during the online positioning stage, and the accurate indoor positioning algorithm is accomplished. Experimental results show that the proposed algorithm exhibits good performance on anti-noise ability, fusion positioning accuracy, and real-time filtering. Compared with CSI-MIMO, FIFS, and RSSI-based methods, the proposed fusion correction method has higher positioning accuracy and smaller positioning error.
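The online positioning stage can be illustrated with a bare weighted k-nearest neighbor estimate over a fingerprint database. This sketch leaves out the Kalman and Gaussian preprocessing, the CSI features, and the dimensionality reduction described above; the inverse-distance weighting, the function name `wknn_locate`, and the toy fingerprints are assumptions.

```python
import math


def wknn_locate(fingerprints, observed, k=3, eps=1e-6):
    """Estimate (x, y) as the weighted mean of the k closest fingerprints.

    `fingerprints` is a list of ((x, y), signal_vector) pairs and
    `observed` is a signal vector of the same length.
    """
    ranked = sorted(
        (math.dist(signature, observed), position)
        for position, signature in fingerprints
    )
    nearest = ranked[:k]
    # Inverse-distance weights; eps avoids division by zero on exact matches.
    weights = [1.0 / (distance + eps) for distance, _ in nearest]
    total = sum(weights)
    x = sum(w * pos[0] for w, (_, pos) in zip(weights, nearest)) / total
    y = sum(w * pos[1] for w, (_, pos) in zip(weights, nearest)) / total
    return x, y


# Hypothetical fingerprint database: position -> RSSI (dBm) from two APs.
toy_fingerprints = [
    ((0.0, 0.0), [-40.0, -70.0]),
    ((4.0, 0.0), [-70.0, -40.0]),
    ((0.0, 4.0), [-90.0, -90.0]),
]
estimate = wknn_locate(toy_fingerprints, [-55.0, -55.0], k=2)
```

Here the observed vector is equally far from the first two fingerprints, so the estimate falls midway between their positions.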


2018 ◽  
Vol 16 (2) ◽  
pp. e0203 ◽  
Author(s):  
Xuping Feng ◽  
Haijun Yin ◽  
Chu Zhang ◽  
Cheng Peng ◽  
Yong He

The applicability of near infrared (NIR) spectroscopy combined with chemometrics was examined to develop fast, low-cost, and non-destructive spectroscopic methods for the classification of transgenic maize plants. The transgenic maize plants containing both cry1Ab/cry2Aj-G10evo proteins and their non-transgenic parent were measured in the NIR diffuse reflectance mode with a spectral range of 700–1900 nm. Three variable selection algorithms, namely weighted regression coefficients, principal component analysis loadings, and second derivatives, were used to extract the sensitive wavelengths that contributed the most discriminative information for these genotypes. Five classification methods, namely K-nearest neighbor, Soft Independent Modeling of Class Analogy, Naive Bayes Classifier, Extreme Learning Machine (ELM), and Radial Basis Function Neural Network, were used to build discrimination models based on the preprocessed full spectra and on the sensitive wavelengths. The results demonstrated that ELM had the best performance of all methods, even though the model’s recognition ability decreased when the variables used to train the neural networks were reduced to only the sensitive wavelengths. The ELM model calculated on the calibration set showed classification rates of 100% based on the full spectrum and 90.83% based on the sensitive wavelengths. NIR spectroscopy combined with chemometrics offers a powerful tool for evaluating large numbers of samples from maize hybrid performance trials and breeding programs.


2020 ◽  
Vol 3 (2) ◽  
pp. 35-46
Author(s):  
Shereen S. Jumaa ◽  
Khamis A. Zidan

One of the safest biometrics today is the finger vein, but this technique comes with some specific challenges. The most common one is that the vein pattern is hard to extract because finger-vein images are always low in quality, which significantly hampers the feature extraction and classification stages. Specialized algorithms need to be considered alongside the conventional hardware for capturing finger-vein images, which uses red Surface Mounted Diode (SMD) LEDs for this purpose. For capturing images, a Canon 750D camera with a micro lens is used. The integrated micro lens is used to obtain high-quality images, and with some adjustments it can also capture fingerprints. Feature extraction used a combination of Hierarchical Centroid and Histogram of Gradients. Results were evaluated with K-Nearest Neighbor and Deep Neural Networks using 6-fold stratified cross-validation. Results showed an improvement over the three latest benchmarks in this field that used 6-fold validation and SDUMLA-HMT. The novelty of the work lies in the hardware design of the sensor within the finger-vein recognition system, which simultaneously provides highly secure recognition with low computation time, captures both finger vein and fingerprint at low cost, supports unlimited users on one device, and is open source.

