Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising

Deepti Sisodia; Dilip Singh Sisodia

doi:10.1108/dta-09-2021-0233

Data Technologies and Applications ◽

10.1108/dta-09-2021-0233 ◽

2022 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Deepti Sisodia ◽

Dilip Singh Sisodia

Keyword(s):

Feature Selection ◽

Online Advertising ◽

Feature Selection Method ◽

Majority Voting ◽

Feature Subset ◽

Relevant Feature ◽

Selection Methods ◽

Content Type ◽

Optimal Feature Subset ◽

Optimal Feature

Purpose The problem of choosing the most useful features from hundreds of features in time-series user click data arises in online advertising when classifying fraudulent publishers. Selecting feature subsets is a key issue in such classification tasks. In practice, filter approaches are common; however, they neglect the correlations among features. Conversely, wrapper approaches could not be applied because of their complexity. Moreover, existing feature selection methods could not handle such data, which is one of the major causes of instability in feature selection. Design/methodology/approach To overcome these issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to find the optimal subset of relevant features for analyzing publishers' fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of the relevant feature subset is performed to search for an optimal feature subset using effective machine learning (ML) models. Findings Empirical results show enhanced classification performance with the proposed features in terms of average precision, recall, f1-score and AUC in publisher identification and classification. Originality/value FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics: first, considering the original features; second, with relevant feature subsets selected by feature selection (FS) methods; third, with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.
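
The majority-voting step of the feature distillation phase can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the `min_votes` parameter and the toy feature names are assumptions made for illustration, and each inner list stands in for the output of one filter or wrapper selector.

```python
from collections import Counter


def majority_vote(selections, min_votes=None):
    """Keep features chosen by a strict majority of selectors.

    selections: one list of feature names per selector (hypothetical input).
    min_votes:  optional override of the vote threshold (an assumption,
                not something the abstract specifies).
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1  # strict majority
    # Count each feature at most once per selector.
    votes = Counter(f for chosen in selections for f in set(chosen))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Three hypothetical selectors voting on five features.
selections = [["f1", "f2", "f3"], ["f2", "f3", "f4"], ["f3", "f5"]]
print(majority_vote(selections))  # ['f2', 'f3']
```

Features that survive the vote would then feed the accumulated selection phase, whose exact evaluation procedure the abstract does not spell out.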

Accelerated Simulated Annealing and Mutation Operator Feature Selection method for Big Data

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1712.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 910-916

Keyword(s):

Feature Selection ◽

Simulated Annealing ◽

Feature Selection Method ◽

Classification Problem ◽

Feature Subset Selection ◽

Feature Subset ◽

Mutation Operator ◽

Knn Classifier ◽

Optimal Feature Subset ◽

Optimal Feature

Optimal feature subset selection over very high-dimensional data is a vital issue. Even once optimal features are selected, classification based on those features remains a complicated task. To handle these problems, a novel Accelerated Simulated Annealing and Mutation Operator (ASAMO) feature selection algorithm is suggested in this work. To solve the classification problem, the Fuzzy Minimal Consistent Class Subset Coverage (FMCCSC) problem is introduced. In FMCCSC, the consistent subset is combined with the K-Nearest Neighbour (KNN) classifier, yielding the FMCCSC-KNN classifier. Two data sets, Dorothea and Madelon, from the UCI Machine Learning Repository are used in experiments on optimal feature selection and classification. The experimental results substantiate the efficiency of the proposed ASAMO with the FMCCSC-KNN classifier compared to Particle Swarm Optimization (PSO) and Accelerated PSO feature selection algorithms.
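
As a rough illustration of the general idea only, the sketch below runs plain simulated annealing over a binary feature mask with a single bit-flip mutation per step. It is not ASAMO: the acceleration scheme and the FMCCSC-KNN classifier are not described in enough detail here to reproduce, and `anneal_feature_subset`, its parameters and the toy `evaluate` function are all assumptions.

```python
import math
import random


def anneal_feature_subset(n_features, evaluate, steps=500, t0=1.0,
                          cooling=0.99, seed=0):
    """Search for a high-scoring boolean feature mask.

    evaluate: callable taking a list of booleans and returning a score
              to maximise (hypothetical; e.g. cross-validated accuracy).
    """
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = evaluate(current)
    best, best_score = current[:], current_score
    temp = t0
    for _ in range(steps):
        candidate = current[:]
        i = rng.randrange(n_features)
        candidate[i] = not candidate[i]  # mutation: flip one feature bit
        score = evaluate(candidate)
        delta = score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = candidate[:], score
        temp *= cooling
    return best, best_score


# Toy objective: reward agreement with a hidden "ideal" mask.
ideal = [True, False, True, True, False, False, True, False]
mask, score = anneal_feature_subset(
    len(ideal), lambda m: sum(a == b for a, b in zip(m, ideal)))
print(mask, score)
```

In a real setting the objective would wrap a classifier, which makes each evaluation expensive; that cost is the usual motivation for accelerating the search.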

iAIPs: Identifying Anti-Inflammatory Peptides Using Random Forest

Frontiers in Genetics ◽

10.3389/fgene.2021.773202 ◽

2021 ◽

Vol 12 ◽

Author(s):

Dongxu Zhao ◽

Zhixia Teng ◽

Yanjuan Li ◽

Dong Chen

Keyword(s):

Feature Selection ◽

Amino Acid ◽

Random Forest ◽

Feature Selection Method ◽

Selection Strategy ◽

Feature Subset ◽

Evolutionary Analysis ◽

Anti Inflammatory ◽

Optimal Feature Subset ◽

Optimal Feature

Recently, several anti-inflammatory peptides (AIPs) have been found in the process of the inflammatory response, and these peptides have been used to treat some inflammatory and autoimmune diseases. Therefore, identifying AIPs accurately from given amino acid sequences is critical for the discovery of novel and efficient anti-inflammatory peptide-based therapeutics and for accelerating their application in therapy. In this paper, a random forest-based model called iAIPs for identifying AIPs is proposed. First, the original samples are encoded with three feature extraction methods: g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDE), and amino acid composition (AAC). Second, the optimal feature subset is generated by a two-step feature selection method, in which features are ranked by the analysis of variance (ANOVA) method and the optimal feature subset is then generated by the incremental feature selection strategy. Finally, the optimal feature subset is input into the random forest classifier, and the identification model is constructed. Experimental results show that iAIPs achieved an AUC value of 0.822 on an independent test dataset, which indicates that the proposed model performs better than the existing methods. Furthermore, the extraction of features from peptide sequences provides a basis for evolutionary analysis. The study of peptide identification helps in understanding the diversity of species and analyzing their evolutionary history.
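
The two-step selection described above (ANOVA ranking followed by incremental feature selection) can be sketched as follows. This is a simplified illustration under stated assumptions, not the iAIPs code: `anova_f`, `rank_features` and `incremental_selection` are hypothetical names, the data are toy values, and `evaluate` stands in for whatever classifier score (for example random forest AUC) would be used in practice.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ssw == 0:
        return float("inf") if ssb > 0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))


def rank_features(rows, labels):
    """Return feature indices sorted by descending F statistic."""
    n_features = len(rows[0])
    scores = [anova_f([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def incremental_selection(ranking, evaluate):
    """Add features in ranked order and keep the best-scoring prefix."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranking) + 1):
        score = evaluate(ranking[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranking[:best_k], best_score


rows = [[1, 5, 0], [2, 6, 1], [3, 4, 0], [7, 5, 1], [8, 6, 0], [9, 4, 1]]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(rows, labels)
print(ranking)  # [0, 2, 1]
```

Because only prefixes of the ranking are evaluated, the search costs one model fit per feature rather than one per subset, at the price of ignoring subsets that skip a highly ranked feature.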

Research of Red Tide Algae Images Feature Selection Method Based on ReliefF and SBS

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.507.806 ◽

2014 ◽

Vol 507 ◽

pp. 806-809

Author(s):

Shu Fang Li ◽

Qin Jia ◽

Hong Liang

Keyword(s):

Feature Selection ◽

Red Tide ◽

Feature Selection Method ◽

Original Data ◽

Feature Subset ◽

Data Set ◽

Before And After ◽

Optimal Feature Subset ◽

Optimal Feature ◽

Original Feature

To provide a real-time automatic classification method with a high accuracy rate for red tide algae, this paper proposes using ReliefF-SBS for feature selection. The features of the original red tide algae image data set are first analyzed. On this basis, feature selection removes the irrelevant and redundant features from the original feature set to obtain the optimal feature subset and to reduce their impact on classification accuracy. The classification results of the SVM and KNN classifiers before and after feature selection are then compared.
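
The SBS (sequential backward selection) half of such a pipeline can be sketched as below. This is a generic illustration, not the paper's method: the ReliefF ranking stage is omitted, and `sequential_backward_selection`, its `evaluate` callback and the toy scoring function are assumptions.

```python
def sequential_backward_selection(features, evaluate, min_size=1):
    """Greedily drop one feature at a time, tracking the best subset seen.

    evaluate: callable scoring a list of feature names (hypothetical;
              e.g. SVM or KNN validation accuracy).
    """
    current = list(features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > min_size:
        # Try removing each feature and keep the removal that scores highest.
        score, drop = max(
            ((evaluate([f for f in current if f != d]), d) for d in current),
            key=lambda pair: pair[0],
        )
        current.remove(drop)
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy objective: features "a" and "c" are useful, every feature has a cost.
useful = {"a", "c"}
subset, score = sequential_backward_selection(
    ["a", "b", "c", "d"],
    lambda s: len(useful & set(s)) - 0.1 * len(s),
)
print(subset)  # ['a', 'c']
```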

A Multilevel Optimal Feature Selection and Ensemble Learning for a Specific CAD System-Pulmonary Nodule Detection

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.380-384.1593 ◽

2013 ◽

Vol 380-384 ◽

pp. 1593-1599

Author(s):

Hao Yan Guo ◽

Da Zheng Wang

Keyword(s):

Feature Selection ◽

Learning Algorithm ◽

Feature Selection Method ◽

Feature Subset ◽

Cad System ◽

Optimal Feature Selection ◽

Optimal Feature Subset ◽

Feature Selection Approach ◽

Selection Algorithms ◽

Optimal Feature

The traditional motivation behind feature selection algorithms is to find the best subset of features for a task using one particular learning algorithm. However, it has often been found that no single classifier is entirely satisfactory for a particular task. Therefore, how to further improve the performance of these single systems on the basis of the previous optimal feature subset is a very important issue. We investigate the notion of optimal feature selection and present a practical feature selection approach based on the optimal feature subset of a single CAD system, referred to in this paper as the multilevel optimal feature selection (MOFS) method. Through MOFS, we select different optimal feature subsets in order to eliminate redundant or irrelevant features and obtain optimal features.

A Hybrid Feature Selection Method for Effective Data Classification in Data Mining Applications

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2019010101 ◽

2019 ◽

Vol 11 (1) ◽

pp. 1-16

Author(s):

Ilangovan Sangaiya ◽

A. Vincent Antony Kumar

Keyword(s):

Data Mining ◽

Feature Selection ◽

Classification Accuracy ◽

Feature Selection Method ◽

Original Data ◽

Selection Method ◽

Feature Subset ◽

Data Set ◽

Optimal Feature Subset ◽

Optimal Feature

In data mining, feature selection is required to select relevant features and to remove irrelevant features from an original data set based on some evaluation criteria. Filter and wrapper are the two methods commonly used; here, the authors propose a hybrid feature selection method that takes advantage of both. The proposed method uses symmetrical uncertainty and genetic algorithms to select the optimal feature subset. This improves processing time by reducing the dimension of the data set without compromising classification accuracy. The proposed hybrid algorithm is much faster than most existing algorithms and scales well with the data set in terms of selected features, classification accuracy and running time.
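
Symmetrical uncertainty, the filter measure mentioned above, is commonly defined as SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)). The sketch below computes it for discrete variables; it covers only the filter criterion, not the genetic algorithm stage, and the function names are chosen here for illustration.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def symmetrical_uncertainty(xs, ys):
    """SU in [0, 1]: 0 for independent variables, 1 for fully dependent ones."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(xs, ys)))
    mutual_information = hx + hy - joint
    return 2 * mutual_information / (hx + hy)


print(symmetrical_uncertainty([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(symmetrical_uncertainty([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

Continuous features would need to be discretised before this measure applies.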

Modulation Recognition of Digital Multimedia Signal Based on Data Feature Selection

International Journal of Mobile Computing and Multimedia Communications ◽

10.4018/ijmcmc.2017070107 ◽

2017 ◽

Vol 8 (3) ◽

pp. 90-111 ◽

Cited By ~ 2

Author(s):

Hui Wang ◽

Li Li Guo ◽

Yun Lin

Keyword(s):

Feature Selection ◽

Information Entropy ◽

Feature Subset ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Modulation Recognition ◽

Signal Modulation ◽

Digital Multimedia ◽

Optimal Feature Subset ◽

Optimal Feature

Automatic modulation recognition is very important for receiver design in broadband multimedia communication systems, and a reasonable signal feature extraction and selection algorithm is the key technology of digital multimedia signal recognition. In this paper, information entropy is used to extract individual features, namely power spectrum entropy, wavelet energy spectrum entropy, singular spectrum entropy and Renyi entropy. Then, a feature selection algorithm based on distance measurement and Sequential Feature Selection (SFS) is presented to select the optimal feature subset. Finally, a BP neural network is used to classify the signal modulation. The simulation results show that the four different information entropies can be used to classify different signal modulations, and that the feature selection algorithm successfully chooses the optimal feature subset and achieves the best performance.
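
To make the entropy features concrete, the sketch below computes a power spectrum entropy and a Renyi entropy using their common textbook definitions: Shannon entropy of the normalised power spectrum, and H_alpha = log2(sum of p^alpha) / (1 - alpha). Whether the paper uses exactly these definitions is an assumption, and the naive DFT and function names are illustrative only.

```python
import cmath
from math import log2, pi, sin


def power_spectrum(signal):
    """Squared DFT magnitudes for bins 0..n/2 (naive O(n^2) transform)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def shannon_entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * log2(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy of order alpha (alpha != 1) in bits."""
    if alpha == 1:
        raise ValueError("alpha = 1 is the Shannon limit; use shannon_entropy")
    return log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


# A pure tone concentrates power in one bin; an impulse spreads it evenly.
tone = [sin(2 * pi * 4 * t / 32) for t in range(32)]
impulse = [1.0] + [0.0] * 31
print(shannon_entropy(normalize(power_spectrum(tone))))     # close to 0
print(shannon_entropy(normalize(power_spectrum(impulse))))  # log2(17)
```

The contrast between the two signals shows why such entropies can separate modulation types: they summarise how concentrated or spread out the signal's energy is.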

Optimal Feature Subset Selection for Imbalanced Class Data using SMOTE and Binary ALO Algorithm

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.c4734.029320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 344-349

Keyword(s):

Feature Selection ◽

Class Imbalance ◽

Classification Performance ◽

Selection Model ◽

Feature Subset Selection ◽

Feature Subset ◽

Spatial Features ◽

Imbalanced Classes ◽

Optimal Feature Subset ◽

Optimal Feature

Feature selection in multispectral high-dimensional information is a hard machine learning problem because of the imbalanced classes present in the data. Most existing feature selection schemes in the literature ignore the problem of class imbalance by choosing features from the classes with more instances and missing significant features of the classes with fewer instances. In this paper, the SMOTE concept is exploited to produce the required samples from minority classes. A feature selection model is formulated with the objective of reducing the number of features while improving classification performance. The model is based on dimensionality reduction by opting for a subset of relevant spectral, textural and spatial features while eliminating redundant features. Binary ALO is used to solve the feature selection model for optimal selection of features. The proposed ALO-SVM with the wrapper concept is applied to each potential solution obtained during the optimization step. The methodology is tested on a LANDSAT multispectral image.
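
The SMOTE idea used above, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified stand-in rather than the paper's pipeline or a library implementation; `smote`, `sq_dist` and the toy points are assumptions, and the Binary ALO and SVM stages are not shown.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority samples.

    Requires at least two minority samples. Each synthetic point lies on
    the segment between a random sample and one of its k nearest neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_new=3):
    print(point)
```

Oversampling of this kind should be applied to training data only, so that synthetic points never leak into the evaluation set.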

A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression for Imbalanced Learning in Software Defect Prediction

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/5/5 ◽

2020 ◽

Vol 17 (5) ◽

pp. 721-730

Author(s):

Kamal Bashir ◽

Tianrui Li ◽

Mahama Yahaya

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Feature Selection ◽

Maximum Likelihood ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Software Defect ◽

Optimal Feature Subset ◽

Optimal Feature

The most frequently used machine learning feature ranking approaches fail to yield an optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), RelieF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after balancing the class distribution in the training data. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets in their sampled and unsampled forms to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the FS subsets derived from the sampled and unsampled datasets. The performance of the models, measured by the Area Under the Receiver Operating Characteristic Curve (AUC), is compared for all FS methods considered. The Analysis of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the FS techniques, in both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.
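
Since the comparison rests on AUC, a small sketch of how that metric can be computed may help. It uses the standard pairwise interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half); the function name and toy scores are assumptions, and nothing here reproduces the paper's experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. defective), 0 for the negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```

Because AUC depends only on the ranking of scores, it is less sensitive to class imbalance than plain accuracy, which suits defect datasets where defective modules are rare.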

Improving the prediction accuracy in blended learning environment using synthetic minority oversampling technique

Information Discovery and Delivery ◽

10.1108/idd-08-2018-0036 ◽

2019 ◽

Vol 47 (2) ◽

pp. 76-83 ◽

Cited By ~ 2

Author(s):

Gabrijela Dimic ◽

Dejan Rancic ◽

Nemanja Macek ◽

Petar Spalevic ◽

Vida Drasute

Keyword(s):

Feature Selection ◽

Learning Environment ◽

Blended Learning ◽

Prediction Accuracy ◽

Design Methodology ◽

Feature Vector ◽

Feature Selection Method ◽

Feature Subset ◽

Content Type ◽

Correlation Based Feature Selection

Purpose This paper aims to deal with the previously unknown prediction accuracy of students’ activity patterns in a blended learning environment. Design/methodology/approach To extract the most relevant activity feature subset, different feature-selection methods were applied. For subsets of different cardinality, classification models were compared. Findings Experimental evaluation opposes the hypothesis that feature vector dimensionality reduction leads to increased prediction accuracy. Research limitations/implications Improving prediction accuracy in the described learning environment was based on applying the synthetic minority oversampling technique, which affected the results of the correlation-based feature-selection method. Originality/value The major contribution of the research is the proposed methodology for selecting the optimal low-cardinality subset of students’ activities and a significant improvement in prediction accuracy in a blended learning environment.

Effective Evolutionary Multilabel Feature Selection under a Budget Constraint

Complexity ◽

10.1155/2018/3241489 ◽

2018 ◽

Vol 2018 ◽

pp. 1-14 ◽

Cited By ~ 6

Author(s):

Jaesung Lee ◽

Wangduk Seo ◽

Dae-Won Kim

Keyword(s):

Feature Selection ◽

Empirical Studies ◽

Feature Selection Method ◽

Budget Constraint ◽

Feature Subset ◽

Evolutionary Search ◽

Selection Methods ◽

Multilabel Classification ◽

Genetic Search ◽

Real World Datasets

Multilabel feature selection involves the selection of relevant features from multilabeled datasets, resulting in improved multilabel learning accuracy. Evolutionary search-based multilabel feature selection methods have proved useful for identifying a compact feature subset by successfully improving the accuracy of multilabel classification. However, conventional methods frequently violate budget constraints or result in inefficient searches due to ineffective exploration of important features. In this paper, we present an effective evolutionary search-based feature selection method for multilabel classification with a budget constraint. The proposed method employs a novel exploration operation to enhance the search capabilities of a traditional genetic search, resulting in improved multilabel classification. Empirical studies using 20 real-world datasets demonstrate that the proposed method outperforms conventional multilabel feature selection methods.
