An Ensemble Voted Feature Selection Technique for Predictive Modeling of Malwares of Android

Author(s):  
Abhishek Bhattacharya ◽  
Radha Tamal Goswami ◽  
Kuntal Mukherjee ◽  
Nhu Gia Nguyen

Each Android application requires a collection of permissions at installation time, and these permissions can be used as features in permission-based identification of Android malware. Recently, ensemble feature selection techniques have received increasing attention over conventional techniques in different applications. In this work, a cluster-based voted ensemble feature selection technique, combining five base wrapper approaches from R libraries, is proposed for identifying the most prominent set of features in the predictive modeling of Android malware. The proposed method preserves both desirable properties of an ensemble feature selector: accuracy and diversity. Moreover, five different data partitioning ratios are considered, and their impact on the predictive model is measured using the coefficient of determination (r-square) and root mean square error. The proposed strategy produces significantly better outcomes in terms of the number of selected features and classification accuracy.
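The voting step of such an ensemble can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation; the permission names and the vote threshold are hypothetical):

```python
from collections import Counter

def vote_select(selections, min_votes):
    """Majority-vote ensemble: keep features chosen by at least
    min_votes of the base selectors.

    selections: list of feature-name sets, one per base selector.
    """
    votes = Counter(f for sel in selections for f in sel)
    return {f for f, v in votes.items() if v >= min_votes}

# Hypothetical permission features chosen by three base wrappers
base = [
    {"SEND_SMS", "READ_CONTACTS", "INTERNET"},
    {"SEND_SMS", "INTERNET", "CAMERA"},
    {"SEND_SMS", "READ_CONTACTS"},
]
selected = vote_select(base, min_votes=2)
# selected == {"SEND_SMS", "READ_CONTACTS", "INTERNET"}
```

Raising the threshold trades diversity for consensus: with `min_votes=3` only the unanimously chosen feature survives.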

Electronics ◽  
2021 ◽  
Vol 10 (17) ◽  
pp. 2099
Author(s):  
Paweł Ziemba ◽  
Jarosław Becker ◽  
Aneta Becker ◽  
Aleksandra Radomska-Zalas ◽  
Mateusz Pawluk ◽  
...  

One of the important research problems in the context of financial institutions is the assessment of credit risk and the decision of whether to grant or refuse a loan. Recently, machine learning based methods have been increasingly employed to solve such problems. However, the selection of an appropriate feature selection technique, sampling mechanism, and/or classifier for credit decision support is very challenging and can affect the quality of the loan recommendations. To address this challenging task, this article examines the effectiveness of various data science techniques in credit decision support. In particular, a processing pipeline was designed, consisting of methods for data resampling, feature discretization, feature selection, and binary classification. We suggest building appropriate decision models leveraging pertinent methods for binary classification, feature selection, as well as data resampling and feature discretization. The feasibility of the selected models was analyzed through rigorous experiments on real data describing clients' ability to repay loans. During the experiments, we analyzed the impact of feature selection on the results of binary classification, and the impact of data resampling with feature discretization on the results of feature selection and binary classification. After experimental evaluation, we found that the correlation-based feature selection technique and the random forest classifier yield superior performance in solving the underlying problem.
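The correlation-based feature selection step named above can be sketched as follows (a minimal pure-Python illustration ranking features by absolute Pearson correlation with the target; the feature names and toy data are hypothetical, and the article's actual CFS method may differ):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_by_correlation(features, target, k):
    """Rank features by |correlation| with the target, keep the top k."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:k]

# Toy data: "income" tracks repayment more strongly than "noise"
features = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise":  [2.0, 2.0, 2.0, 2.0, 2.1],
}
target = [0, 0, 1, 1, 1]
select_by_correlation(features, target, k=1)  # ["income"]
```

The selected subset would then feed a downstream classifier such as the random forest the article found effective.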


2020 ◽  
Vol 17 (9) ◽  
pp. 4106-4110
Author(s):  
Mausumi Goswami ◽  
B. S. Purkayastha

Unstructured data is utilized in many major applications; an estimated 80% of the data generated by various business applications is unstructured. Unstructured data cannot be directly processed to generate information. Major applications which use AI include recommendation systems, sentiment analysis of customers' emotions, finding duplicate content through plagiarism detection, and document organization based on requirements. Sources of such data include unstructured text on the World Wide Web, sensor data, digital images, videos, sound, results of scientific experiments, and user profiles for marketing. Information retrieval from huge text datasets is quite challenging; this is caused by the various characteristics of natural languages and is a major concern in text mining. Before computational techniques can be applied to documents, the documents must be made ready for processing. Document preprocessing is one such method applied to text documents, and it plays a vital role in document grouping. In this paper, four feature selection techniques are implemented and empirical investigation results are included. The grouping outcomes are evaluated to assess the effectiveness of each feature selection technique.
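A minimal sketch of the kind of document preprocessing described, assuming a toy stopword list and a regex tokenizer (not the paper's actual pipeline):

```python
import re

# Hypothetical minimal stopword list for illustration
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(doc):
    """Lowercase, tokenize, and drop stopwords: a typical first step
    before feature selection and document grouping."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS]

preprocess("The quick detection of duplicate content is a task in text mining")
# ['quick', 'detection', 'duplicate', 'content', 'task', 'text', 'mining']
```

Only after such cleaning can feature selection meaningfully rank terms for clustering.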


Author(s):  
Vaishali Arya ◽  
Rashmi Agrawal

Aims: Feature selection techniques for text data composed of heterogeneous sources for sentiment classification. Objectives: The objective of this work is to analyze feature selection techniques for text gathered from different sources in order to increase the accuracy of sentiment classification on microblogs. Methods: Three feature selection techniques, Bag-of-Words (BOW), TF-IDF, and word2vec, were applied to find the most suitable technique for heterogeneous datasets. Results: TF-IDF outperforms the other two selected feature selection techniques for sentiment classification with an SVM classifier. Conclusion: Feature selection is an integral part of any data preprocessing task, and it is also important for machine learning algorithms to achieve good classification accuracy. Hence it is essential to find the most suitable approach for heterogeneous sources of data. Heterogeneous sources are rich sources of information, and they also play an important role in developing models for adaptable systems. With that in mind, we compared the three techniques on heterogeneous source data and found that TF-IDF is the most suitable one for all types of data, whether balanced or imbalanced, single source or multiple sources. In all cases, the TF-IDF approach is the most promising for classifying user sentiment.
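The TF-IDF weighting that performed best can be sketched in a few lines (a minimal smoothed-TF-IDF illustration on toy tokenized documents; the smoothing convention and the example tokens are assumptions, not the authors' exact formulation):

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF weights for each tokenized document:
    tf * (log((1 + N) / (1 + df)) + 1), where df is document frequency."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights

docs = [["good", "phone"], ["bad", "phone"], ["good", "battery"]]
w = tfidf(docs)
# In doc 1, the rarer term "bad" outweighs the common term "phone"
```

Each document's weight vector would then be fed to an SVM classifier for sentiment prediction.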


2013 ◽  
Vol 22 (05) ◽  
pp. 1360010 ◽  
Author(s):  
HUANJING WANG ◽  
TAGHI M. KHOSHGOFTAAR ◽  
QIANHUI (ALTHEA) LIANG

Software metrics (features or attributes) are collected during the software development cycle. Metric selection is one of the most important preprocessing steps in building defect prediction models and may improve the final prediction result. However, the addition or removal of program modules (instances or samples) can alter the subsets chosen by a feature selection technique, rendering previously selected feature sets invalid. Very limited research has been done considering both stability (or robustness) and defect prediction model performance together in the software engineering domain, despite the importance of both aspects when choosing a feature selection technique. In this paper, we test the stability and classification model performance of eighteen feature selection techniques as the magnitude of change to the datasets and the size of the selected feature subsets are varied. All experiments were conducted on sixteen datasets from three real-world software projects. The experimental results demonstrate that Gain Ratio shows the least stability, while two different versions of ReliefF show the most stability, followed by the PRC- and AUC-based threshold-based feature selection techniques. Results also show that the signal-to-noise ranker performed moderately in terms of robustness and was the best ranker in terms of model performance. Finally, we conclude that while stability and classification performance are correlated for some rankers, this is not true for others, and therefore performance according to one scheme (stability or model performance) cannot be used to predict performance according to the other.
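A common way to quantify this kind of stability is the average pairwise similarity of the feature subsets a ranker selects across perturbed versions of a dataset. A minimal Jaccard-based sketch (the stability measure used in the paper may differ, and the metric names below are hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity of two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(subsets):
    """Average pairwise Jaccard similarity of the subsets selected on
    perturbed datasets; 1.0 means perfectly stable selection."""
    pairs = [(i, j) for i in range(len(subsets))
                    for j in range(i + 1, len(subsets))]
    return sum(jaccard(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)

# Hypothetical subsets chosen by one ranker on three perturbed datasets
runs = [{"loc", "wmc", "cbo"}, {"loc", "wmc", "rfc"}, {"loc", "wmc", "cbo"}]
stability(runs)  # (0.5 + 1.0 + 0.5) / 3 = 2/3
```

A ranker like Gain Ratio would score low on this measure across perturbations, while a stable ranker like ReliefF would score near 1.0.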


Author(s):  
Hua Tang ◽  
Chunmei Zhang ◽  
Rong Chen ◽  
Po Huang ◽  
Chenggang Duan ◽  
...  
