Applications of Feature Engineering Techniques for Text Data

Author(s):  
Shashwati Mishra ◽  
Mrutyunjaya Panda

Features play a very important role in the analysis and prediction of data, as they carry the most valuable information about the data, which may be in a structured or an unstructured format. The feature engineering process is used to extract features from such data, and feature selection is one of its crucial steps. Feature selection can adopt four different approaches and, on that basis, can be classified into four basic categories: the filter method, the wrapper method, the embedded method, and the hybrid method. This chapter discusses the techniques falling under these four categories, along with related research work on feature selection.
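A minimal sketch of how such a filter-style selection might look for text features, using scikit-learn; the toy corpus, labels, and the choice of the chi-squared criterion are illustrative assumptions, not the chapter's own material.

```python
# Minimal sketch of a filter-style feature selection pipeline for text data.
# The tiny corpus and labels below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "stock prices fell sharply today",
    "the team won the final match",
    "markets rallied after the earnings report",
    "the striker scored twice in the second half",
]
labels = [0, 1, 0, 1]  # 0 = finance, 1 = sports (hypothetical)

# Feature extraction: turn raw text into a sparse TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Filter method: rank features by the chi-squared statistic and keep the top k.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)

kept = [vectorizer.get_feature_names_out()[i] for i in selector.get_support(indices=True)]
print("selected terms:", kept)
```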

Author(s):  
Arvind Kumar Tiwari

Feature selection is an important topic in data mining, especially for high-dimensional datasets. It is a process commonly used in machine learning, wherein a subset of the features available in the data is selected for application of a learning algorithm; the best subset contains the smallest number of dimensions that contribute most to accuracy. Feature selection methods can be decomposed into three main classes: filter methods, wrapper methods, and embedded methods. This chapter presents an empirical comparison of feature selection methods and their algorithms. In view of the substantial number of existing feature selection algorithms, criteria are needed for deciding which algorithm to use in a given situation. The chapter reviews several fundamental algorithms found in the literature and assesses their performance in a controlled scenario.
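As an illustration of the filter/wrapper contrast such a comparison rests on, the hedged sketch below scores features independently (filter) and then selects them greedily with a learner in the loop (wrapper); the synthetic dataset, the KNN learner, and the subset sizes are assumptions for demonstration only.

```python
# Minimal sketch contrasting a filter and a wrapper selector on the same data;
# dataset and settings are illustrative, not from the chapter's experiments.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature independently (ANOVA F-value), keep the top 5.
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Wrapper: greedily add features that most improve the KNN's cross-validated accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
wrapper = SequentialFeatureSelector(knn, n_features_to_select=5, direction="forward", cv=5)
X_wrapper = wrapper.fit_transform(X, y)

for name, X_sel in [("filter", X_filter), ("wrapper", X_wrapper)]:
    acc = cross_val_score(knn, X_sel, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```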


Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1666
Author(s):  
Muataz Salam Al-Daweri ◽  
Khairul Akram Zainol Ariffin ◽  
Salwani Abdullah ◽  
Mohamad Firham Efendy Md. Senan

The significant increase in technology development over the internet makes network security a crucial issue. An intrusion detection system (IDS) should therefore be introduced to protect networks from various attacks. Even with the growing amount of work in IDS research, there is a lack of studies that analyze the available IDS datasets. Therefore, this study presents a comprehensive analysis of the relevance of the features in the KDD99 and UNSW-NB15 datasets. Three methods were employed: rough-set theory (RST), a back-propagation neural network (BPNN), and a discrete variant of the cuttlefish algorithm (D-CFA). First, the dependency ratio between the features and the classes was calculated using RST. Second, each feature in the datasets became an input to the BPNN, to measure its ability in a classification task concerning each class. Third, a feature-selection process was carried out over multiple runs, to record how frequently each feature was selected. The results indicated that some features in the KDD99 dataset could be used to achieve a classification accuracy above 84%. Moreover, a few features in both datasets were found to contribute strongly to classification performance: they appeared in combinations of features that resulted in high accuracy and were also frequently selected during the feature selection process. The findings of this study are anticipated to help cybersecurity researchers create lightweight and accurate IDS models with a smaller number of features for developing technologies.
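The sketch below illustrates, in a hedged way, two of the analyses described: scoring each feature's individual classification ability and counting how often features are selected across repeated runs. A synthetic dataset stands in for KDD99/UNSW-NB15, a simple decision tree stands in for the BPNN, and mutual-information selection stands in for the D-CFA; none of the paper's actual settings are reproduced.

```python
# Illustrative sketch: (1) score each feature's individual classification ability,
# (2) count how often each feature is chosen across repeated selection runs.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)

# (1) Per-feature ability: train a simple classifier on one feature at a time
# (a decision tree stands in here for the paper's BPNN).
for j in range(X.shape[1]):
    acc = cross_val_score(DecisionTreeClassifier(max_depth=3), X[:, [j]], y, cv=3).mean()
    print(f"feature {j}: single-feature accuracy = {acc:.2f}")

# (2) Selection frequency over multiple runs; bootstrap resamples with a
# mutual-information selector stand in for the paper's repeated D-CFA runs.
counts = Counter()
rng = np.random.default_rng(0)
for run in range(20):
    idx = rng.choice(len(y), size=len(y), replace=True)
    selector = SelectKBest(mutual_info_classif, k=5).fit(X[idx], y[idx])
    counts.update(selector.get_support(indices=True).tolist())
print("selection frequency per feature:", dict(counts))
```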


Author(s):  
U.A. Nuralieva ◽  
A.A. Baisabyrova ◽  
G.A. Moldakhmetova ◽  
K.A. Temirbayeva ◽  
R.Zh. Shimelkova ◽  
...  

One of the ways to intensify the production of beekeeping products is selection. Bee breeding is not only one of the most important methods, but also the most economically efficient way, to increase the productivity of bee colonies. Thus, the selection of bees and the implementation of its achievements in production are among the most important and most effective directions for intensifying beekeeping. The research was carried out under a program-targeted financing project of the Ministry of Education and Science of the Republic of Kazakhstan on the topic "Development of technologies for effective management of the selection process in beekeeping." This article examines the morphometric indicators of honeybees in the Almaty region of the Republic of Kazakhstan. The material for the research consisted of specimens of worker bees from apiaries located in the Almaty region: the Devochkin farm, the Panov farm, the Kalinin Individual Entrepreneur, the Adilgazy Individual Entrepreneur, and the Kashkimbaev farm. Following the method of A.B. Kartashev, 35 bee samples were processed. Changes in wing parameters, including the cubital and dumbbell indices and discoidal displacement, were considered for the bee breeds Central Russian, Carpathian, Italian, and Carniolan honey bee. It was found that in Kalinin's apiary the average value of the morphometric indicator for the cubital index was 2.787%; as a result, the morphometric index for the cubital index in the bees of the Kalinin Individual Entrepreneur was 2.777%, whereas in the other farms the average values were significantly lower for all indicators. Accordingly, the percentage for the cubital index was 7.42-17.36%, the dumbbell index was 6.77-11.81%, and the discoidal displacement was 32.91-47.37%. By all indicators, the Kalinin Individual Entrepreneur's bee farm is superior to the other bee farms in terms of morphometric data. This is due to the isolation of the apiary and its remoteness from other bees, which ensures a low level of hybridization. The presented analysis of breed composition across the entire apiary, as well as of economically useful traits, can significantly increase the efficiency of selection work in beekeeping.


2021 ◽  
Vol 16 ◽  
Author(s):  
Chaokun Yan ◽  
Mengyuan Li ◽  
Jingjing Ma ◽  
Yi Liao ◽  
Huimin Luo ◽  
...  

Background: The massive amount of biomedical data accumulated in the past decades can be utilized for diagnosing disease. Objective: However, its high dimensionality, small sample sizes, and irrelevant features often have a negative influence on the accuracy and speed of disease prediction. Some existing machine learning models cannot capture the patterns in these datasets accurately without feature selection. Methods: Filter and wrapper are two prevailing feature selection methods. The filter method is fast but has low prediction accuracy, while the wrapper method can obtain high accuracy but has a formidable computational cost. Given the drawbacks of using either individually, a novel feature selection method, called MRMR-EFPATS, is proposed, which hybridizes the minimum redundancy maximum relevance (MRMR) filter method with a wrapper method based on an improved flower pollination algorithm (FPA). First, MRMR is employed to rank and screen out important features quickly. These features are then used to initialize the population individuals of the subsequent wrapper method, for faster convergence and less computational time. Then, owing to its efficiency and flexibility, FPA is adopted to further discover an optimal feature subset. Result: FPA still has some drawbacks, such as a slow convergence rate, inadequate exploration of new solutions, and a tendency to become trapped in local optima. In our work, an elite strategy is adopted to improve the convergence speed of the FPA, while tabu search and adaptive Gaussian mutation are employed to improve its search capability and escape from local optima. Here, the KNN classifier with 5-fold cross-validation is utilized to evaluate classification accuracy. Conclusion: Extensive experimental results on six public high-dimensional biomedical datasets show that the proposed MRMR-EFPATS achieves superior performance compared with other state-of-the-art methods.
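A compact sketch of the filter-then-wrapper structure described above: a relevance ranking pre-filters candidates, and a randomized wrapper search, evaluated by KNN with 5-fold cross-validation, refines the subset. The random search merely stands in for the improved FPA (the elite strategy, tabu search, and adaptive Gaussian mutation are not reproduced), and the dataset and sizes are illustrative assumptions.

```python
# Hybrid filter-then-wrapper sketch: mutual-information ranking as a relevance
# pre-filter, then a random subset search evaluated by 5-fold CV accuracy of KNN.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=8, random_state=1)

# Filter stage: rank features by mutual information and keep the top 15 candidates
# (full MRMR would additionally penalize redundancy among the kept features).
relevance = mutual_info_classif(X, y, random_state=1)
candidates = np.argsort(relevance)[::-1][:15]

def fitness(subset):
    # Wrapper criterion: 5-fold CV accuracy of KNN on the candidate subset.
    return cross_val_score(KNeighborsClassifier(5), X[:, subset], y, cv=5).mean()

# Wrapper stage: random search over subsets of the candidates, keeping the best.
rng = np.random.default_rng(1)
best_subset, best_score = None, -np.inf
for _ in range(30):
    subset = rng.choice(candidates, size=6, replace=False)
    score = fitness(subset)
    if score > best_score:
        best_subset, best_score = subset, score

print("best subset:", sorted(best_subset.tolist()), "CV accuracy:", round(best_score, 3))
```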


2006 ◽  
Vol 16 (05) ◽  
pp. 341-352 ◽  
Author(s):  
XIN ZHOU ◽  
K. Z. MAO

Microarray data contain a large number of genes (usually more than 1000) and a relatively small number of samples (usually fewer than 100), which presents problems for discriminant analysis of microarray data. One way to alleviate the problem is to reduce the dimensionality of the data by selecting genes important to the discriminant problem. Gene selection can be cast as a feature selection problem in the context of pattern classification. Feature selection approaches are broadly grouped into filter methods and wrapper methods. The wrapper method outperforms the filter method, but at the cost of more intensive computation. In the present study, we propose a wrapper-like gene selection algorithm based on the regularization network. Compared with the classical wrapper method, the computational cost of our gene selection algorithm is significantly reduced, because the evaluation criterion we propose does not demand repeated training in the leave-one-out procedure.
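To illustrate the kind of shortcut that makes such a criterion cheap, the sketch below uses the standard closed-form leave-one-out identity for ridge (regularization-network-style) models, which recovers all held-out residuals from a single fit; this is a textbook identity shown for illustration, not the authors' exact gene-selection criterion, and the synthetic data are assumptions.

```python
# Closed-form leave-one-out residuals for ridge regression from one fit,
# verified against explicit retraining; no repeated training is required.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 10, 1.0          # few samples, ridge penalty (microarray-like shape)
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Single fit: hat matrix H maps y to fitted values for ridge regression.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
resid = y - H @ y
loo_closed = resid / (1.0 - np.diag(H))     # closed-form leave-one-out residuals

# Brute-force check: retrain n times, leaving one sample out each time.
loo_brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    w = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(d), X[mask].T @ y[mask])
    loo_brute[i] = y[i] - X[i] @ w

print("max difference between the two:", np.max(np.abs(loo_closed - loo_brute)))
```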


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Sabina Tangaro ◽  
Nicola Amoroso ◽  
Massimo Brescia ◽  
Stefano Cavuoti ◽  
Andrea Chincarini ◽  
...  

Neurodegenerative diseases are frequently associated with structural changes in the brain. Magnetic resonance imaging (MRI) scans can show these variations and can therefore be used as a supportive feature for a number of neurodegenerative diseases. The hippocampus is a known biomarker for Alzheimer's disease and other neurological and psychiatric diseases; exploiting it, however, requires accurate, robust, and reproducible delineation of hippocampal structures. Fully automatic methods usually follow a voxel-based approach, in which a number of local features are calculated for each voxel. In this paper, we compared four different techniques for feature selection from a set of 315 features extracted for each voxel: (i) a filter method based on the Kolmogorov-Smirnov test; two wrapper methods, namely (ii) sequential forward selection and (iii) sequential backward elimination; and (iv) an embedded method based on the Random Forest classifier. The methods were trained on a set of 10 T1-weighted brain MRIs and tested on an independent set of 25 subjects, and the resulting segmentations were compared with manual reference labelling. By using only 23 features for each voxel (sequential backward elimination), we obtained state-of-the-art performance comparable to that of the standard tool FreeSurfer.
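A small illustrative sketch of two of the four compared strategies: a Kolmogorov-Smirnov filter ranking and an embedded Random Forest importance ranking. Synthetic features stand in for the voxel features, and the MRI data and the sequential forward/backward wrappers are not reproduced.

```python
# Compare a KS-test filter ranking with an embedded Random Forest ranking
# on synthetic "voxel" features (illustrative stand-in for the MRI data).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=30, n_informative=6, random_state=0)

# Filter: KS statistic between the class-conditional distributions of each feature.
ks_scores = np.array(
    [ks_2samp(X[y == 0, j], X[y == 1, j]).statistic for j in range(X.shape[1])]
)

# Embedded: impurity-based importances from a fitted Random Forest.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

top_ks = np.argsort(ks_scores)[::-1][:5]
top_rf = np.argsort(rf.feature_importances_)[::-1][:5]
print("top 5 by KS filter:        ", top_ks.tolist())
print("top 5 by forest importance:", top_rf.tolist())
```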


With the rapid progress of research in various fields, the selection of research proposals has become an important process in many research funding agencies and organizations. When a small number of research proposals is received, it is easy to cluster them and the selection process is straightforward; when the number of proposals grows, clustering and selecting them becomes complicated. In the current system, grouping of proposals is done manually or according to their similarity in subject disciplines, which yields irrelevant results in some cases. The main goal of this research work is to develop an enhanced system for the selection of research proposals based on domain ontology, where the ontology acts as the search criterion for the topics of research proposals. The proposed system will help select the topics of research proposals in a systematic way without manual intervention. In this paper, an algorithm called Scikit-learn K-means Multiclass Document Clustering (SKMDC) is proposed to group each subject discipline according to its sub-topics and sub-domains. Here, the k-means clustering technique is applied to categorical data. Since categorical data cannot be fed directly into the k-means algorithm, the LabelEncoder method is used to encode the text data to numerical values, and the dimensions of the dataset are reduced using Principal Component Analysis. This paper also overcomes the weakness of the k-means technique of requiring the number of clusters to be specified in advance: the optimal number of clusters is determined using the elbow curve method and cross-validated through silhouette score analysis.
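A hedged sketch of the clustering pipeline described above: categorical proposal attributes are label-encoded, reduced with PCA, and clustered with k-means, comparing candidate cluster counts through inertia (as the elbow method would) and silhouette scores. The tiny proposal table, column choices, and parameter values are invented for illustration and are not the paper's data.

```python
# Label-encode categorical proposal attributes, reduce with PCA, cluster with
# k-means, and compare cluster counts via inertia and silhouette score.
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

proposals = [
    ("computer science", "data mining", "classification"),
    ("computer science", "machine learning", "deep learning"),
    ("biology", "genomics", "sequencing"),
    ("biology", "ecology", "field study"),
    ("physics", "optics", "lasers"),
    ("physics", "astronomy", "surveys"),
]

# Encode each categorical column to integers, then reduce to 2 principal components.
columns = list(zip(*proposals))
encoded = np.column_stack([LabelEncoder().fit_transform(col) for col in columns])
reduced = PCA(n_components=2).fit_transform(encoded.astype(float))

# Compare candidate cluster counts; the elbow method inspects inertia the same way.
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(reduced)
    sil = silhouette_score(reduced, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.2f}, silhouette={sil:.2f}")
```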


2019 ◽  
Vol 22 (1) ◽  
pp. 44-48
Author(s):  
Syukriyanto Latif

The purpose of this research is to determine the dimension-reduction parameter value for feature selection so as to improve accuracy and reduce computation time. The system uses text mining technology, which extracts text data to find information from a set of documents. Word weighting and the term reduction technique term frequency thresholding are used in the feature selection process, while the classification process uses the Naive Bayes algorithm. The journal abstracts are categorized into three classes, namely Data Mining (DM), Intelligent Transport System (ITS), and Multimedia (MM). The total number of test data and training data is 150. The best classification results are obtained when the dimension-reduction parameter value is 30%; under that condition, an average accuracy of 87.33% is obtained with a computation time of 4 minutes 12 seconds.
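The following sketch illustrates the pipeline in outline: bag-of-words counts, a term frequency threshold that prunes rare terms as a simple stand-in for the paper's dimension-reduction step, and a Naive Bayes classifier. The corpus, categories, and threshold value are invented; the 150 abstracts and the 30% setting are not reproduced.

```python
# Bag-of-words counts, a term frequency threshold to drop rare terms, and a
# Naive Bayes classifier on a toy three-class corpus (DM / ITS / MM style).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "mining frequent patterns from transaction data",
    "data mining and clustering of large datasets",
    "traffic prediction for intelligent transport systems",
    "intelligent transport and traffic sensor networks",
    "multimedia video and audio streaming",
    "multimedia image and video compression",
]
labels = ["DM", "DM", "ITS", "ITS", "MM", "MM"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()

# Term frequency thresholding: keep only terms whose total count meets the threshold.
threshold = 2
keep = X.sum(axis=0) >= threshold
X_reduced = X[:, keep]
print(f"terms kept after thresholding: {keep.sum()} of {X.shape[1]}")

model = MultinomialNB().fit(X_reduced, labels)
print("training accuracy:", model.score(X_reduced, labels))
```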


2012 ◽  
Vol 3 (3) ◽  
pp. 359-364
Author(s):  
Manish Rai ◽  
Rekha Pandit ◽  
Vineet Richhariya

A multi-class miner addresses the problems of feature evaluation, data drift, and concept evolution in stream data classification. Stream data classification in a multi-class miner is based on an ensemble technique of clustering and classification built on a feature evaluation technique. The feature evaluation technique faces the problem of correctly selecting cluster centre points for the data grouping process. For proper selection of feature points, we use an optimization technique for the feature selection process, based on an advanced genetic algorithm (AGA). The advanced genetic algorithm uses the feature points for neighbour-class detection in order to find the correct points for classification. Our proposed algorithm was tested on well-known datasets provided by the UCI machine learning repository. Our empirical evaluation shows better results in comparison with the multi-class miner for stream data classification.
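A bare-bones genetic-algorithm feature-selection sketch in the spirit of the approach above; the paper's advanced GA operators and the ensemble stream-mining framework are not reproduced, and the dataset, population size, and mutation/crossover rates are arbitrary choices for illustration.

```python
# Simple genetic algorithm over binary feature masks; fitness is the
# cross-validated accuracy of a KNN classifier on the selected features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=25, n_informative=6, random_state=2)
rng = np.random.default_rng(2)

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()

# Initial population of random binary feature masks.
pop = rng.random((20, X.shape[1])) < 0.3

for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    # Tournament selection: repeatedly keep the better of two random individuals.
    parents = np.array([pop[max(rng.integers(0, len(pop), 2), key=lambda i: scores[i])]
                        for _ in range(len(pop))])
    # One-point crossover between consecutive parents.
    children = parents.copy()
    for i in range(0, len(children) - 1, 2):
        cut = rng.integers(1, X.shape[1])
        children[i, cut:] = parents[i + 1, cut:]
        children[i + 1, cut:] = parents[i, cut:]
    # Bit-flip mutation with a small probability.
    children ^= rng.random(children.shape) < 0.02
    pop = children

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best).tolist())
```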

