Context-Sensitive Attribute Evaluation

Author(s):  
Marko Robnik-Šikonja

Research in machine learning, data mining, and statistics has produced a number of methods that estimate how useful an attribute (feature) is for predicting the target variable. These estimates of attribute utility are subsequently used in many important tasks, e.g., feature subset selection, feature weighting, feature ranking, feature construction, data transformation, decision and regression tree building, data discretization, visualization, and comprehension. These tasks occur frequently in data mining, robotics, and the construction of intelligent systems in general. Most attribute evaluation measures in use are myopic in the sense that they estimate the quality of one feature independently of the context of other features. Such measures are not appropriate for problems that may involve many feature interactions. Measures descended from the Relief algorithm (Kira & Rendell, 1992) take context into account through the distance between instances and are efficient in problems with strong dependencies between attributes.
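
The way Relief uses distance between instances can be illustrated with a minimal sketch. It assumes numeric features, exactly two classes, and at least two instances per class; unlike the original algorithm, which samples instances at random, it visits every instance once so the result is deterministic. The function name `relief_weights` is hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief feature weights (two classes, numeric features).

    Simplification (assumption): every instance is visited once instead of
    drawing a random sample, which keeps the output deterministic.
    """
    n, d = len(X), len(X[0])
    # Per-feature value ranges scale each difference into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        # Manhattan distance over the scaled features.
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))     # nearest same-class
        miss = min(misses, key=lambda k: dist(X[i], X[k]))  # nearest other-class
        for j in range(d):
            # Reward features that separate the classes locally,
            # penalise features that differ between same-class neighbours.
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights
```

On a toy set of all eight binary triples where the class is x0 XOR x1 and x2 is irrelevant, this sketch assigns x2 a negative weight while x0 and x1 receive non-negative weights, even though neither x0 nor x1 is informative on its own.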

2018 ◽  
Vol 8 (8) ◽  
pp. 1369 ◽  
Author(s):  
Alireza Arabameri ◽  
Biswajeet Pradhan ◽  
Hamid Reza Pourghasemi ◽  
Khalil Rezaei ◽  
Norman Kerle

Gully erosion triggers land degradation and restricts the use of land. This study assesses the spatial relationship between gully erosion (GE) and geo-environmental variables (GEVs) using Weights-of-Evidence (WoE) Bayes theory, and then applies three data mining methods, Random Forest (RF), boosted regression tree (BRT), and multivariate adaptive regression spline (MARS), for gully erosion susceptibility mapping (GESM) in the Shahroud watershed, Iran. Gully locations were identified by extensive field surveys, and a total of 172 GE locations were mapped. Twelve gully-related GEVs were selected to model GE: elevation, slope degree, slope aspect, plan curvature, convergence index, topographic wetness index (TWI), lithology, land use/land cover (LU/LC), distance from rivers, distance from roads, drainage density, and NDVI. The variable importance results from the RF and BRT models indicated that distance from roads, elevation, and lithology had the strongest effect on GE occurrence. The area under the curve (AUC) and seed cell area index (SCAI) methods were used to validate the three GE maps. The AUC for the three models ranged from 0.911 to 0.927, and the RF model, with a prediction accuracy of 0.927, compared favourably with the other models according to the SCAI values. The findings will help in planning and developing the studied region.
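
The abstract does not spell out the WoE computation. A minimal sketch of the standard weights-of-evidence formulation is given here, assuming cell counts per factor class are available; the function name and the counts in the test are hypothetical and are not taken from the study.

```python
import math

def weights_of_evidence(class_gully, class_total, all_gully, all_total):
    """Return (W+, W-, contrast) for one class of a conditioning factor.

    B = cell lies in the factor class, D = cell contains a gully.
    Assumes every probability below is strictly between 0 and 1.
    """
    p_b_given_d = class_gully / all_gully
    p_b_given_not_d = (class_total - class_gully) / (all_total - all_gully)
    w_plus = math.log(p_b_given_d / p_b_given_not_d)
    w_minus = math.log((1.0 - p_b_given_d) / (1.0 - p_b_given_not_d))
    # A positive contrast means the class is positively associated with gullies.
    return w_plus, w_minus, w_plus - w_minus
```

A positive W+ indicates that the class occurs more often in gully cells than elsewhere; the contrast W+ minus W- summarises the overall strength of the spatial association.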


Water ◽  
2018 ◽  
Vol 10 (10) ◽  
pp. 1405 ◽  
Author(s):  
Seyed Naghibi ◽  
Mehdi Vafakhah ◽  
Hossein Hashemi ◽  
Biswajeet Pradhan ◽  
Seyed Alavi

Sustainable development goals are difficult to achieve without a proper water resources management strategy. This study applies three state-of-the-art statistical and data mining models, namely weights-of-evidence (WoE), boosted regression trees (BRT), and classification and regression tree (CART), to identify suitable areas for artificial recharge through floodwater spreading (FWS). First, suitable areas for the FWS project were identified in a basin in north-eastern Iran based on the national guidelines and a literature survey. Using the same methodology, an identical number of FWS unsuitable areas were also determined. Afterward, a set of FWS conditioning factors was selected for modeling FWS suitability. The models were trained on 70% of the suitable and unsuitable locations and validated with the remaining 30%. Finally, receiver operating characteristic (ROC) curves were plotted to compare the produced FWS suitability maps. The findings showed acceptable performance of BRT, CART, and WoE for FWS suitability mapping, with areas under the ROC curve of 92%, 87.5%, and 81.6%, respectively. Among the considered variables, transmissivity, distance from rivers, aquifer thickness, and electrical conductivity were the most important contributors to the modeling. The FWS suitability maps produced by the proposed method could be used as a guideline for water resource managers to control flood damage and obtain new sources of groundwater. The methodology could be easily replicated to produce FWS suitability maps in other regions with similar hydrogeological conditions.
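
The area under the ROC curve used for validation can be computed directly from predicted scores on the held-out locations. The sketch below uses the pairwise (Mann-Whitney) definition and assumes labels of 1 for suitable and 0 for unsuitable; the function name is hypothetical, and the study's own tooling is not stated in the abstract.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of validation points, which is acceptable for a few hundred locations; a rank-based implementation would be preferable for larger sets.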


2020 ◽  
Author(s):  
Daniela De Souza Gomes ◽  
Marcos Henrique Fonseca Ribeiro ◽  
Giovanni Ventorim Comarela ◽  
Gabriel Philippe Pereira

High failure rates are a worrying and relevant problem in Brazilian universities. Using a data set of student transcripts, we performed a case study for both general and Computer Science contexts, in which data mining techniques were used to find patterns concerning failures. The knowledge acquired can be used for better educational administration and to build intelligent systems that support students’ decision making.


2016 ◽  
Vol 12 (12) ◽  
pp. 4601-4610 ◽  
Author(s):  
D. Palanikkumar ◽  
S. Priya ◽  
S. Priya

Privacy-preserving data mining applies data mining techniques to databases without violating the privacy of individuals. A sensitive attribute can be selected from the numerical data and altered by a data modification technique, after which the modified data can be released to any agency. If the agency then applies data mining techniques such as clustering or classification, the modification should not affect the results. In the proposed technique, the sensitive data are transformed using an S-shaped fuzzy membership function. K-means clustering is applied to both the original and the modified data to obtain the clusters. t-closeness requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table: the distance between the two distributions, measured with the Earth Mover's Distance (EMD), should be no more than a threshold t. Hence privacy is preserved and the accuracy of the data is maintained.
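
The t-closeness check described above can be sketched for a numeric sensitive attribute. The abstract does not say which ground distance is used, so this sketch assumes the common ordered-distance form of EMD, in which adjacent values of the sorted domain are one step apart; both function names are hypothetical.

```python
from collections import Counter

def ordered_emd(class_values, table_values):
    """EMD between an equivalence class and the whole table.

    Assumption: ordered ground distance over the sorted distinct values of
    the table, normalised so the result lies in [0, 1].
    """
    domain = sorted(set(table_values))
    m = len(domain)
    if m < 2:
        return 0.0
    p, q = Counter(class_values), Counter(table_values)
    n_p, n_q = len(class_values), len(table_values)
    running = 0.0  # cumulative surplus of probability mass so far
    total = 0.0
    for value in domain:
        running += p[value] / n_p - q[value] / n_q
        total += abs(running)
    return total / (m - 1)

def satisfies_t_closeness(equivalence_classes, table_values, t):
    """True if every equivalence class is within distance t of the table."""
    return all(ordered_emd(c, table_values) <= t for c in equivalence_classes)
```

A class whose sensitive values are concentrated at one end of the range yields a larger distance than a class spread across the range, which is the behaviour the threshold t is meant to bound.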


2021 ◽  
Vol 35 (3) ◽  
pp. 209-215
Author(s):  
Pratibha Verma ◽  
Vineet Kumar Awasthi ◽  
Sanat Kumar Sahu

Data mining techniques are combined with ensemble learning and deep learning for classification. The classification methods used are a single C5.0 tree (C5.0), Classification and Regression Tree (CART), a Support Vector Machine (SVM) with a linear kernel, an ensemble of CART, SVM, and C5.0, a single-hidden-layer neural network (NN), a neural network with Principal Component Analysis (PCA-NN), the deep learning-based H2OBinomialModel-Deeplearning (HBM-DNN), and Enhanced H2OBinomialModel-Deeplearning (EHBM-DNN). Experiments were conducted on pre-processed datasets using R and 10-fold cross-validation. The findings show that the ensemble model (CART, SVM, and C5.0) and EHBM-DNN classify more accurately than the other methods.


Author(s):  
Chang-Hwan Lee

In spite of its simplicity, naive Bayesian learning has been widely used in many data mining applications. However, the unrealistic assumption that all features are equally important negatively impacts the performance of naive Bayesian learning. In this paper, we propose a new method that uses a Kullback–Leibler measure to calculate the weights of the features analyzed in naive Bayesian learning. Its performance is compared to that of other state-of-the-art methods over a number of datasets.
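
The abstract does not give the weighting formula. As an illustration only, and not the paper's method, the sketch below assumes one plausible reading: a feature's weight is the expected Kullback-Leibler divergence between the class distribution conditioned on the feature value and the class prior. The function name is hypothetical.

```python
import math
from collections import Counter, defaultdict

def kl_feature_weight(values, labels):
    """Expected KL(P(class | value) || P(class)) over a feature's values, in bits.

    Assumption: this is an illustrative weighting, not the formula from the paper.
    """
    n = len(labels)
    prior = {c: k / n for c, k in Counter(labels).items()}
    labels_by_value = defaultdict(list)
    for v, c in zip(values, labels):
        labels_by_value[v].append(c)
    weight = 0.0
    for group in labels_by_value.values():
        size = len(group)
        kl = 0.0
        for c, k in Counter(group).items():
            posterior = k / size
            kl += posterior * math.log2(posterior / prior[c])
        weight += (size / n) * kl  # weight each value by how often it occurs
    return weight
```

Under this assumption a feature whose values leave the class distribution unchanged gets weight 0, while a feature that determines a balanced binary class gets weight 1 bit; such weights could then scale each feature's contribution in the naive Bayesian log-likelihood.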

