Empirical study on the effect of using synthetic attributes on classification algorithms

Purpose: This paper presents an empirical study of the effect of two synthetic attributes on popular classification algorithms applied to data originating from student transcripts. The attributes represent past performance achievements in a course and are defined as global performance (GP) and local performance (LP). The GP of a course is an aggregated performance achieved by all students who have taken the course, and the LP of a course is an aggregated performance achieved in its prerequisite courses by the student taking the course. Design/methodology/approach: The paper uses Educational Data Mining techniques to predict student performance in courses. It identifies the attributes that most influence the prediction of the final grade (performance) and reports the effect of the two suggested attributes on the classification algorithms. As a research paradigm, the paper follows the Cross-Industry Standard Process for Data Mining, using the RapidMiner Studio software tool. Six classification algorithms are evaluated: C4.5 and CART decision trees, Naive Bayes, k-nearest neighbors, rule-based induction and support vector machines. Findings: The results show that the synthetic attributes improved the performance of the classification algorithms and were highly ranked according to their influence on the target variable. Originality/value: This paper proposes two synthetic attributes that are integrated into a real data set. The key motivation is to improve the quality of the data and make classification algorithms perform better. The paper also presents empirical results showing the effect of these attributes on selected classification algorithms.
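
The GP and LP definitions above can be illustrated with a minimal sketch. The abstract does not say which aggregation is used, so the mean is an assumption here, and the record layout, course codes and prerequisite map are all hypothetical.

```python
from statistics import mean

# Hypothetical transcript records: (student, course, grade points on a 0-4 scale).
RECORDS = [
    ("s1", "MATH101", 3.0),
    ("s1", "CS101", 4.0),
    ("s2", "MATH101", 2.0),
    ("s2", "CS101", 3.0),
    ("s3", "CS101", 2.0),
]

# Hypothetical prerequisite map: course -> list of prerequisite courses.
PREREQS = {"CS201": ["MATH101", "CS101"]}


def global_performance(records, course):
    """GP: aggregate (assumed: mean) of all grades earned in `course` by any student."""
    grades = [g for _, c, g in records if c == course]
    return mean(grades) if grades else None


def local_performance(records, prereqs, student, course):
    """LP: aggregate (assumed: mean) of `student`'s grades in the prerequisites of `course`."""
    wanted = set(prereqs.get(course, []))
    grades = [g for s, c, g in records if s == student and c in wanted]
    return mean(grades) if grades else None


if __name__ == "__main__":
    print(global_performance(RECORDS, "CS101"))
    print(local_performance(RECORDS, PREREQS, "s1", "CS201"))
```

Both values would then be appended as extra columns to each (student, course) row before training a classifier; returning `None` when no history exists leaves the choice of missing-value handling to the caller.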

Educational Data Classification and prediction using Data Mining Algorithms

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c6457.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 8674-8678 ◽

Cited By ~ 1

Keyword(s):

Data Mining ◽

Random Forest ◽

Predictive Modeling ◽

Research Work ◽

Logical Structure ◽

Individual Performance ◽

Mining Machine ◽

Support Vector ◽

Classification Algorithms ◽

Data Set

Data mining is the process of extracting interesting patterns from huge data sets and converting them into a logical structure for further analysis. Predictive modeling uses data mining, machine learning and probability methods to make forecasts. Engineering is the most widely accepted stream of education in India, yet students are often uncertain about which engineering department to join. It is important to improve individual performance and to help students make a well-informed choice of department. In this paper, hidden information in the enrollment details recorded during the admission process is used to resolve students' uncertainty in their choice of department. In addition, teachers need to analyze the performance of alumnae to get a clear idea of the prospects of current students. Our main goal is to address these problems using predictive modeling. We focus on three classification algorithms: support vector machine, Random Forest and Naïve Bayes. Data were collected, normalized and applied to the three classification algorithms, and the best model was selected using several evaluation measures. We present our approach to implementing the best model, which is built on the profession of the parents, demographic features, the type of location of the student and the correlation between high school and higher secondary examinations. The results show that Random Forest is the most efficient of the three classification algorithms for the data set used.

Prediction of Student Performance using Hybrid Classification

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8241.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 6566-6570

Keyword(s):

Data Mining ◽

Academic Performance ◽

Student Performance ◽

Research Work ◽

Educational Data Mining ◽

Classification Algorithm ◽

Classification Algorithms ◽

Data Types ◽

Data Set ◽

Hybrid Classification

Data mining technologies allow the collection, storage and processing of huge amounts of data covering a large variety of data types and samples. Predicting the academic performance of students is one of the most active research topics of this era. In previous work, researchers have used different classification algorithms to predict student performance. Much research remains to be done in educational data mining and big data in education to increase the accuracy of classification algorithms in predicting the academic performance of students. In this work, we used a hybrid classification algorithm to predict student performance. Two popular classification algorithms, ID3 and J48, were applied to the data set. To build the hybrid classifier, a voting technique was applied using the Weka machine learning tool. We tested how accurately the hybrid algorithm predicts the student data set by computing its classification accuracy. The hybrid classification algorithm achieves an accuracy of 62.67%.
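
The voting step can be sketched as follows. The abstract does not state which combination rule was used, so averaging the class probabilities of the base classifiers is an assumption, and the function name and the two example distributions are hypothetical stand-ins for the outputs of the ID3 and J48 models rather than Weka's API.

```python
def vote_average(distributions):
    """Average per-classifier class-probability dicts and return (winning class, averages)."""
    classes = set().union(*distributions)
    n = len(distributions)
    avg = {c: sum(d.get(c, 0.0) for d in distributions) / n for c in classes}
    # Iterating over the sorted class names makes tie-breaking deterministic.
    winner = max(sorted(avg), key=avg.get)
    return winner, avg


if __name__ == "__main__":
    # Hypothetical class distributions for one student from two decision trees.
    tree_a = {"pass": 0.7, "fail": 0.3}
    tree_b = {"pass": 0.4, "fail": 0.6}
    print(vote_average([tree_a, tree_b]))
```

With only two base classifiers, a plain majority vote would tie whenever the trees disagree, which is why this sketch averages probabilities instead.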

Offering a hybrid approach of data mining to predict the customer churn based on bagging and boosting methods

Kybernetes ◽

10.1108/k-07-2015-0172 ◽

2016 ◽

Vol 45 (5) ◽

pp. 732-743 ◽

Cited By ~ 6

Author(s):

Mohammad Fathian ◽

Yaser Hoseinpoor ◽

Behrouz Minaei-Bidgoli

Keyword(s):

Data Mining ◽

Hybrid Approach ◽

Support Vector ◽

Self Organizing Map ◽

Ensemble Classifiers ◽

Classification Models ◽

Classifier Ensembles ◽

Data Set ◽

Content Type ◽

Customer Churn

Purpose – Churn management is a fundamental process by which firms keep their customers. Predicting customer churn is therefore essential to support such processes. The literature has introduced data mining approaches for this purpose, and results indicate that the performance of classification models increases when two or more techniques are combined. The purpose of this paper is to propose a combined model based on clustering and ensemble classifiers. Design/methodology/approach – Using the Cell2Cell churn data set, single baseline classifiers and ensemble classifiers are compared. Specifically, the self-organizing map (SOM) clustering technique and four classifier techniques (decision tree, artificial neural networks, support vector machine and K-nearest neighbors) were used. Moreover, principal component analysis (PCA) was employed to reduce the dimensionality of the features. Findings – In total, 14 models are compared with each other regarding accuracy, sensitivity, specificity, F-measure and AUC. The results showed that the combination of SOM, PCA and heterogeneous boosting achieved the best performance compared with the other classification models. Originality/value – This study examined the performance of classifier ensembles in predicting customer churn. In particular, heterogeneous classifier ensembles such as bagging and boosting are compared.
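
As a reminder of how the threshold-based measures named in the findings relate to a binary confusion matrix, here is a minimal sketch using the standard textbook definitions. The function name and the example counts are hypothetical and are not taken from the paper; AUC is omitted because it needs ranked scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts (churn = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall: share of churners that were caught
    specificity = tn / (tn + fp)   # share of non-churners correctly left alone
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical counts for 100 customers.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```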

Early Detection of Red Palm Weevil, Rhynchophorus ferrugineus (Olivier), Infestation Using Data Mining

Plants ◽

10.3390/plants10010095 ◽

2021 ◽

Vol 10 (1) ◽

pp. 95

Author(s):

Heba Kurdi ◽

Amal Al-Aldawsari ◽

Isra Al-Turaiki ◽

Abdulrahman S. Aldawood

Keyword(s):

Data Mining ◽

Plant Size ◽

Support Vector ◽

Classification Algorithms ◽

Palm Tree ◽

Rhynchophorus Ferrugineus ◽

Red Palm Weevil ◽

Palm Weevil ◽

Using Data ◽

F Measure

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted using data mining with an accuracy of up to 93%, precision above 87%, recall of 100%, and F-measure greater than 93%. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.
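
The reported precision, recall and F-measure can be checked against each other with the standard definition of the F-measure as the harmonic mean of precision $P$ and recall $R$. This is a consistency check under the assumption that the abstract uses the balanced $F_1$ form, not a figure taken from the paper.

```latex
F_1 = \frac{2PR}{P + R}
\qquad\text{and, with } R = 1:\qquad
F_1 = \frac{2P}{P + 1}.

\text{Since } \frac{2P}{P+1} \text{ increases with } P,\quad
P > 0.87 \;\Longrightarrow\; F_1 > \frac{2 \times 0.87}{1.87} \approx 0.9305 .
```

So a precision above 87% together with 100% recall does imply an F-measure greater than 93%, in line with the reported values.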

Characterization of Road Condition with Data Mining Based on Measured Kinematic Vehicle Parameters

Journal of Advanced Transportation ◽

10.1155/2018/8647607 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Johannes Masino ◽

Jakob Thumm ◽

Guillaume Levasseur ◽

Michael Frey ◽

Frank Gauterin ◽

...

Keyword(s):

Data Mining ◽

Support Vector ◽

Matlab Toolbox ◽

Data Set ◽

The Road ◽

Acceleration Sensors ◽

Road Surfaces ◽

Road Condition ◽

Sensor Signals

This work aims at classifying the road condition with data mining methods using simple acceleration sensors and gyroscopes installed in vehicles. Two classifiers are developed with a support vector machine (SVM) to distinguish between different types of road surfaces, such as asphalt and concrete, and obstacles, such as potholes or railway crossings. From the sensor signals, frequency-based features are extracted and evaluated automatically with MANOVA. The selected features and their relevance for predicting the classes are discussed. The best features are used for designing the classifiers. Finally, the methods developed and applied in this work are implemented in a Matlab toolbox with a graphical user interface. The toolbox visualizes the classification results on maps, thus enabling manual verification of the results. In cross-validation, the average accuracy is 81.0% for classifying obstacles and 96.1% for classifying road material. The results are discussed on a comprehensive exemplary data set.

A Survey on Major Classification Algorithms and Comparative Analysis of Few Classification Algorithms on Contact Lenses Data Set Using Data Mining Tool

New Trends in Computational Vision and Bio-inspired Computing ◽

10.1007/978-3-030-41862-5_121 ◽

2020 ◽

pp. 1201-1209

Author(s):

Syed Nawaz Pasha ◽

D. Ramesh ◽

Mohammad Sallauddin

Keyword(s):

Data Mining ◽

Comparative Analysis ◽

Contact Lenses ◽

Classification Algorithms ◽

Data Set ◽

Data Mining Tool ◽

Mining Tool ◽

Using Data

A Survey on Phishing Detection and The Importance of Feature Selection In Data Mining Classification Algorithms

Issue 4 - Journal of Science and Technology ◽

10.46243/jst.2020.v5.i6.pp11-18 ◽

2020 ◽

pp. 11-18

Keyword(s):

Data Mining ◽

Feature Selection ◽

Support Vector ◽

Classification Algorithms ◽

End User ◽

Preparation Methods ◽

Survey Paper ◽

Vector Machines ◽

Feature Selection Techniques ◽

Phishing Detection

In this era of the Internet, the issue of information security is at its peak. One of the main threats in the cyber world is the phishing attack, an email or website fraud method that targets a genuine web page or email and hacks it without the consent of the end user. Various techniques help to classify whether a website or an email is legitimate or fake. The major contributors to the detection of phishing fraud include the classification algorithms, the feature selection techniques or dataset preparation methods, and the feature extraction, which plays an important role in the detection as well as the prevention of these attacks. This survey paper studies the effect of all these contributors and the approaches applied in recent papers. The classification algorithms implemented include Decision Tree, Random Forest, Support Vector Machines, Logistic Regression, Lazy K Star, Naive Bayes and J48.

A Data-Driven Exploratory Approach for Level Curve Estimation With Autonomous Underwater Agents

Volume 2: Mechatronics; Estimation and Identification; Uncertain Systems and Robustness; Path Planning and Motion Control; Tracking Control Systems; Multi-Agent and Networked Systems; Manufacturing; Intelligent Transportation and Vehicles; Sensors and Actuators; Diagnostics and Detection; Unmanned, Ground and Surface Robotics; Motion and Vibration Control Applications ◽

10.1115/dscc2017-5118 ◽

2017 ◽

Author(s):

Hsien-Chung Lin ◽

Eugen Solowjow ◽

Masayoshi Tomizuka ◽

Edwin Kreuzer

Keyword(s):

Concentration Field ◽

Real Data ◽

Support Vector ◽

Level Curve ◽

Data Set ◽

Curve Estimation ◽

Numerical Studies ◽

Exploratory Approach ◽

Vector Machines ◽

Myopic Strategy

This contribution presents a method to estimate environmental boundaries with mobile agents. The agents sample a concentration field of interest at their respective positions and infer a level curve of the unknown field. The presented method is based on support vector machines (SVMs), whereby the concentration level of interest serves as the decision boundary. The field itself does not have to be estimated in order to obtain the level curve which makes the method computationally very appealing. A myopic strategy is developed to pick locations that yield most informative concentration measurements. Cooperative operations of multiple agents are demonstrated by dividing the domain in Voronoi tessellations. Numerical studies demonstrate the feasibility of the method on a real data set of the California coastal area. The exploration strategy is benchmarked against random walk which it clearly outperforms.

Soft computing based audio signal analysis for accident prediction

International Journal of Pervasive Computing and Communications ◽

10.1108/ijpcc-08-2020-0120 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Hima Bindu Valiveti ◽

Anil Kumar B. ◽

Lakshmi Chaitanya Duggineni ◽

Swetha Namburu ◽

Swaraja Kuraparthi

Keyword(s):

Feature Extraction ◽

Audio Signal ◽

Support Vector ◽

Classification Algorithms ◽

Spectral Features ◽

Road Accidents ◽

Content Type ◽

Detection Techniques ◽

Car Crashes ◽

High Level

Purpose: Road accidents, which are inadvertent mishaps, can be detected automatically and alerts sent instantly by combining image processing techniques with on-road video surveillance systems. However, relying exclusively on visual information leads to uncertainty, especially under adverse conditions such as night time, dark areas and unfavourable weather (snowfall, rain and fog) that result in faint visibility. The main goal of the proposed work is certainty about accident occurrence. Design/methodology/approach: The authors propose a method for detecting road accidents by analyzing audio signals to identify hazardous situations such as tire skidding and car crashes. The aim of the project is to build a simple and complete audio event detection system that uses signal feature extraction methods to improve detection accuracy. The experimental analysis is carried out on a publicly available real-time data set consisting of audio samples such as car crashes and tire skidding. Temporal features of the recorded audio signal such as energy, volume and zero crossing rate (ZCR), spectral features such as spectral centroid, spectral spread, spectral roll-off factor and spectral flux, the psychoacoustic feature energy sub-bands ratio, and the Gammatonegram are computed. The extracted features are pre-processed, then trained and tested using Support Vector Machine (SVM) and K-nearest neighbors (KNN) classification algorithms to predict accident occurrence for various SNR ranges. The combination of the Gammatonegram with temporal and spectral features proves superior to the existing detection techniques. Findings: Temporal, spectral and psychoacoustic features and the Gammatonegram of the recorded audio signal are extracted. A high-level vector is generated based on the centroid, and the extracted features are classified with machine learning algorithms such as SVM, KNN and DT. The audio samples collected have varied SNR ranges, and the accuracy of the classification algorithms is thoroughly tested. Practical implications: Denoising the audio samples for accurate feature extraction was a tedious chore. Originality/value: The existing literature cites extraction of temporal and spectral features followed by the application of classification algorithms. To improve classification, the authors have chosen to construct a high-level vector from all four extracted feature types: temporal, spectral, psychoacoustic and Gammatonegram. The classification algorithms are employed on samples collected at varied SNR ranges.
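
Two of the temporal features named above, zero crossing rate and energy, can be sketched in a few lines. These are common textbook definitions and may differ in normalization from the ones the authors used. The function names and the sample frames are hypothetical.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as non-negative)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


if __name__ == "__main__":
    # Hypothetical frames: a rapidly alternating signal and a constant one.
    noisy = [1.0, -1.0, 1.0, -1.0]
    flat = [0.5, 0.5, 0.5, 0.5]
    print(zero_crossing_rate(noisy), short_time_energy(noisy))
    print(zero_crossing_rate(flat), short_time_energy(flat))
```

In a full pipeline these values would be computed per frame and concatenated with the spectral and psychoacoustic features before being passed to the classifier.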

Applying data mining algorithms to real estate appraisals: a comparative study

International Journal of Housing Markets and Analysis ◽

10.1108/ijhma-07-2020-0080 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Thiago Cesar de Oliveira ◽

Lúcio de Medeiros ◽

Daniel Henrique Marco Detzel

Keyword(s):

Data Mining ◽

Real Estate ◽

Support Vector ◽

Predictive Capacity ◽

Content Type ◽

Data Mining Algorithms ◽

Wide Range ◽

Very Large Databases ◽

Mining Algorithms ◽

Statistical Results

Purpose: Real estate appraisals are becoming an increasingly important means of backing up financial operations based on the values of these kinds of assets. However, in very large databases, predictive capacity is reduced when traditional methods, such as multiple linear regression (MLR), are used. This paper aims to determine whether, in these cases, the application of data mining algorithms can achieve superior statistical results. First, real estate appraisal databases from five towns and cities in the State of Paraná, Brazil, were obtained from the Caixa Econômica Federal bank. Design/methodology/approach: After initial validations, additional databases were generated with real, transformed and nominal values, in clean and raw data. Each was subjected to a wide range of data mining algorithms (multilayer perceptron, support vector regression, K-star, M5Rules and random forest), either isolated or combined (regression by discretization – logistic, bagging and stacking), with the use of 10-fold cross-validation in Weka software. Findings: The results showed more varied incremental statistical results with the use of algorithms than those obtained by MLR, especially when combined algorithms were used. The largest increments were obtained in databases with a large amount of data and in those where minor initial data cleaning was carried out. The paper also conducts a further analysis, including an algorithmic ranking based on the number of significant results obtained. Originality/value: The authors did not find similar studies conducted in Brazil.
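
The 10-fold cross-validation mentioned in the methodology can be sketched as an index split. This is a generic illustration, not Weka's implementation, and the function name, seed and sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(25, k=10):
        print(len(train_idx), len(test_idx))
```

Each appraisal record is used for testing exactly once and for training in the other nine rounds. The error statistics of the ten rounds are then averaged to compare algorithms.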
