scholarly journals Learning Time Acceleration in Support Vector Regression: A Case Study in Educational Data Mining

Stats ◽  
2021 ◽  
Vol 4 (3) ◽  
pp. 682-700
Author(s):  
Jonatha Sousa Pimentel ◽  
Raydonal Ospina ◽  
Anderson Ara

The development of a country involves directly investing in the education of its citizens. Learning analytics/educational data mining (LA/EDM) allows access to big observational structured/unstructured data captured from educational settings and relies mostly on machine learning algorithms to extract useful information. Support vector regression (SVR) is a supervised statistical learning approach that allows modelling and predicts the performance tendency of students to direct strategic plans for the development of high-quality education. In Brazil, performance can be evaluated at the national level using the average grades of a student on their National High School Exams (ENEMs) based on their socioeconomic information and school records. In this paper, we focus on increasing the computational efficiency of SVR applied to ENEM for online requisitions. The results are based on an analysis of a massive data set composed of more than five million observations, and they also indicate computational learning time savings of more than 90%, as well as providing a prediction of performance that is compatible with traditional modeling.

Author(s):  
Selami Bagriyanik ◽  
Adem Karahoca

Cosmic Function Point (CFP) measurement errors leads budget, schedule and quality problems in software projects. Therefore, it’s important to identify and plan requirements engineers’ CFP training need quickly and correctly. The purpose of this paper is to identify software requirements engineers’ COSMIC Function Point measurement competence development need by using machine learning algorithms and requirements artifacts created by engineers. Used artifacts have been provided by a large service and technology company ecosystem in Telco. First, feature set has been extracted from the requirements model at hand. To do the data preparation for educational data mining, requirements and COSMIC Function Point (CFP) audit documents have been converted into CFP data set based on the designed feature set. This data set has been used to train and test the machine learning models by designing two different experiment settings to reach statistically significant results. Ten different machine learning algorithms have been used. Finally, algorithm performances have been compared with a baseline and each other to find the best performing models on this data set. In conclusion, REPTree, OneR, and Support Vector Machines (SVM) with Sequential Minimal Optimization (SMO) algorithms achieved top performance in forecasting requirements engineers’ CFP training need.


Author(s):  
Shler Farhad Khorshid ◽  
Adnan Mohsin Abdulazeez ◽  
Amira Bibo Sallow

Breast cancer is one of the most common diseases among women, accounting for many deaths each year. Even though cancer can be treated and cured in its early stages, many patients are diagnosed at a late stage. Data mining is the method of finding or extracting information from massive databases or datasets, and it is a field of computer science with a lot of potentials. It covers a wide range of areas, one of which is classification. Classification may also be accomplished using a variety of methods or algorithms. With the aid of MATLAB, five classification algorithms were compared. This paper presents a performance comparison among the classifiers: Support Vector Machine (SVM), Logistics Regression (LR), K-Nearest Neighbors (K-NN), Weighted K-Nearest Neighbors (Weighted K-NN), and Gaussian Naïve Bayes (Gaussian NB). The data set was taken from UCI Machine learning Repository. The main objective of this study is to classify breast cancer women using the application of machine learning algorithms based on their accuracy. The results have revealed that Weighted K-NN (96.7%) has the highest accuracy among all the classifiers.


Air pollution has a serious impact on human health. It occurs because of natural and man-made factors. The major contribution of this research is that it provides a comparison between different methodologies and techniques of mathematical and machine learning models. The process began with integrating data from different sources at different time interval. The preprocessing phase resulted in two different datasets: one-hour and five-minute datasets. Next, we established a forecasting model for particulate matter PM2.5, which is one of the most prevalent air pollutants and its concentration affects air quality. Additionally, we completed a multivariate analysis to predict the PM2.5 value and check the effects of other air pollutants, traffic, and weather. The algorithms used are support vector regression, k-nearest neighbors and decision tree models. The results showed that for the one-hour data set, of the three algorithms, support vector regression has the least root-mean-square error (RMSE) and also lowest value in mean absolute error (MAE). Alternatively, for the five-minute dataset, we found that the auto-regression model showed the least RMSE and MAE; however, this model only predicts short-term PM2.5.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Johannes Masino ◽  
Jakob Thumm ◽  
Guillaume Levasseur ◽  
Michael Frey ◽  
Frank Gauterin ◽  
...  

This work aims at classifying the road condition with data mining methods using simple acceleration sensors and gyroscopes installed in vehicles. Two classifiers are developed with a support vector machine (SVM) to distinguish between different types of road surfaces, such as asphalt and concrete, and obstacles, such as potholes or railway crossings. From the sensor signals, frequency-based features are extracted, evaluated automatically with MANOVA. The selected features and their meaning to predict the classes are discussed. The best features are used for designing the classifiers. Finally, the methods, which are developed and applied in this work, are implemented in a Matlab toolbox with a graphical user interface. The toolbox visualizes the classification results on maps, thus enabling manual verification of the results. The accuracy of the cross-validation of classifying obstacles yields 81.0% on average and of classifying road material 96.1% on average. The results are discussed on a comprehensive exemplary data set.


Student Performance Management is one of the key pillars of the higher education institutions since it directly impacts the student’s career prospects and college rankings. This paper follows the path of learning analytics and educational data mining by applying machine learning techniques in student data for identifying students who are at the more likely to fail in the university examinations and thus providing needed interventions for improved student performance. The Paper uses data mining approach with 10 fold cross validation to classify students based on predictors which are demographic and social characteristics of the students. This paper compares five popular machine learning algorithms Rep Tree, Jrip, Random Forest, Random Tree, Naive Bayes algorithms based on overall classifier accuracy as well as other class specific indicators i.e. precision, recall, f-measure. Results proved that Rep tree algorithm outperformed other machine learning algorithms in classifying students who are at more likely to fail in the examinations.


2017 ◽  
Vol 11 (8) ◽  
pp. 92
Author(s):  
Waleed Dhhan ◽  
Habshah Midi ◽  
Thaera Alameer

Support vector regression is used to evaluate the linear and non-linear relationships among variables. Although it is non-parametric technique, it is still affected by outliers, because the possibility to select them as support vectors. In this article, we proposed a robust support vector regression for linear and nonlinear target functions. In order to carry out this goal, the support vector regression model with fixed parameters is used to detect and minimize the effects of abnormal points in the data set. The efficiency of the proposed method is investigated by using real and simulation examples.


2015 ◽  
Vol 813-814 ◽  
pp. 1104-1113 ◽  
Author(s):  
A. Sumesh ◽  
Dinu Thomas Thekkuden ◽  
Binoy B. Nair ◽  
K. Rameshkumar ◽  
K. Mohandas

The quality of weld depends upon welding parameters and exposed environment conditions. Improper selection of welding process parameter is one of the important reasons for the occurrence of weld defect. In this work, arc sound signals are captured during the welding of carbon steel plates. Statistical features of the sound signals are extracted during the welding process. Data mining algorithms such as Naive Bayes, Support Vector Machines and Neural Network were used to classify the weld conditions according to the features of the sound signal. Two weld conditions namely good weld and weld with defects namely lack of fusion, and burn through were considered in this study. Classification efficiencies of machine learning algorithms were compared. Neural network is found to be producing better classification efficiency comparing with other algorithms considered in this study.


2020 ◽  
Vol 19 ◽  
pp. 153303382090982
Author(s):  
Melek Akcay ◽  
Durmus Etiz ◽  
Ozer Celik ◽  
Alaattin Ozen

Background and Aim: Although the prognosis of nasopharyngeal cancer largely depends on a classification based on the tumor-lymph node metastasis staging system, patients at the same stage may have different clinical outcomes. This study aimed to evaluate the survival prognosis of nasopharyngeal cancer using machine learning. Settings and Design: Original, retrospective. Materials and Methods: A total of 72 patients with a diagnosis of nasopharyngeal cancer who received radiotherapy ± chemotherapy were included in the study. The contribution of patient, tumor, and treatment characteristics to the survival prognosis was evaluated by machine learning using the following techniques: logistic regression, artificial neural network, XGBoost, support-vector clustering, random forest, and Gaussian Naive Bayes. Results: In the analysis of the data set, correlation analysis, and binary logistic regression analyses were applied. Of the 18 independent variables, 10 were found to be effective in predicting nasopharyngeal cancer-related mortality: age, weight loss, initial neutrophil/lymphocyte ratio, initial lactate dehydrogenase, initial hemoglobin, radiotherapy duration, tumor diameter, number of concurrent chemotherapy cycles, and T and N stages. Gaussian Naive Bayes was determined as the best algorithm to evaluate the prognosis of machine learning techniques (accuracy rate: 88%, area under the curve score: 0.91, confidence interval: 0.68-1, sensitivity: 75%, specificity: 100%). Conclusion: Many factors affect prognosis in cancer, and machine learning algorithms can be used to determine which factors have a greater effect on survival prognosis, which then allows further research into these factors. In the current study, Gaussian Naive Bayes was identified as the best algorithm for the evaluation of prognosis of nasopharyngeal cancer.


Diagnostics ◽  
2019 ◽  
Vol 9 (3) ◽  
pp. 104 ◽  
Author(s):  
Ahmed ◽  
Yigit ◽  
Isik ◽  
Alpkocak

Leukemia is a fatal cancer and has two main types: Acute and chronic. Each type has two more subtypes: Lymphoid and myeloid. Hence, in total, there are four subtypes of leukemia. This study proposes a new approach for diagnosis of all subtypes of leukemia from microscopic blood cell images using convolutional neural networks (CNN), which requires a large training data set. Therefore, we also investigated the effects of data augmentation for an increasing number of training samples synthetically. We used two publicly available leukemia data sources: ALL-IDB and ASH Image Bank. Next, we applied seven different image transformation techniques as data augmentation. We designed a CNN architecture capable of recognizing all subtypes of leukemia. Besides, we also explored other well-known machine learning algorithms such as naive Bayes, support vector machine, k-nearest neighbor, and decision tree. To evaluate our approach, we set up a set of experiments and used 5-fold cross-validation. The results we obtained from experiments showed that our CNN model performance has 88.25% and 81.74% accuracy, in leukemia versus healthy and multiclass classification of all subtypes, respectively. Finally, we also showed that the CNN model has a better performance than other wellknown machine learning algorithms.


Author(s):  
RYO INOKUCHI ◽  
SADAAKI MIYAMOTO

Recently kernel methods in support vector machines have widely been used in machine learning algorithms to obtain nonlinear models. Clustering is an unsupervised learning method which divides whole data set into subgroups, and popular clustering algorithms such as c-means are employing kernel methods. Other kernel-based clustering algorithms have been inspired from kernel c-means. However, the formulation of kernel c-means has a high computational complexity. This paper gives an alternative formulation of kernel-based clustering algorithms derived from competitive learning clustering. This new formulation obviously uses sequential updating or on-line learning to avoid high computational complexity. We apply kernel methods to related algorithms: learning vector quantization and self-organizing map. We moreover consider kernel methods for sequential c-means and its fuzzy version by the proposed formulation.


Sign in / Sign up

Export Citation Format

Share Document