ROC curve, lift chart and calibration plot

2006 ◽  
Vol 3 (1) ◽  
Author(s):  
Miha Vuk ◽  
Tomaž Curk

This paper presents the ROC curve, lift chart and calibration plot, three well-known graphical techniques that are useful for evaluating the quality of classification models used in data mining and machine learning. Each technique, normally used and studied separately, defines its own measure of classification quality and its visualization. Here, we give a brief survey of the methods and establish a common mathematical framework that adds some new aspects, explanations and interrelations between these techniques. We conclude with an empirical evaluation and a few examples of how to use the presented techniques to boost classification accuracy.
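As a hedged illustration of the three techniques (not the authors' unified framework), the sketch below derives all three diagnostics from one set of predicted probabilities; scikit-learn and the synthetic dataset are our assumptions.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC curve: true positive rate vs. false positive rate over all thresholds.
fpr, tpr, _ = roc_curve(y_te, proba)
print("AUC:", round(roc_auc_score(y_te, proba), 3))

# Lift chart: share of positives captured vs. share of cases targeted,
# with cases ranked from highest to lowest predicted score.
rank = np.argsort(-proba)
lift_y = np.cumsum(y_te[rank]) / y_te.sum()
lift_x = np.arange(1, len(y_te) + 1) / len(y_te)

# Calibration plot: observed positive rate within each predicted-probability bin.
prob_true, prob_pred = calibration_curve(y_te, proba, n_bins=10)
```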

2021 ◽  
Vol 40 (5) ◽  
pp. 9361-9382 ◽  
Author(s):  
Naeem Iqbal ◽  
Rashid Ahmad ◽  
Faisal Jamil ◽  
Do-Hyeun Kim

Quality prediction plays an essential role in the business outcome of a product and, given its business relevance, has been studied extensively in recent years. With the advancement of machine learning (ML) and the advent of robust and sophisticated ML algorithms, it has become feasible to analyze the factors influencing the success of movies. This paper presents a hybrid-features prediction model based on pre-release and social media data features, using multiple ML techniques to predict the quality of pre-release movies for effective business resource planning. This study aims to integrate pre-release and social media data features to form a hybrid features-based movie quality prediction (MQP) model. The proposed model comprises two experimental setups: (i) predicting movie quality using the original set of features and (ii) developing a subset of features based on principal component analysis (PCA) to predict the movie success class. This work employs and implements different ML-based classification models, such as Decision Tree (DT), Support Vector Machines with linear and quadratic kernels (L-SVM and Q-SVM), Logistic Regression (LR), Bagged Tree (BT) and Boosted Tree (BOT), to predict the quality of the movies. Different performance measures are used to evaluate the proposed ML-based classification models, such as Accuracy (AC), Precision (PR), Recall (RE) and F-Measure (FM). The experimental results reveal that the BT and BOT classifiers produced high accuracy compared with the other classifiers (DT, LR, L-SVM and Q-SVM), achieving accuracies of 90.1% and 89.7%, respectively, which shows the efficiency of the proposed MQP model compared with other state-of-the-art techniques. The proposed work is also compared with existing prediction models, and the experimental results indicate that the proposed MQP model performs slightly better. These results will help the movie industry formulate business resources effectively, such as investment, number of screens and release-date planning.
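A hedged sketch of the comparison loop the abstract describes; the movie feature set is not public, so a synthetic stand-in is used, and the scikit-learn estimators below are our approximations of the named classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the hybrid pre-release + social media feature set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "DT": DecisionTreeClassifier(random_state=1),
    "L-SVM": SVC(kernel="linear"),
    "Q-SVM": SVC(kernel="poly", degree=2),   # quadratic kernel
    "LR": LogisticRegression(max_iter=1000),
    "BT": BaggingClassifier(random_state=1),          # bagged trees by default
    "BOT": GradientBoostingClassifier(random_state=1),  # boosted trees
}
for name, clf in models.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: AC={accuracy_score(y_te, pred):.3f} "
          f"PR={precision_score(y_te, pred):.3f} "
          f"RE={recall_score(y_te, pred):.3f} "
          f"FM={f1_score(y_te, pred):.3f}")
```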


2021 ◽  
pp. 097215092098485
Author(s):  
Sonika Gupta ◽  
Sushil Kumar Mehta

Data mining techniques have proven quite effective not only in detecting financial statement frauds but also in discovering other financial crimes, such as credit card fraud, loan and security fraud, corporate fraud, and bank and insurance fraud. In recent years, classification using data mining techniques has been accepted as one of the most credible methodologies for detecting symptoms of financial statement fraud by scanning the published financial statements of companies. The retrieved literature that has used data mining classification techniques can be broadly categorized, on the basis of the type of technique applied, into statistical techniques and machine learning techniques. The biggest challenge in executing the classification process with data mining techniques lies in collecting a data sample of fraudulent companies and mapping that sample against non-fraudulent companies. In this article, a systematic literature review (SLR) of studies from the area of financial statement fraud detection has been conducted, covering research articles published between 1995 and 2020. Further, a meta-analysis has been performed to establish the effect of data sample mapping of fraudulent against non-fraudulent companies on the classification methods, by comparing the overall classification accuracy reported in the literature. The retrieved literature indicates that a fraudulent sample can either be equally paired with a non-fraudulent sample (1:1 data mapping) or be unequally mapped using a 1:many ratio to increase the sample size proportionally. Based on the meta-analysis of the research articles, it can be concluded that machine learning approaches can achieve better classification accuracy than statistical approaches, particularly when the availability of sample data is low. High classification accuracy can be obtained even with a 1:1 mapped data set using machine learning classification approaches.
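A minimal sketch of the 1:1 data mapping the review describes: each fraudulent firm is paired with one randomly drawn non-fraudulent firm. The DataFrame layout and the "fraud" label column are assumptions for illustration.

```python
import pandas as pd

def map_one_to_one(df: pd.DataFrame, label: str = "fraud", seed: int = 0) -> pd.DataFrame:
    """Return a balanced sample: every fraud case paired with one clean case."""
    fraud = df[df[label] == 1]
    clean = df[df[label] == 0].sample(n=len(fraud), random_state=seed)
    # Concatenate and shuffle so classifiers do not see the classes in blocks.
    return pd.concat([fraud, clean]).sample(frac=1, random_state=seed)
```

A 1:many mapping would simply draw `k * len(fraud)` clean cases instead, trading class balance for a larger sample.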


2021 ◽  
Vol 10 (7) ◽  
pp. 436
Author(s):  
Amerah Alghanim ◽  
Musfira Jilani ◽  
Michela Bertolotto ◽  
Gavin McArdle

Volunteered Geographic Information (VGI) is often collected by non-expert users. This raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures, which compare VGI to authoritative data sources such as National Mapping Agencies, are common, but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures, which compare the data to heuristics or models built from the VGI data itself, are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality, where they can infer and predict the properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Using our proposed novel approach, we obtained an average classification accuracy of 84.12%. This result outperforms existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is important. To address this issue, we have also developed a new trust measure using direct and indirect characteristics of OSM data, such as its edit history, along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy within the machine learning model shows that the trusted data collected with the new approach improves the prediction accuracy of our machine learning technique. Specifically, our results demonstrate that the classification accuracy of our developed model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. Consequently, such results can be used to assess the quality of OSM and suggest improvements to the data set.
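The abstract does not fully specify the trust measure, so the sketch below is a hypothetical illustration of blending a direct (feature-level edit history) signal with an indirect (contributor reputation) signal into one score; every field name, weight and saturation constant is an assumption.

```python
def trust_score(n_versions: int, n_distinct_editors: int,
                user_reputation: float, w_direct: float = 0.6) -> float:
    """Blend edit-history stability with contributor reputation into [0, 1].

    Assumed inputs: version and editor counts from the OSM feature's edit
    history, and a contributor reputation already scaled to [0, 1].
    """
    # Direct signal: features confirmed across several versions by several
    # distinct editors are assumed more trustworthy (saturating at 10).
    direct = min(n_versions + n_distinct_editors, 10) / 10
    # Indirect signal: reputation of the users who contributed the data.
    return w_direct * direct + (1 - w_direct) * user_reputation
```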


Dependability ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 54-64
Author(s):  
O. B. Pronevich ◽  
M. V. Zaitsev

The paper aims to examine various approaches to improving the quality of predictions and the classification of unbalanced data that allow improving the accuracy of rare event classification. When predicting the onset of rare events using machine learning techniques, researchers face the problem of inconsistency between the quality of trained models and their actual ability to correctly predict the occurrence of a rare event. The paper examines model training under unbalanced initial data. The subject of research is information on incidents and hazardous events at railway power supply facilities. The problem of unbalanced data is expressed in the noticeable imbalance between the types of observed events, i.e., the numbers of instances. Methods. While handling unbalanced data, depending on the nature of the problem at hand and the quality and size of the initial data, various Data Science-based techniques for improving the quality of classification models and predictions are used. Some of those methods focus on the attributes and parameters of classification models; these include FAST, CFS, fuzzy classifiers, GridSearchCV, etc. Another group of methods is oriented towards generating representative subsets of the initial datasets, i.e., samples. Data sampling techniques allow examining the effect of class proportions on the quality of machine learning. In particular, this paper considers the NearMiss method in detail. Results. The problem of class imbalance with respect to the analysis of the number of incidents at railway facilities has existed since 2015. Despite the decreasing share of hazardous events at railway power supply facilities in the three years since 2018, an increase in the number of such events cannot be ruled out. Monthly statistics of hazardous event distribution exhibit no trend in declines and peaks. In this context, the optimal period of observation of the number of incidents and hazardous events is one month. A visualization of the class ratio showed the absence of a clear boundary between the members of the majority class (incidents) and those of the minority class (hazardous events). The class ratio was studied in two and three dimensions, in actual values and using principal component analysis. Such “proximity” of the classes is one of the causes of wrong predictions. The authors analysed past research on ways of improving the quality of machine learning on unbalanced data. The terms that describe the degree of class imbalance have been defined and clarified. The strengths and weaknesses of 50 different methods of handling such data were studied and set forth. Out of the set of methods for handling the numbers of class members in the classification (prediction of the occurrence) of rare hazardous events in railway transportation, the NearMiss method was chosen; it allows experimenting with the ratios and methods of selecting class members. As a result of a series of experiments, the accuracy of rare hazardous event classification was improved from 0 to 70-90%.
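A sketch of NearMiss undersampling with the imbalanced-learn library; the railway incident data is not public, so a synthetic unbalanced set stands in for it.

```python
from collections import Counter

from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

# Roughly 95% incidents (majority) vs. 5% hazardous events (minority).
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print("before:", Counter(y))

# version=1 keeps the majority samples whose average distance to the nearest
# minority samples is smallest; versions 2 and 3 apply different selection
# rules, which is what allows experimenting with member-selection strategies.
X_res, y_res = NearMiss(version=1).fit_resample(X, y)
print("after:", Counter(y_res))
```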


2013 ◽  
Vol 380-384 ◽  
pp. 1469-1472
Author(s):  
Gui Jun Shan

Partition methods for real-valued data play an extremely important role in decision tree algorithms in data mining and machine learning, because these algorithms require the values of attributes to be discrete. In this paper, we propose a novel partition method for real-valued data in decision trees using a statistical criterion. The method constructs a statistical criterion to find accurate merging intervals. In addition, we present a heuristic partition algorithm to achieve a desired partition result, with the aim of improving the performance of decision tree algorithms. Empirical experiments on UCI real-world data show that the new algorithm generates a better partition scheme than existing algorithms, improving the classification accuracy of the C4.5 decision tree.
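The abstract does not spell out the statistical criterion, so as a hedged stand-in this sketch merges adjacent intervals bottom-up with a chi-square statistic, in the spirit of ChiMerge-style discretization (the threshold 3.84 is the 0.05 critical value for one degree of freedom, i.e., two classes).

```python
import numpy as np

def chi2_stat(a: np.ndarray, b: np.ndarray) -> float:
    """Chi-square statistic for a 2 x k contingency table of class counts."""
    table = np.vstack([a, b]).astype(float)
    expected = table.sum(1, keepdims=True) * table.sum(0) / table.sum()
    mask = expected > 0
    return float(((table - expected)[mask] ** 2 / expected[mask]).sum())

def chimerge(values, labels, threshold=3.84):
    """Merge adjacent intervals while their class distributions stay similar."""
    classes = sorted(set(labels))
    uniq = sorted(set(values))
    # One interval per distinct value; counts[i] holds per-class counts.
    counts = [np.array([sum(1 for v, l in zip(values, labels)
                            if v == u and l == c) for c in classes])
              for u in uniq]
    bounds = list(uniq)
    while len(counts) > 1:
        chis = [chi2_stat(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = int(np.argmin(chis))
        if chis[i] >= threshold:      # every adjacent pair now differs enough
            break
        counts[i] = counts[i] + counts[i + 1]   # merge the most similar pair
        del counts[i + 1], bounds[i + 1]
    return bounds   # lower boundaries of the resulting discrete intervals
```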


Author(s):  
Mohsin Iqbal ◽  
Saif Ur Rehman ◽  
Saira Gillani ◽  
Sohail Asghar

The key objective of the chapter is to study classification accuracy when using feature selection with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers by applying several feature selection methods. Among the filter methods, Information Gain (IG), Gain Ratio (GR) and Relief-F, and among the wrapper methods, Bagging and Naive Bayes (NB), enabled the classifiers to achieve the largest increase in classification accuracy on average while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy. This is especially important in risk-sensitive applications of machine learning, where one aim is to reduce the costs of collecting, processing and storing unnecessary data.
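A small sketch of the filter-then-classify workflow; mutual information in scikit-learn plays the role of the Information Gain filter here (Gain Ratio and Relief-F would need other libraries, e.g. skrebate), and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=8, random_state=2)
for k in (40, 10):   # all features vs. the 10 highest-scoring ones
    pipe = make_pipeline(SelectKBest(mutual_info_classif, k=k), GaussianNB())
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{k} features: accuracy={score:.3f}")
```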


Author(s):  
Damian Mora ◽  
José Antonio Nieto ◽  
Jorge Mateo ◽  
Behnood Bikdeli ◽  
Stefano Barco ◽  
...  

Background: Patients with pulmonary embolism (PE) who prematurely discontinue anticoagulant therapy (<90 days) are at an increased risk for death or recurrences. Methods: We used the data from the RIETE registry to compare the prognostic ability of 5 machine-learning (ML) models and logistic regression to identify patients at increased risk for the composite of fatal PE or recurrent venous thromboembolism (VTE) 30 days after discontinuation. ML models included Decision Tree, the K-Nearest Neighbors algorithm, Support Vector Machine, Ensemble and Neural Network (NN) models. A “full” model with 70 variables and a “reduced” model with 23 were analyzed. Model performance was assessed by confusion matrix metrics on the testing data for each model and a calibration plot. Results: Among 34,447 patients with PE, 1,348 (3.9%) discontinued therapy prematurely. Fifty-one (3.8%) developed fatal PE or sudden death and 24 (1.8%) had non-fatal VTE recurrences within 30 days after discontinuation. ML-NN was the best method for identification of patients experiencing the composite endpoint, predicting the composite outcome with an area under the receiver operating characteristics (ROC) curve of 0.96 (95% confidence interval [CI], 0.95-0.98), using either 70 or 23 variables captured before discontinuation. Similar numbers were obtained for sensitivity, specificity, positive predictive value, negative predictive value and accuracy. The discrimination of logistic regression was inferior (area under ROC curve, 0.76 [95% CI 0.70-0.81]). The calibration plot showed similar deviations from the perfect line for ML-NN and logistic regression. Conclusions: The ML-NN method predicted the composite outcome after premature discontinuation of anticoagulation very well and outperformed traditional logistic regression.
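As a hedged aside, the confusion-matrix metrics the study reports can all be read off one binary prediction vector; the helper below shows the arithmetic (the registry data itself is not public, so inputs are whatever test-set labels and predictions are available).

```python
from sklearn.metrics import confusion_matrix

def confusion_report(y_true, y_pred) -> dict:
    """Sensitivity, specificity, PPV, NPV and accuracy from binary predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),   # recall on the event class
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),           # positive predictive value
        "NPV": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```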


2015 ◽  
Vol 4 (1) ◽  
pp. 148
Author(s):  
Nahid Khorashadizade ◽  
Hassan Rezaei

Hepatitis is a disease involving liver injury. Rapid diagnosis of this disease prevents its progression to cirrhosis of the liver. Data mining is a new branch of science that helps physicians make proper decisions. In data mining, feature reduction and machine learning algorithms are useful for reducing the complexity of the problem and for disease diagnosis, respectively. In this study, a new algorithm is proposed for hepatitis diagnosis based on Principal Component Analysis (PCA) and the Error Minimized Extreme Learning Machine (EMELM). The algorithm includes two stages. In the feature reduction phase, records with missing values were deleted and the hepatitis dataset was normalized to the [0,1] range; thereafter, principal component analysis was applied for feature reduction. In the classification phase, the reduced dataset is classified using EMELM. To evaluate the algorithm, the hepatitis disease dataset from the UCI Machine Learning Repository (University of California) was selected. The features of this dataset were reduced from 19 to 6 using PCA, and the accuracy on the reduced dataset was obtained using EMELM. The results revealed that the proposed hybrid intelligent diagnosis system reached higher classification accuracy in a shorter time compared with other methods.
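A minimal sketch of the pipeline, assuming a basic Extreme Learning Machine (random hidden layer, least-squares output weights); the error-minimized variant (EMELM) grows the hidden layer incrementally, which this fixed-size version does not attempt to reproduce.

```python
import numpy as np
from sklearn.decomposition import PCA

class BasicELM:
    """Basic ELM: random hidden projection + pseudo-inverse output weights."""

    def __init__(self, n_hidden: int = 50, seed: int = 0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "BasicELM":
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)      # random hidden-layer outputs
        T = np.eye(int(y.max()) + 1)[y]       # one-hot targets
        self.beta = np.linalg.pinv(H) @ T     # least-squares output weights
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return (np.tanh(X @ self.W + self.b) @ self.beta).argmax(axis=1)

# Usage on data already normalized to [0, 1]: reduce 19 features to 6, then classify.
# X6 = PCA(n_components=6).fit_transform(X)
# preds = BasicELM().fit(X6_train, y_train).predict(X6_test)
```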


Author(s):  
Stamatios-Aggelos N. Alexandropoulos ◽  
Sotiris B. Kotsiantis ◽  
Michael N. Vrahatis

A large variety of issues influence the success of data mining on a given problem. Two primary and important issues are the representation and the quality of the dataset. Specifically, if much redundant, unrelated, noisy or unreliable information is present, then knowledge discovery becomes a very difficult problem. It is well known that data preparation steps require significant processing time in machine learning tasks. It would be very helpful if there were preprocessing algorithms with the same reliable and effective performance across all datasets, but this is impossible. To this end, we present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.
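As a hedged example of chaining typical preprocessing steps from such surveys (missing-value imputation, scaling, removal of near-constant features) ahead of a predictor, using scikit-learn's pipeline abstraction; the step choices are illustrative, not the survey's recommendations.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # put features on one scale
    ("select", VarianceThreshold(1e-3)),           # drop near-constant features
    ("model", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train) applies every preprocessing step, in order,
# before training the final model; pipe.predict reuses the fitted steps.
```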

