Customer credit scoring using a hybrid data mining approach

Purpose: A crucial decision in financial services is how to classify credit or loan applicants into good and bad applicants. The purpose of this paper is to propose a four-stage hybrid data mining approach to support the decision-making process. Design/methodology/approach: The approach is inspired by the bagging ensemble learning method and proposes a new voting method, namely two-level majority voting, in the last stage. First, some training subsets are generated. Then several different base classifiers are tuned, and afterward some ensemble methods are applied to strengthen the tuned classifiers. Finally, two-level majority voting schemes help the approach achieve higher accuracy. Findings: A comparison of results shows that the proposed model outperforms powerful single classifiers such as multilayer perceptron (MLP), support vector machine and logistic regression (LR). In addition, it is more accurate than ensemble learning methods such as bagging-LR or rotation forest (RF)-MLP. The model outperforms single classifiers in terms of type I and II errors; it is close to some ensemble approaches such as bagging-LR and RF-MLP but fails to outperform them in terms of type I and II errors. Moreover, majority voting in the final stage provides more reliable results. Practical implications: The study concludes that the approach would be beneficial for banks, credit card companies and other credit provider organisations. Originality/value: A novel four-stage hybrid approach inspired by the bagging ensemble method is proposed. Moreover, the two-level majority voting in two different schemes in the last stage provides higher accuracy. An integrated evaluation criterion for classification errors provides enhanced insight for error comparisons.
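
The abstract does not spell out how the two voting levels are organised, so the following is only a minimal sketch under an assumption: level one takes a majority vote inside each group of predictions (for example, one classifier family applied to several training subsets), and level two takes a majority vote over the group winners. The function names and the sample labels are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def two_level_vote(grouped_predictions):
    """Vote inside each group first, then vote across the group winners."""
    level_one = [majority_vote(group) for group in grouped_predictions]
    return majority_vote(level_one)

# Hypothetical predictions for one applicant: three groups of three votes each.
groups = [
    ["good", "good", "bad"],
    ["bad", "bad", "good"],
    ["good", "bad", "good"],
]
decision = two_level_vote(groups)
print(decision)
```

Here the group winners are "good", "bad" and "good", so the second level settles on "good" even though the raw votes are split five to four.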

Hybrid supervised clustering based ensemble scheme for text classification

Kybernetes ◽

10.1108/k-10-2016-0300 ◽

2017 ◽

Vol 46 (2) ◽

pp. 330-348 ◽

Cited By ~ 7

Author(s):

Aytug Onan

Keyword(s):

Ensemble Learning ◽

Text Classification ◽

Predictive Performance ◽

Majority Voting ◽

Classifier Ensemble ◽

Support Vector ◽

Classification Algorithms ◽

Learning Methods ◽

Content Type ◽

Supervised Clustering

Purpose: The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design. Design/methodology/approach: An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, which is based on the cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversity can be provided. Each classifier is trained on the diversified training subsets, and the predictions of the individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and the C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks. Findings: The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification. Originality/value: The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.

Data Mining Approach to Analyze COVID-19 Clinical Dataset

10.53350/pjmhs211561812 ◽

2021 ◽

Vol 15 (6) ◽

pp. 1812-1819

Author(s):

Azita Yazdani ◽

Ramin Ravangard ◽

Roxana Sharifian

Keyword(s):

Artificial Intelligence ◽

Data Mining ◽

Support Vector Machine ◽

Nearest Neighbor ◽

Clinical Signs ◽

Study Data ◽

Mining Machine ◽

Support Vector ◽

K Nearest Neighbor ◽

Data Mining Approach

The new coronavirus has been spreading since the beginning of 2020, and many efforts have been made to develop vaccines to help patients recover. It is now clear that the world needs a rapid solution to curb the spread of COVID-19 worldwide with non-clinical approaches such as data mining, enhanced intelligence, and other artificial intelligence techniques. These approaches can be effective in reducing the burden on the health care system by providing the best possible way to diagnose and predict the COVID-19 epidemic. In this study, data mining models for early detection of COVID-19 in patients were developed using the epidemiological dataset of patients and individuals suspected of having COVID-19 in Iran. The C4.5, support vector machine, Naive Bayes, logistic regression, Random Forest, and k-nearest neighbor algorithms were applied directly to the dataset using RapidMiner to develop the models. Given a patient's clinical signs, the model diagnoses the risk of contracting the COVID-19 virus. Examination of the models in this study showed that the support vector machine, with 93.41% accuracy, was the most efficient in diagnosing patients with COVID-19 and was the best of the developed models. Keywords: COVID-19, Data mining, Machine Learning, Artificial Intelligence, Classification

Applying data mining algorithms to real estate appraisals: a comparative study

International Journal of Housing Markets and Analysis ◽

10.1108/ijhma-07-2020-0080 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Thiago Cesar de Oliveira ◽

Lúcio de Medeiros ◽

Daniel Henrique Marco Detzel

Keyword(s):

Data Mining ◽

Real Estate ◽

Support Vector ◽

Predictive Capacity ◽

Content Type ◽

Data Mining Algorithms ◽

Wide Range ◽

Very Large Databases ◽

Mining Algorithms ◽

Statistical Results

Purpose: Real estate appraisals are becoming an increasingly important means of backing up financial operations based on the values of these kinds of assets. However, in very large databases, there is a reduction in the predictive capacity when traditional methods, such as multiple linear regression (MLR), are used. This paper aims to determine whether in these cases the application of data mining algorithms can achieve superior statistical results. First, real estate appraisal databases from five towns and cities in the State of Paraná, Brazil, were obtained from Caixa Econômica Federal bank. Design/methodology/approach: After initial validations, additional databases were generated with real, transformed and nominal values, in clean and raw data. A wide range of data mining algorithms (multilayer perceptron, support vector regression, K-star, M5Rules and random forest) was applied to each, either isolated or combined (regression by discretization – logistic, bagging and stacking), with the use of 10-fold cross-validation in Weka software. Findings: The results showed more varied incremental statistical results with the use of algorithms than those obtained by MLR, especially when combined algorithms were used. The largest increments were obtained in databases with a large amount of data and in those where minor initial data cleaning was carried out. The paper also conducts a further analysis, including an algorithmic ranking based on the number of significant results obtained. Originality/value: The authors did not find similar studies conducted in Brazil.
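
For readers unfamiliar with 10-fold cross-validation, the sketch below shows how a dataset can be partitioned so that every sample is used for testing exactly once. It is a plain-Python illustration, not Weka's implementation: the function name is hypothetical, and the folds here are contiguous and unshuffled, whereas a real evaluation would normally randomise the order first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Example: 25 appraisal records split into 10 folds.
folds = list(k_fold_indices(25, k=10))
for train, test in folds:
    print(len(train), len(test))
```

Each model is then fitted on the training indices and scored on the test indices, and the ten scores are averaged.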

Development of a soldering quality classifier system using a hybrid data mining approach

Expert Systems with Applications ◽

10.1016/j.eswa.2011.11.097 ◽

2012 ◽

Vol 39 (5) ◽

pp. 5727-5738 ◽

Cited By ~ 14

Author(s):

Tsung-Nan Tsai

Keyword(s):

Data Mining ◽

Classifier System ◽

Data Mining Approach ◽

Hybrid Data

Optimizing the prediction accuracy of load-settlement behavior of single pile using a self-learning data mining approach

MATEC Web of Conferences ◽

10.1051/matecconf/201925802010 ◽

2019 ◽

Vol 258 ◽

pp. 02010

Author(s):

Doddy Prayogo ◽

Yudas Tadeus Teddy Susanto

Keyword(s):

Data Mining ◽

Prediction Accuracy ◽

Soil Layer ◽

Training Data ◽

Support Vector ◽

Data Mining Approach ◽

Settlement Behavior ◽

Data Points ◽

Single Piles ◽

Self Learning

Pile foundations are usually used when the upper soil layers are soft clay and, hence, unable to support the structures’ loads. Piles are needed to carry these loads deep into the hard soil layer. Therefore, the safety and stability of pile-supported structures depend on the behavior of the piles. Additionally, an accurate prediction of the piles’ behavior is very important to ensure satisfactory performance of the structures. Although many methods in the literature estimate the settlement of piles both theoretically and experimentally, methods for comprehensively predicting the load-settlement of piles are very limited. This study develops a new data mining approach called self-learning support vector machine (SL-SVM) to predict the load-settlement behavior of single piles. SL-SVM performance is investigated using 446 training data points and 53 test data points of cone penetration test (CPT) data obtained from the previous literature. The prediction accuracy is then compared to that of other prediction methods using three statistical measurements: mean absolute error (MAE), coefficient of correlation (R), and root mean square error (RMSE). The obtained results show that SL-SVM achieves better accuracy than do LS-SVM and BPNN. This confirms the capability of the proposed data mining method to accurately model the load-settlement behavior of single piles through CPT data. The paper offers beneficial insights for geotechnical engineers involved in estimating pile behavior.
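
The three statistical measurements named in the abstract have standard definitions, sketched below in plain Python. The function names and the sample settlement values are hypothetical and are not taken from the paper; R is computed here as the Pearson correlation coefficient, which is an assumption about the paper's usage.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def correlation_coefficient(actual, predicted):
    """Pearson correlation between measured and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    spread_a = sum((a - mean_a) ** 2 for a in actual)
    spread_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance / math.sqrt(spread_a * spread_p)

# Hypothetical measured and predicted settlements (arbitrary units).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mean_absolute_error(measured, predicted))
print(root_mean_square_error(measured, predicted))
print(correlation_coefficient(measured, predicted))
```

Lower MAE and RMSE and an R closer to 1 indicate a better fit, which is the sense in which the abstract ranks the compared methods.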

A hybrid data-mining approach in genomics and text structures

Third IEEE International Conference on Data Mining ◽

10.1109/icdm.2003.1250999 ◽

2004 ◽

Cited By ~ 1

Author(s):

H.-N. Teodorescu ◽

L.I. Fira

Keyword(s):

Data Mining ◽

Data Mining Approach ◽

Text Structures ◽

Hybrid Data

Prognosis of Diabetes Using Data mining Approach-Fuzzy C Means Clustering and Support Vector Machine

International Journal of Computer Trends and Technology ◽

10.14445/22312803/ijctt-v11p120 ◽

2014 ◽

Vol 11 (2) ◽

pp. 94-98 ◽

Cited By ~ 14

Author(s):

Ravi Sanakal ◽

Smt. T Jayakumari

Keyword(s):

Data Mining ◽

Support Vector Machine ◽

Support Vector ◽

Fuzzy C Means ◽

Data Mining Approach ◽

Fuzzy C Means Clustering ◽

Using Data

Empirical study on the effect of using synthetic attributes on classification algorithms

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-08-2016-0029 ◽

2017 ◽

Vol 10 (2) ◽

pp. 111-129 ◽

Cited By ~ 2

Author(s):

Ali Hasan Alsaffar

Keyword(s):

Data Mining ◽

Empirical Study ◽

Student Performance ◽

Software Tool ◽

Real Data ◽

Support Vector ◽

Classification Algorithms ◽

Past Performance ◽

Data Set ◽

Content Type

Purpose: The purpose of this paper is to present an empirical study of the effect of two synthetic attributes on popular classification algorithms, using data originating from student transcripts. The attributes represent past performance achievements in a course and are defined as global performance (GP) and local performance (LP). The GP of a course is an aggregated performance achieved by all students who have taken the course, and the LP of a course is an aggregated performance achieved in the prerequisite courses by the student taking the course. Design/methodology/approach: The paper uses Educational Data Mining techniques to predict student performance in courses, identifying the attributes that are most influential for predicting the final grade (performance) and reporting the effect of the two suggested attributes on the classification algorithms. As a research paradigm, the paper follows the Cross-Industry Standard Process for Data Mining, using the RapidMiner Studio software tool. Six classification algorithms are tested: C4.5 and CART Decision Trees, Naive Bayes, k-neighboring, rule-based induction and support vector machines. Findings: The outcomes of the paper show that the synthetic attributes improved the performance of the classification algorithms and that they ranked highly according to their influence on the target variable. Originality/value: This paper proposes two synthetic attributes that are integrated into a real data set. The key motivation is to improve the quality of the data and make classification algorithms perform better. The paper also presents empirical results showing the effect of these attributes on selected classification algorithms.
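
The definitions of GP and LP can be made concrete with a small sketch. The abstract says only that each is "an aggregated performance", so the use of the arithmetic mean here is an assumption, and the transcript records, course codes and function names are hypothetical.

```python
from statistics import mean

# Hypothetical transcript records: (student, course, grade points).
records = [
    ("s1", "MATH101", 3.0),
    ("s2", "MATH101", 4.0),
    ("s3", "MATH101", 2.0),
    ("s1", "CS101", 2.0),
    ("s2", "CS101", 4.0),
]
# Hypothetical prerequisite map: course -> list of prerequisite courses.
prerequisites = {"CS201": ["MATH101", "CS101"]}

def global_performance(course):
    """GP: mean grade of all students who have taken the course."""
    grades = [g for _, c, g in records if c == course]
    return mean(grades) if grades else None

def local_performance(student, course):
    """LP: mean grade of this student in the course's prerequisites."""
    wanted = prerequisites.get(course, [])
    grades = [g for s, c, g in records if s == student and c in wanted]
    return mean(grades) if grades else None

print(global_performance("MATH101"))
print(local_performance("s1", "CS201"))
```

Both values would then be appended as extra columns to each (student, course) row before the classifiers are trained.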

Hybrid data mining approach for pattern extraction from wafer bin map to improve yield in semiconductor manufacturing

International Journal of Production Economics ◽

10.1016/j.ijpe.2006.05.015 ◽

2007 ◽

Vol 107 (1) ◽

pp. 88-103 ◽

Cited By ~ 103

Author(s):

Shao-Chung Hsu ◽

Chen-Fu Chien

Keyword(s):

Data Mining ◽

Semiconductor Manufacturing ◽

Pattern Extraction ◽

Data Mining Approach ◽

Hybrid Data ◽

Bin Map

Klasifikasi Diabetes Menggunakan Model Pembelajaran Ensemble Blending [Diabetes Classification Using a Blending Ensemble Learning Model]

Jurnal ULTIMATICS ◽

10.31937/ti.v10i1.709 ◽

2018 ◽

Vol 10 (1) ◽

pp. 11-15

Author(s):

Vinnia Kemala Putri ◽

Felix Indra Kurniadi

Keyword(s):

Data Mining ◽

Ensemble Classifier ◽

Medical Data ◽

Support Vector ◽

Prediction Problem ◽

Medical Databases ◽

Data Mining Approach ◽

Blending Method ◽

Diabetes Prediction ◽

Index Terms

Diabetes mellitus is one of the deadliest diseases, and its occurrence is increasing throughout the world. This can be prevented by conducting early diagnosis and treatment. However, in developing countries, less than half of people with diabetes are diagnosed correctly, which leads to loss of human lives. In this Big Data era, medical databases hold enormous quantities of data about their patients. But this medical data may contain noise and a lot of useless information, which may mislead the expert making a medical diagnosis. Data mining is a technique that is very effective in medical applications for identifying patterns and extracting useful information from databases. This paper proposes a data mining approach using an ensemble blending method to tackle a diabetes prediction problem on the Pima Indian Diabetes Dataset. We propose a blending ensemble classifier using a combination of Decision Tree and Logistic Regression as base classifiers and Support Vector Machine as the top blender classifier. Our approach reached an accuracy of 81% and an F1-score of 0.81, which is higher than that of the base classifiers used without combination. Index Terms: diabetes, ensemble, data mining
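
The blending idea can be illustrated with a toy sketch: base models are fitted on one part of the training data, their predictions on a held-out part become the inputs of a second-stage "blender", and new samples pass through both stages. To stay dependency-free, the sketch replaces the paper's Decision Tree, Logistic Regression and Support Vector Machine with stand-ins (one-feature threshold stumps as base models and a lookup table as blender). All names and data values are hypothetical.

```python
from collections import Counter, defaultdict

def fit_stump(rows, labels, feature):
    """Base model stand-in: predict 1 when one feature exceeds a learned threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({row[feature] for row in rows}):
        correct = sum((row[feature] > threshold) == bool(y) for row, y in zip(rows, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda row: int(row[feature] > best_threshold)

def fit_blender(meta_rows, labels):
    """Blender stand-in: majority label per pattern of base predictions."""
    table = defaultdict(Counter)
    for meta, y in zip(meta_rows, labels):
        table[tuple(meta)][y] += 1
    def predict(meta):
        key = tuple(meta)
        if key in table:
            return table[key].most_common(1)[0][0]
        return int(sum(meta) * 2 >= len(meta))  # unseen pattern: vote of base models
    return predict

def fit_blending(rows, labels, holdout_fraction=0.5):
    """Fit base models on the first part, the blender on the held-out part."""
    split = int(len(rows) * (1 - holdout_fraction))
    base_models = [fit_stump(rows[:split], labels[:split], f) for f in range(len(rows[0]))]
    meta_rows = [[m(row) for m in base_models] for row in rows[split:]]
    blender = fit_blender(meta_rows, labels[split:])
    return lambda row: blender([m(row) for m in base_models])

# Hypothetical (glucose, BMI) samples with labels 1 = diabetic, 0 = not.
rows = [(90, 22), (150, 35), (100, 24), (160, 31), (95, 33), (155, 23), (85, 21), (170, 36)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
model = fit_blending(rows, labels)
print(model((165, 20)), model((92, 34)))
```

In this toy data the held-out samples show the blender that the first stump is the more reliable one, so the blender follows it when the two base models disagree.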
