Algorithmic fairness in credit scoring

Abstract: The use of machine learning as an input into decision-making is on the rise, owing to its ability to uncover hidden patterns in large data and improve prediction accuracy. Questions have been raised, however, about the potential distributional impacts of these technologies, with one concern being that they may perpetuate or even amplify human biases from the past. Exploiting detailed credit file data for 800,000 UK borrowers, we simulate a switch from a traditional (logit) credit scoring model to ensemble machine-learning methods. We confirm that machine-learning models are more accurate overall. We also find that they do as well as the simpler traditional model on relevant fairness criteria, where these criteria pertain to overall accuracy and error rates for population subgroups defined along protected or sensitive lines (gender, race, health status, and deprivation). We do observe some differences in the way credit-scoring models perform for different subgroups, but these manifest under a traditional modelling approach and switching to machine learning neither exacerbates nor eliminates these issues. The paper discusses some of the mechanical and data factors that may contribute to statistical fairness issues in the context of credit scoring.
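
The fairness criteria mentioned in the abstract (overall accuracy and error rates per subgroup) can be illustrated with a minimal sketch. The function name `group_error_rates`, the 0/1 label encoding (1 = default) and the toy data are assumptions made for illustration only; they are not taken from the paper, which may define its criteria differently.

```python
from collections import defaultdict


def group_error_rates(y_true, y_pred, groups):
    """Return accuracy, false-positive rate and false-negative rate per subgroup.

    y_true, y_pred: sequences of 0/1 labels (assumed encoding: 1 = default).
    groups: sequence of subgroup labels aligned with the predictions.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1 and pred == 1:
            key = "tp"
        elif truth == 0 and pred == 1:
            key = "fp"
        elif truth == 0 and pred == 0:
            key = "tn"
        else:
            key = "fn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        negatives = c["fp"] + c["tn"]
        positives = c["tp"] + c["fn"]
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            # A rate is undefined (None) when the subgroup has no such cases.
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return rates


# Toy data, invented for illustration (two subgroups "a" and "b").
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_error_rates(y_true, y_pred, groups)
for group, r in sorted(rates.items()):
    print(group, r)
```

Comparing these per-group figures between two models (for example, a logit baseline and an ensemble) is one simple way to check whether a model switch widens or narrows subgroup gaps.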

Survey on Feature Transformation Techniques for Data Streams

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

DOI: 10.24963/ijcai.2020/668

2020

Author(s): Maroua Bahri, Albert Bifet, Silviu Maniu, Heitor Murilo Gomes

Keyword(s): Machine Learning, Data Streams, Large Data, High Dimensional, Feature Transformation, Transformation Techniques, Computational Costs, The Past, Fundamental Challenge, Mining Algorithms

Mining high-dimensional data streams poses a fundamental challenge to machine learning, as the presence of large numbers of attributes can markedly degrade the performance of any mining task. In the past several years, dimension reduction (DR) approaches have been successfully applied for different purposes (e.g., visualization). Because of their high computational costs and numerous passes over large data, these approaches become a hindrance when processing infinite data streams that are potentially high-dimensional. High dimensionality increases the resource usage of algorithms, which can suffer from the curse of dimensionality. To cope with these issues, some techniques for incremental DR have been proposed. In this paper, we provide a survey of reduction approaches designed to handle data streams and highlight the key benefits of using these approaches for stream mining algorithms.
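
To make the single-pass constraint concrete, the sketch below reduces each incoming instance with a Gaussian random projection. Random projection is used here only as an illustrative, data-independent technique; the abstract does not name the specific methods the survey covers, and the class name `StreamingRandomProjection` and its parameters are hypothetical.

```python
import random


class StreamingRandomProjection:
    """Project each instance from d_in to d_out dimensions, one at a time.

    The projection matrix is drawn once and does not depend on the data,
    so every instance is handled in a single pass and the stream itself
    never needs to be stored.
    """

    def __init__(self, d_in, d_out, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / d_out ** 0.5  # keeps expected squared norms comparable
        self.d_in = d_in
        self.matrix = [
            [rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)
        ]

    def transform_one(self, x):
        """Return the low-dimensional representation of one instance."""
        if len(x) != self.d_in:
            raise ValueError("expected %d features, got %d" % (self.d_in, len(x)))
        return [sum(w * v for w, v in zip(row, x)) for row in self.matrix]


# Toy stream of three 6-dimensional instances, reduced to 2 dimensions.
projector = StreamingRandomProjection(d_in=6, d_out=2, seed=42)
stream = ([float(i + j) for j in range(6)] for i in range(3))
reduced = [projector.transform_one(x) for x in stream]
print(reduced)
```

Memory use is fixed at `d_in * d_out` weights regardless of how many instances arrive, which is the property that matters for unbounded streams.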

Credit Scoring Using Ensemble Machine Learning

2009 Ninth International Conference on Hybrid Intelligent Systems

DOI: 10.1109/his.2009.264

2009

Cited By ~ 2

Author(s): Ping Yao

Keyword(s): Machine Learning, Credit Scoring, Ensemble Machine Learning

A Hybrid Meta-Learner Technique for Credit Scoring of Banks’ Customers

Engineering, Technology & Applied Science Research

DOI: 10.48084/etasr.1361

2017, Vol 7 (5), pp. 2073-2082

Cited By ~ 1

Author(s): A. G. Armaki, M. F. Fallah, M. Alborzi, A. Mohammadzadeh

Keyword(s): Machine Learning, Hybrid Model, Credit Scoring, Clustering Algorithms, Real Life, Ensemble Methods, Scoring Systems, Error Rates, Machine Learning Algorithms, Machine Learning Techniques

Financial institutions are exposed to credit risk through the issuance of consumer loans, so developing reliable credit scoring systems is crucial for them. Because machine learning techniques have demonstrated their applicability and merit, they have been used extensively in the credit scoring literature. Recent studies on hybrid models that merge various machine learning algorithms have reported compelling results. There are two types of hybridization methods, namely traditional and ensemble methods. This study combines both and proposes a hybrid meta-learner model. The structure of the model is based on the traditional hybrid model of ‘classification + clustering’, in which the stacking ensemble method is employed in the classification part. Moreover, the paper compares several versions of the proposed hybrid model using various combinations of classification and clustering algorithms, which helps to identify the hybrid model that achieves the best performance for credit scoring purposes. Using four real-life credit datasets, the experimental results show that the (KNN-NN-SVMPSO)-(DL)-(DBSCAN) model delivers the highest prediction accuracy and the lowest error rates.

Smart literature review: a practical topic modelling approach to exploratory literature review

Journal of Big Data

DOI: 10.1186/s40537-019-0255-7

2019, Vol 6 (1)

Cited By ~ 7

Author(s): Claus Boye Asmussen, Charles Møller

Keyword(s): Machine Learning, Literature Review, Latent Dirichlet Allocation, Topic Model, Topic Modelling, Learning Methods, The Past, Machine Learning Methods, Literature Reviews, Modelling Approach

Abstract: Manual exploratory literature reviews should be a thing of the past, as technology and the development of machine learning methods have matured. The learning curve for using machine learning methods is rapidly declining, enabling new possibilities for all researchers. A framework is presented on how to use topic modelling on a large collection of papers for an exploratory literature review and how that can be used for a full literature review. The aim of the paper is to enable the use of topic modelling for researchers by presenting a step-by-step framework on a case and sharing a code template. The framework consists of three steps: pre-processing, topic modelling, and post-processing, where the topic model Latent Dirichlet Allocation is used. The framework enables huge numbers of papers to be reviewed in a transparent, reliable, faster, and reproducible way.

Using Machine Learning Algorithms to create a Credit Scoring Model for mobile money users

2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI)

DOI: 10.1109/saci51354.2021.9465561

2021

Author(s): Monica Charles Mhina, Fabrice Labeau

Keyword(s): Machine Learning, Credit Scoring, Learning Algorithms, Machine Learning Algorithms, Mobile Money, Scoring Model, Credit Scoring Model

Hierarchical ensemble learning method in diversified dataset analysis

Journal of Physics: Conference Series

DOI: 10.1088/1742-6596/2078/1/012027

2021, Vol 2078 (1), pp. 012027

Author(s): Ze yuan Liu, Xin long Li

Keyword(s): Machine Learning, Random Forest, Classification Accuracy, Large Data, Training Dataset, Categorical Variables, Ensemble Machine Learning, Dataset Analysis, Stage 1, Hierarchical Classifier

Abstract: Remarkable advances in ensemble machine learning methods, such as random forest algorithms, have led to significant progress in the analysis of large data. However, these algorithms use only the existing features during learning, which places an upper limit on accuracy no matter how good the algorithms are. Moreover, classification accuracy is especially low when the proportion of one type of observation in the training dataset is much lower than that of the other types. The aim of the present study is to design a hierarchical classifier that tries to extract new features using ensemble machine learning regressors and statistical methods within the overall machine learning process. In stage 1, all the categorical variables are characterized by a random forest algorithm to create a new variable through regression analysis, while the remaining numerical variables serve as the sample for a factor analysis (FA) process that calculates the factor values of each observation. Then, in stage 2, all the features are learned by a random forest classifier. Diversified datasets consisting of categorical and numerical variables are used with the method. The experimental results show that classification accuracy increased by 8.61%. The method also significantly improves the classification accuracy of observations with a low proportion in the training dataset.

In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening

DOI: 10.2174/1386207322666181220130232

2019, Vol 21 (9), pp. 662-669

Cited By ~ 1

Author(s): Junnan Zhao, Lu Zhu, Weineng Zhou, Lingfeng Yin, Yuchen Wang, ...

Keyword(s): Machine Learning, Prediction Models, Regression Tree, Large Data, Thrombin Inhibitors, Coagulation Cascade, Gradient Boosting, Support Vector, Data Set, Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, namely multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model was the best of these methods, with R²=0.84, MSE=0.55 for the training set and R²=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
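
The two metrics reported in the results can be computed as in the minimal sketch below. It assumes the coefficient-of-determination definition of R-squared (one minus the ratio of residual to total sum of squares); the abstract does not state which definition the authors used, and the toy values are invented rather than taken from the study.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("MSE:", mse(observed, predicted))
print("R^2:", r_squared(observed, predicted))
```

Reporting both metrics on the training set and on a held-out test set, as the abstract does, helps show whether a model generalizes or merely fits the training data.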

Recent Progress in Machine Learning-based Prediction of Peptide Activity for Drug Discovery

Current Topics in Medicinal Chemistry

DOI: 10.2174/1568026619666190122151634

2019, Vol 19 (1), pp. 4-16

Cited By ~ 6

Author(s): Qihui Wu, Hanzhong Ke, Dongli Li, Qi Wang, Jiansong Fang, ...

Keyword(s): Machine Learning, Drug Discovery, Large Scale, Recent Progress, High Specificity, Learning Approaches, Anticancer Peptides, The Past, Traditional Approaches, Large Scale Screening

Over the past decades, peptides as therapeutic candidates have received increasing attention in drug discovery, especially antimicrobial peptides (AMPs), anticancer peptides (ACPs) and anti-inflammatory peptides (AIPs). Peptides are considered able to regulate various complex diseases that were previously untouchable. In recent years, the critical problem of antimicrobial resistance has driven the pharmaceutical industry to look for new therapeutic agents. Compared to organic small drugs, peptide-based therapy exhibits high specificity and minimal toxicity. Thus, peptides are widely used in the design and discovery of new potent drugs. Currently, large-scale screening of peptide activity with traditional approaches is costly, time-consuming and labor-intensive. Hence, in silico methods, mainly machine learning approaches, have been introduced for their accuracy and effectiveness to predict peptide activity. In this review, we document the recent progress in machine learning-based prediction of peptide activity, which will be of great benefit to the discovery of potentially active AMPs, ACPs and AIPs.

Fintech Credit Scoring Techniques for Evaluating P2P Loan Applications – A Python Machine Learning Ensemble Approach

International Journal of Smart Business and Technology

DOI: 10.21742/ijsbt.2018.6.1.04

2018, Vol 6 (1)

Keyword(s): Machine Learning, Credit Scoring, Ensemble Approach

Race and Gender

The Oxford Handbook of Ethics of AI

DOI: 10.1093/oxfordhb/9780190067397.013.16

2020, pp. 251-269

Cited By ~ 2

Author(s): Timnit Gebru

Keyword(s): Machine Learning, Language Processing, The United States, Error Rates, Political Factors, Recidivism Rates, Race And Gender, Decision Tools, And Gender, Technical Solutions

This chapter discusses the role of race and gender in artificial intelligence (AI). The rapid permeation of AI into society has not been accompanied by a thorough investigation of the sociopolitical issues that cause certain groups of people to be harmed rather than advantaged by it. For instance, recent studies have shown that commercial automated facial analysis systems have much higher error rates for dark-skinned women, while having minimal errors on light-skinned men. Moreover, a 2016 ProPublica investigation uncovered that machine learning–based tools that assess crime recidivism rates in the United States are biased against African Americans. Other studies show that natural language–processing tools trained on news articles exhibit societal biases. While many technical solutions have been proposed to alleviate bias in machine learning systems, a holistic and multifaceted approach must be taken. This includes standardization bodies determining what types of systems can be used in which scenarios, making sure that automated decision tools are created by people from diverse backgrounds, and understanding the historical and political factors that disadvantage certain groups who are subjected to these tools.
