Credit Risk Rating Using State Machines and Machine Learning

Credit risk is the possibility of a loss resulting from a borrower’s failure to repay a loan or meet contractual obligations. With the growing number of customers and the expansion of businesses, it is not possible, or at least not feasible, for banks to assess each customer individually in order to minimize this risk. Machine learning can leverage available user data to model customer behavior and automatically estimate a credit score for each customer. In this research, we propose a novel approach based on state machines that recasts this problem as a classical supervised machine learning task. The proposed state machine converts historical user data into a credit score, which yields a data set for training supervised models. We explored several classification models in our experiments and illustrate the effectiveness of our modeling approach.
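
The labeling step described above can be sketched as follows. The abstract does not give the actual states, events, or scores, so everything named here (the four states, the "paid"/"missed" events, the 0 to 3 score scale) is a hypothetical illustration of the idea, not the authors' state machine.

```python
# Hypothetical transition table: (current state, monthly event) -> next state.
# The real states and rules are not given in the abstract.
TRANSITIONS = {
    ("current", "paid"): "current",
    ("current", "missed"): "late",
    ("late", "paid"): "current",
    ("late", "missed"): "delinquent",
    ("delinquent", "paid"): "late",
    ("delinquent", "missed"): "default",
    ("default", "paid"): "default",      # assumed absorbing state
    ("default", "missed"): "default",
}

# Hypothetical mapping from final state to a credit score label.
SCORE = {"current": 3, "late": 2, "delinquent": 1, "default": 0}


def rate_history(events, start="current"):
    """Run the state machine over a payment history and return a score label."""
    state = start
    for event in events:
        state = TRANSITIONS[(state, event)]
    return SCORE[state]


def build_dataset(customers):
    """Turn {customer_id: (features, events)} into (features, label) pairs."""
    return [(features, rate_history(events)) for features, events in customers.values()]
```

The resulting (features, label) pairs are what a supervised classifier would then be trained on.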

Exploring fake news identification using word and sentence embeddings

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189865 ◽

2021 ◽

pp. 1-8

Author(s):

V.T Priyanga ◽

J.P Sanjanasri ◽

Vijay Krishna Menon ◽

E.A Gopalakrishnan ◽

K.P Soman

Keyword(s):

Machine Learning ◽

Social Media ◽

Network Analysis ◽

Supervised Machine Learning ◽

Breeding Ground ◽

Fake News ◽

Data Set ◽

Highly Correlated ◽

Use Of Social Media ◽

The Liar

The widespread use of social media such as Facebook, Twitter, and WhatsApp has changed the way news is created and published; accessing news has become easy and inexpensive. However, the scale of usage and the inability to moderate content have made social media a breeding ground for the circulation of fake news. Fake news is deliberately created either to increase readership or to disrupt social order for political and commercial benefit. It is of paramount importance to identify and filter out fake news, especially in democratic societies. Most existing methods for detecting fake news involve traditional supervised machine learning, which has been quite ineffective. In this paper, we analyze word embedding features that can tell fake news apart from true news. We use the LIAR and ISOT data sets. We extract highly correlated news items from the entire data set using cosine similarity and other such metrics, in order to distinguish their domains based on central topics. We then employ auto-encoders to detect and differentiate between true and fake news, while also exploring their separability through network analysis.
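
As a rough illustration of the similarity step, the sketch below computes cosine similarity between embedding vectors and keeps the pairs above a threshold. The function names, the dictionary layout, and the 0.8 threshold are assumptions for illustration; the abstract does not state how the embeddings are built or which cutoff is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def correlated_pairs(embeddings, threshold=0.8):
    """Return id pairs whose embeddings have cosine similarity >= threshold.

    `embeddings` maps a (hypothetical) news-item id to its embedding vector.
    """
    ids = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine_similarity(embeddings[a], embeddings[b]) >= threshold
    ]
```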

A New Model Averaging Approach in Predicting Credit Risk Default

Risks ◽

10.3390/risks9060114 ◽

2021 ◽

Vol 9 (6) ◽

pp. 114

Author(s):

Paritosh Navinchandra Jha ◽

Marco Cucculelli

Keyword(s):

Machine Learning ◽

Credit Risk ◽

Model Averaging ◽

Research Direction ◽

Future Research ◽

Classification Problems ◽

Average Technique ◽

Model Average ◽

Novel Approach ◽

Unbalanced Dataset

The paper introduces a novel approach to ensemble modeling in the form of a weighted model averaging technique. The proposed idea is prudent, simple to understand, and easy to implement compared to Bayesian and frequentist approaches. The paper provides both theoretical and empirical contributions for assessing credit risk (probability of default) effectively in a new way, by creating an ensemble model as a weighted linear combination of machine learning models. The idea can be generalized to classification problems in other domains where ensemble-type modeling is of interest, and it is not limited to unbalanced datasets or credit risk assessment. The results suggest better forecasting performance than the single best well-known machine learning model among parametric, non-parametric, and other ensemble models. As a future research direction, the approach can be extended by estimating the weights differently, which may further enhance the performance of the model average.
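
A weighted linear combination of model outputs can be sketched as below. How the paper estimates the weights is not described in the abstract, so the weights are simply taken as given here, and the function name is hypothetical.

```python
def weighted_model_average(probabilities, weights):
    """Combine per-model default probabilities into one ensemble probability per sample.

    probabilities: one list per model, each holding that model's predicted
                   probability of default for every sample.
    weights:       one non-negative weight per model (assumed given; the
                   abstract does not say how they are estimated).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]
```

Dividing by the sum of the weights keeps the combined value inside the range spanned by the individual model probabilities.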

Leveraging Road Characteristics and Contributor Behaviour for Assessing Road Type Quality in OSM

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10070436 ◽

2021 ◽

Vol 10 (7) ◽

pp. 436

Author(s):

Amerah Alghanim ◽

Musfira Jilani ◽

Michela Bertolotto ◽

Gavin McArdle

Keyword(s):

Machine Learning ◽

Spatial Data ◽

Classification Accuracy ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Data Set ◽

Semantic Inference ◽

Road Type ◽

The Impact

Volunteered Geographic Information (VGI) is often collected by non-expert users. This raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures, which compare VGI to authoritative data sources such as National Mapping Agencies, are common, but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures, which compare the data to heuristics or models built from the VGI data, are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality, where they can infer and predict the properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Specifically, using our proposed novel approach we obtained an average classification accuracy of 84.12%. This result outperforms existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is important. To address this issue we have also developed a new measure for this using direct and indirect characteristics of OSM data, such as its edit history, along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy within the machine learning model shows that the trusted data collected with the new approach improves the prediction accuracy of our machine learning technique. Specifically, our results demonstrate that the classification accuracy of our developed model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. Consequently, such results can be used to assess the quality of OSM and suggest improvements to the data set.

Sentiment Analysis on UAV-aided Product Comments Based on Machine Learning: From Sentence to Document Level

10.21203/rs.3.rs-104009/v1 ◽

2020 ◽

Author(s):

JINGYANG CAO ◽

Shirong Yin ◽

Guoxu Zhang

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Accurate Result ◽

Supervised Machine Learning ◽

Hotel Management ◽

Novel Approach ◽

Online Comments ◽

New Perspective ◽

The Relationship ◽

Document Level

This paper presents a novel approach to analyzing the sentiment of product comments from sentence to document level and applies it to customer sentiment analysis on UAV-aided product comments for hotel management. In order to realize efficient sentiment analysis, a cascaded sentence-to-document sentiment classification method is investigated. Initially, a supervised machine learning method is applied to explore the sentiment polarity of the sentence (SPS). Afterward, the contribution of the sentence to document (CSD) is calculated by using various statistical algorithms. Lastly, the sentiment polarity of the document (SPD) is determined by the SPS as well as its contribution. Comparative experiments have been established on the basis of hotel online comments, and the outcomes indicate that the proposed method not only raises the efficiency in attaining a more accurate result but also assists immensely in regard to the B5G wireless communication supported by the UAV. The findings provide a new perspective that sentence position and its sentiment similarity with document (sentiment condition) dramatically disclose the relationship between sentence and document.
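
One simple way to read the last step of the cascade is as a contribution-weighted vote over sentence polarities. The sketch below assumes exactly that; the abstract does not give the actual combination rule or how the CSD values are computed, so the weighted sum, the +1/-1/0 encoding, and the function name are illustrative assumptions.

```python
def document_polarity(sentence_polarities, contributions):
    """Combine sentence polarities (SPS) into a document polarity (SPD).

    sentence_polarities: +1 (positive) or -1 (negative) per sentence.
    contributions:       assumed CSD weight per sentence, same order.
    Returns +1, -1, or 0 when the weighted evidence cancels out.
    """
    if len(sentence_polarities) != len(contributions):
        raise ValueError("one contribution is required per sentence")
    score = sum(p * c for p, c in zip(sentence_polarities, contributions))
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0
```

With this rule a single heavily weighted sentence can outvote several lightly weighted ones, which is the intuition behind weighting sentences by their contribution.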

On the Influence of Contextual Features for the Identification of Complex Words

International Journal of Semantic Computing ◽

10.1142/s1793351x17400207 ◽

2017 ◽

Vol 11 (04) ◽

pp. 497-511

Author(s):

Elnaz Davoodi ◽

Leila Kosseim ◽

Matthew Mongrain

Keyword(s):

Machine Learning ◽

Natural Language ◽

Target Word ◽

Supervised Machine Learning ◽

Learning Models ◽

Data Set ◽

Contextual Features ◽

Complex Words ◽

Machine Learning Models

This paper evaluates the effect of the context of a target word on the identification of complex words in natural language texts. The approach automatically tags words as either complex or not, based on two sets of features: base features that only pertain to the target word, and contextual features that take the context of the target word into account. We experimented with several supervised machine learning models, and trained and tested the approach with the 2016 SemEval Word Complexity Data Set. Results show that when discriminating base features are used, the words around the target word can supplement those features and improve the recognition of complex words.

A hierarchical approach to mood classification in blogs

Natural Language Engineering ◽

10.1017/s1351324911000118 ◽

2011 ◽

Vol 18 (1) ◽

pp. 61-81 ◽

Cited By ~ 11

Author(s):

FAZEL KESHTKAR ◽

DIANA INKPEN

Keyword(s):

Machine Learning ◽

Error Analysis ◽

Learning Approach ◽

Hierarchical Approach ◽

Data Set ◽

Novel Approach ◽

Machine Learning Approach ◽

Sentiment Orientation ◽

Mood Classification

In this article, we explore the task of mood classification for blog postings. We propose a novel approach that uses the hierarchy of possible moods to achieve better results than a standard machine learning approach. We also show that using sentiment orientation features improves the performance of classification. We used the Livejournal blog corpus as a data set to train and evaluate our method. We present extensive error analysis and discuss the difficulty of the task.

Pixel-based machine learning and image reconstitution for dot-ELISA pathogen serodiagnosis

10.1101/2020.03.18.997320 ◽

2020 ◽

Author(s):

Cleo Anastassopoulou ◽

Athanasios Tsakris ◽

George P. Patrinos ◽

Yiannis Manoussopoulos

Keyword(s):

Machine Learning ◽

False Negative ◽

Color Discrimination ◽

Supervised Machine Learning ◽

Multivariate Logistic Regression Model ◽

Enzyme Linked Immunosorbent Assay ◽

Novel Approach ◽

Image Pixels ◽

Diagnostic Applications ◽

Dot Elisa

Serological methods serve as a direct or indirect means of pathogen infection diagnosis in plant and animal species, including humans. Dot-ELISA (DE) is an inexpensive and sensitive, solid-state version of the microplate enzyme-linked immunosorbent assay, with a broad range of applications in epidemiology. Yet, its applicability is limited by uncertainties in the qualitative output of the assay due to overlapping dot colorations of positive and negative samples, stemming mainly from the inherent color discrimination thresholds of the human eye. Here, we report a novel approach for unambiguous DE output evaluation by applying machine learning-based pattern recognition of image pixels of the blot using an impartial predictive model rather than human judgment. Supervised machine learning was used to train a classifier algorithm through a built multivariate logistic regression model based on the RGB (“Red”, “Green”, “Blue”) pixel attributes of a scanned DE output of samples of known infection status to a model pathogen (Lettuce big-vein associated virus). Based on the trained and cross-validated algorithm, pixel probabilities of unknown samples could be predicted in scanned DE output images which would then be reconstituted by pixels having probabilities above a cutoff that may be selected at will to yield desirable false positive and false negative rates depending on the question at hand, thus allowing for proper dot classification of positive and negative samples and, hence, accurate diagnosis. Potential improvements and diagnostic applications of the proposed versatile method that translates unique pathogen antigens to the universal basic color language are discussed.
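
The prediction and reconstitution steps can be sketched as follows, assuming a logistic regression model has already been fitted. The coefficients, intercept, cutoff, and the choice of white as the replacement color are placeholders for illustration, not values from the paper.

```python
import math


def pixel_probability(rgb, coefficients, intercept):
    """Logistic-regression probability that a pixel belongs to a positive dot."""
    z = intercept + sum(c * v for c, v in zip(coefficients, rgb))
    # Numerically stable sigmoid: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reconstitute(pixels, coefficients, intercept, cutoff=0.5, background=(255, 255, 255)):
    """Keep pixels whose probability reaches the cutoff; blank out the rest."""
    return [
        p if pixel_probability(p, coefficients, intercept) >= cutoff else background
        for p in pixels
    ]
```

Raising the cutoff trades false positives for false negatives, which is the tuning knob the abstract refers to.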

A Research Travelogue on Classification Algorithms using R Programming

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d9014.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 9155-9158

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Statistical Tests ◽

Learning Task ◽

Data Sets ◽

K Nearest Neighbor ◽

Data Set ◽

Domain Experts ◽

R Programming ◽

Training Examples

Classification is a machine learning task that consists of predicting the class membership of unclassified examples, whose labels are not known, from the properties of examples in a representation learned earlier from training examples whose labels were known. Classification tasks span a wide range of domains and real-world purposes, including medical diagnosis, bioinformatics, financial engineering, and image recognition, among others, where domain experts can use the learned model to support their decisions. All the classification approaches proposed in this paper were evaluated in an appropriate experimental framework in the R programming language. The major emphasis is on the k-nearest neighbor method, support vector machines, and decision trees, applied over a large number of data sets with varied dimensionality and compared against other state-of-the-art methods. In this process, the experimental results obtained were verified by statistical tests, which support the better performance of the methods. In this paper we survey various classification techniques from data mining and compare them using diverse datasets from the University of California, Irvine (UCI) Machine Learning Repository, to obtain accurate calculations on the Iris data set.
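
For readers unfamiliar with the k-nearest neighbor method emphasized above, here is a minimal sketch. It is written in Python rather than the R used in the paper, uses Euclidean distance and majority voting (common defaults, assumed here), and the function name and data layout are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points.

    train: list of (feature_tuple, label) pairs with known labels.
    query: feature tuple of the unclassified example.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```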

Bituminous Mixtures Experimental Data Modeling Using a Hyperparameters-Optimized Machine Learning Approach

Applied Sciences ◽

10.3390/app112411710 ◽

2021 ◽

Vol 11 (24) ◽

pp. 11710

Author(s):

Matteo Miani ◽

Matteo Dunnhofer ◽

Fabio Rondinella ◽

Evangelos Manthos ◽

Jan Valentin ◽

...

Keyword(s):

Machine Learning ◽

Bayesian Optimization ◽

Automatic Identification ◽

Learning Approach ◽

Ann Model ◽

Data Set ◽

Bituminous Mixtures ◽

Novel Approach ◽

The Neural Network ◽

Machine Learning Approach

This study introduces a machine learning approach based on Artificial Neural Networks (ANNs) for the prediction of Marshall test results, stiffness modulus and air voids data of different bituminous mixtures for road pavements. A novel approach for an objective and semi-automatic identification of the optimal ANN structure, defined by the so-called hyperparameters, is introduced and discussed. Mechanical and volumetric data were obtained by conducting laboratory tests on 320 Marshall specimens, and the results were used to train the neural network. The k-fold cross-validation method was used to partition the available data set, to obtain an unbiased evaluation of the model's predictive error. The ANN’s hyperparameters were optimized using Bayesian optimization, which efficiently replaced the more costly trial-and-error procedure and automated hyperparameter tuning. The proposed ANN model is characterized by a Pearson coefficient value of 0.868.
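
The k-fold partitioning mentioned above can be sketched as below. The number of folds and the interleaved, unshuffled assignment of samples to folds are assumptions for illustration; the abstract does not state the value of k or whether the data were shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Samples are assigned to folds in interleaved order without shuffling,
    so every sample appears in exactly one test fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test
```

For example, `list(k_fold_indices(320, 5))` would give five splits of the 320 specimens, each holding out 64 of them for evaluation.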

Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia

10.1101/170670 ◽

2017 ◽

Cited By ~ 4

Author(s):

Daniel R. Schrider ◽

Julien Ayroles ◽

Daniel R. Matute ◽

Andrew D. Kern

Keyword(s):

Machine Learning ◽

Gene Flow ◽

Drosophila Simulans ◽

Supervised Machine Learning ◽

Data Set ◽

Learning Framework ◽

Drosophila Sechellia ◽

Taxonomic Groups ◽

Genomic Regions ◽

New Statistics

Hybridization and gene flow between species appear to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees), capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia.

Author Summary: Understanding the extent to which species or diverged populations hybridize in nature is crucially important if we are to understand the speciation process. Accordingly, numerous research groups have developed methodology for finding the genetic evidence of such introgression. In this report we develop a supervised machine learning approach for uncovering loci which have introgressed across species boundaries. We show that our method, FILET, has greater accuracy and power than competing methods in discovering introgression, and in addition can detect the directionality associated with the gene flow between species. Using whole genome sequences from Drosophila simulans and Drosophila sechellia we show that FILET discovers quite extensive introgression between these species that has occurred mostly from D. simulans to D. sechellia. Our work highlights the complex process of speciation even within a well-studied system and points to the growing importance of supervised machine learning in population genetics.
