Machine learning models for bank reviews classification

2021 ◽  
pp. 1-14
Author(s):  
Natalya Dmitriyevna Badanina ◽  
Vladimir Anatolievich Sudakov

Using a corpus of reviews of banking products and services, the paper analyzes how to build different text classification models. It explores different approaches to processing unstructured textual information. Based on the selected approaches, a corpus of reviews of banking products and services collected during the COVID-19 pandemic is analyzed. An automatic parser of Internet resources was developed to obtain the required training sample. Software was developed that implements basic methods for constructing classification models. These models can be used to create a system for monitoring people’s attitudes to banking processes.

2019 ◽  
Vol 14 (2) ◽  
pp. 97-106
Author(s):  
Ning Yan ◽  
Oliver Tat-Sheung Au

Purpose: The purpose of this paper is to analyze the correlation between students’ online learning behavior features and course grade, and to attempt to build an effective prediction model based on limited data. Design/methodology/approach: The prediction label in this paper is the course grade of students, and the available input features (called eigenvalues in the paper) are student age, student gender, connection time, hits count and days of access. The machine learning model used is a classical three-layer feedforward neural network trained with the scaled conjugate gradient algorithm. Pearson correlation analysis is used to find the relationships between course grade and the student features. Findings: Days of access has the highest correlation with course grade, followed by hits count; connection time is less relevant to students’ course grade. Student age and gender have the lowest correlation with course grade. Binary classification models have much higher prediction accuracy than multi-class classification models. Data normalization and data discretization can effectively improve the prediction accuracy of machine learning models, such as the ANN model in this paper. Originality/value: This paper may help teachers use online learning behavior data to identify students with learning difficulties in advance and give timely help. It shows that acceptable prediction models based on machine learning can be built using a small and limited data set. However, introducing external data into machine learning models to improve their prediction accuracy remains a valuable and hard problem.
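The Pearson correlation analysis and data normalization mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `pearson` and `min_max_normalize` and the five data points are hypothetical, chosen only to show the calculation, and min-max scaling is one assumed form of normalization (the abstract does not say which was used).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def min_max_normalize(xs):
    """Rescale values linearly to [0, 1] (an assumed normalization scheme)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical data for five students, for illustration only.
days_of_access = [3, 10, 15, 22, 30]
course_grade = [52, 60, 71, 80, 88]

r = pearson(days_of_access, course_grade)
print(f"correlation between days of access and grade: {r:.3f}")
```

Pearson correlation is unchanged by min-max scaling, so normalization changes what the neural network sees as input but not the correlation ranking of the features.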


SOIL ◽  
2020 ◽  
Vol 6 (2) ◽  
pp. 565-578
Author(s):  
Wartini Ng ◽  
Budiman Minasny ◽  
Wanderson de Sousa Mendes ◽  
José Alexandre Melo Demattê

Abstract. The number of samples used in the calibration data set affects the quality of the predictive models generated using visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy for soil attributes. Recently, the convolutional neural network (CNN) has been regarded as a highly accurate model for predicting soil properties on a large database. However, it has not yet been ascertained how large the sample size should be for a CNN model to be effective. This paper investigates the effect of the training sample size on the accuracy of deep learning and machine learning models. It aims at providing an estimate of how many calibration samples are needed to improve the model performance of soil properties predictions with CNN as compared to conventional machine learning models. In addition, this paper also looks at a way to interpret the CNN models, which are commonly labelled as a black box. It is hypothesised that the performance of machine learning models will increase with an increasing number of training samples, but it will plateau when it reaches a certain number, while the performance of CNN will keep improving. The performances of two machine learning models (partial least squares regression – PLSR; Cubist) are compared against the CNN model. A VIS–NIR–SWIR spectra library from Brazil, containing 4251 unique sites with averages of two to three samples per depth (a total of 12 044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration data set were then created to represent smaller calibration data sets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650. All three models (PLSR, Cubist and CNN) were generated for each sample size of the unique sites for the prediction of five different soil properties, i.e. cation exchange capacity, organic carbon, sand, silt and clay content. These calibration subset sampling processes and modelling were repeated 10 times to provide a better representation of the model performances. Learning curves showed that the accuracy increased with an increasing number of training samples. At a lower number of samples (< 1000), PLSR and Cubist performed better than CNN. CNN outperformed the PLSR and Cubist models at sample sizes of 1500 and 1800, respectively. It can be recommended that deep learning is most efficient for spectra modelling for sample sizes above 2000. The accuracy of the PLSR and Cubist models seems to reach a plateau above sample sizes of 4200 and 5000, respectively, while the accuracy of CNN has not plateaued. A sensitivity analysis of the CNN model demonstrated its ability to determine important wavelength regions that affected the predictions of various soil attributes.
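The relation between unique sites and sample sizes quoted in this abstract can be checked with a short piece of arithmetic. The sketch below uses only the figures given above; the variable names are hypothetical, and scaling by the library-wide average of samples per site is an assumption about how the approximate sample sizes arise, not a method stated in the abstract.

```python
# Figures taken from the abstract.
total_sites = 4251
total_samples = 12044
calibration_sites = 3188
validation_sites = 1063
subset_sites = [125, 300, 500, 1000, 1500, 2000, 2500, 2700]

# The calibration and validation sets together cover every unique site.
assert calibration_sites + validation_sites == total_sites

# Assumption: each subset has the library-wide average number of samples per site.
samples_per_site = total_samples / total_sites

# Estimated number of samples for each calibration subset.
estimated_samples = {
    sites: round(sites * samples_per_site) for sites in subset_sites
}

print(f"average samples per site: {samples_per_site:.2f}")
for sites, samples in estimated_samples.items():
    print(f"{sites:>5} sites -> about {samples} samples")
```

The estimates land close to the rounded sample sizes reported in the abstract, which is consistent with the subsets being drawn by site rather than by individual sample.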


2020 ◽  
Author(s):  
Aviel J. Stein ◽  
Janith Weerasinghe ◽  
Spiros Mancoridis ◽  
Rachel Greenstadt

News articles are important for providing timely, historic information. However, the Internet is replete with text that may contain irrelevant or unhelpful information, so means of processing it and distilling its content are important and useful to human readers as well as to information extraction tools. Some common questions we may want to answer are “what is this article about?” and “who wrote it?”. In this work we compare machine learning models on two common NLP tasks, topic and authorship attribution, using the 2017 Vox Media dataset. Additionally, we use the models to classify on a subsection, about 20%, of the original text, which proves better for classification than the provided blurbs. Because of the large number of topics, we take into account topic overlap and address it via top-n accuracy and hierarchical groupings of topics. We also consider edge cases in authorship by classifying on inter-topic and intra-topic author distributions. Our results show that both topics and authors are readily identifiable, and that neural networks consistently perform best, ahead of support vector machines, random forests, and naive Bayes classifiers, although the latter methods perform acceptably.
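Top-n accuracy, which this abstract uses to handle overlapping topics, counts a prediction as correct when the true label is among the n highest-scoring classes. A minimal sketch follows; the function name `top_n_accuracy`, the topic labels, and the scores are hypothetical and are not taken from the Vox Media dataset or the authors' code.

```python
def top_n_accuracy(scores, true_labels, n):
    """Fraction of examples whose true label is among the n top-scoring classes.

    scores: list of dicts mapping class label -> model score (higher is better).
    true_labels: list of the correct label for each example.
    """
    if len(scores) != len(true_labels):
        raise ValueError("scores and true_labels must have the same length")
    if not scores:
        raise ValueError("need at least one example")
    hits = 0
    for class_scores, truth in zip(scores, true_labels):
        # Rank labels from highest to lowest score and keep the first n.
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(scores)


# Hypothetical classifier outputs for three articles.
scores = [
    {"politics": 0.5, "policy": 0.4, "sports": 0.1},
    {"politics": 0.2, "policy": 0.5, "sports": 0.3},
    {"politics": 0.6, "policy": 0.3, "sports": 0.1},
]
true_labels = ["policy", "policy", "sports"]

for n in (1, 2, 3):
    print(f"top-{n} accuracy: {top_n_accuracy(scores, true_labels, n):.2f}")
```

In the first article the model ranks a closely related topic first, so it is a miss at n = 1 but a hit at n = 2, which is the kind of overlap the metric is meant to forgive.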


Author(s):  
Ahmad Freij ◽  

In this paper, we propose two models for marketing classification, Support Vector Machine (SVM) and linear regression, two of the most popular and useful classification models. We show how these two models are applied in a case study of a bank marketing campaign, using a dataset from that campaign. The RapidMiner software was used to apply the machine learning classification models.


2021 ◽  
Vol 2096 (1) ◽  
pp. 012174
Author(s):  
G D Asyaev

Abstract. The paper presents an approach that increases the training sample and reduces class imbalance for traffic classification problems. The basic principles and architecture of generative adversarial networks are considered. The mathematical model of network traffic classification is described. The training sample used to solve the problem is analyzed. Data preprocessing is carried out and justified. An architecture of the generative adversarial network is constructed and an algorithm for generating new features is developed. Machine learning models for the traffic classification problem were considered and built: logistic regression, k-nearest neighbors, decision tree and random forest. A comparative analysis of the results of the machine learning models with and without the generation of new features is conducted. The obtained results can be applied both in network traffic classification tasks and in general cases of multiclass classification and exclusion of unbalanced features.


2020 ◽  
Vol 2 (1) ◽  
pp. 3-6
Author(s):  
Eric Holloway

Imagination Sampling is the usage of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling for obtaining multibox models. Here, the possibility of importing such models as the starting point for further automatic enhancement is explored.

