Intelligent Prediction Mathematical Model of Industrial Financial Fraud Based on Data Mining

The essence of enterprise financial modeling is to use mathematical models to classify and organize enterprise information along the main line of value creation and, on this basis, to analyze, predict, and evaluate the enterprise's financial situation. A reasonable financial model is also an effective means of reducing financial fraud. In this paper, a financial fraud identification model is constructed from empirical data. During model construction, the primary feature set is selected according to financial fraud motivation theory; the original feature set is then obtained by applying the Mann–Whitney test to the primary feature set; and the final fraud identification feature set is selected from the original feature set using the Relief and Boruta algorithms. Finally, based on the final feature set, algorithms such as decision tree, logistic regression, support vector machine, and random forest are used to identify financial fraud. The experimental results show that the feature set constructed by the Relief algorithm, combined with the random forest model, gives the best recognition performance, with a G-mean of 75.86% and an F-value of 78.33%.
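
The two evaluation indexes quoted above can be illustrated with a minimal sketch. It assumes the common definitions (G-mean as the geometric mean of sensitivity and specificity, F-value as the F1 score on the fraud class); the abstract does not state which variants the authors used, and the confusion-matrix counts below are hypothetical, not taken from the paper.

```python
import math


def g_mean_and_f(tp, fn, fp, tn):
    """Return (G-mean, F1) for a binary confusion matrix.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one sample is predicted positive.
    """
    recall = tp / (tp + fn)        # sensitivity on the fraud class
    specificity = tn / (tn + fp)   # recall on the non-fraud class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall)
    return g_mean, f1


# Hypothetical counts for illustration only.
g, f = g_mean_and_f(tp=30, fn=20, fp=5, tn=45)
print(f"G-mean = {g:.4f}, F1 = {f:.4f}")
```

Both indexes are less sensitive to class imbalance than plain accuracy, which is presumably why they are reported for a fraud data set where fraudulent firms are the minority.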

A Machine Learning-based System for Financial Fraud Detection

10.5753/eniac.2021.18250, 2021

Author(s): João Paulo A. Andrade, Leonardo S. Paulucio, Thiago M. Paixão, Rodrigo F. Berriel, Teresa Cristina Janes Carneiro, ...

Keyword(s): Neural Network, Machine Learning, Random Forest, Nearest Neighbors, Financial Data, Support Vector, Financial Fraud, K Nearest Neighbors, Governmental Agencies, A Company

Companies created for money laundering or as a means of tax evasion are harmful to the country's economy and society. Governmental agencies usually tackle this problem by having officials pore over companies' financial data and single out those that exhibit fraudulent behavior. Such work tends to be slow-paced and tedious. This paper proposes a machine learning-based system capable of classifying whether a company is likely to be involved in fraud or not. Based on financial and tax data from various companies, four different classifiers – k-Nearest Neighbors, Random Forest, Support Vector Machine (SVM), and a Neural Network – were trained and then used to indicate fraud. The best-performing model, the Random Forest, achieved a macro-averaged F1-score of 92.98%.
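
As a hedged illustration of the reported metric, the sketch below computes a macro-averaged F1-score: the F1 of each class is computed separately and the results are averaged with equal weight. The label names and the toy predictions are assumptions made for the example and do not come from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only.
y_true = ["fraud", "fraud", "ok", "ok", "ok", "ok"]
y_pred = ["fraud", "ok", "ok", "ok", "ok", "fraud"]
print(macro_f1(y_true, y_pred))
```

Because each class counts equally, a classifier that ignores the rarer class is penalized even when overall accuracy looks high.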

Investigating the use of random forest, gradient boosting machine, support vector machine and their ensemble applied to fault detection

10.26678/abcm.cobem2017.cob17-1600, 2017

Author(s): Luis Felipe Nogoseke, Gabriel Herman Bernardim Andrade, Marco Boaretto, Leandro Coelho

Keyword(s): Support Vector Machine, Random Forest, Fault Detection, Gradient Boosting, Support Vector, Gradient Boosting Machine

Extraction of Arecanut Planting Distribution Based on the Feature Space Optimization of PlanetScope Imagery

Agriculture, 10.3390/agriculture11040371, 2021, Vol 11 (4), pp. 371

Author(s): Yu Jin, Jiawei Guo, Huichun Ye, Jinling Zhao, Wenjiang Huang, ...

Keyword(s): Random Forest, Satellite Imagery, Feature Space, Kappa Coefficient, Classification Model, Support Vector, Textural Feature, Monitoring Accuracy, Areca Catechu, High Level

The remote sensing extraction of large areas of arecanut (Areca catechu L.) planting plays an important role in investigating the distribution of arecanut planting area and the subsequent adjustment and optimization of regional planting structures. Satellite imagery has previously been used to investigate and monitor the agricultural and forestry vegetation in Hainan. However, the monitoring accuracy is affected by the cloudy and rainy climate of this region, as well as the high level of land fragmentation. In this paper, we used PlanetScope imagery at a 3 m spatial resolution over the Hainan arecanut planting area to investigate the high-precision extraction of the arecanut planting distribution based on feature space optimization. First, spectral and textural feature variables were selected to form the initial feature space, followed by the implementation of the random forest algorithm to optimize the feature space. Arecanut planting area extraction models based on the support vector machine (SVM), BP neural network (BPNN), and random forest (RF) classification algorithms were then constructed. The overall classification accuracies of the SVM, BPNN, and RF models optimized by the RF features were determined as 74.82%, 83.67%, and 88.30%, with Kappa coefficients of 0.680, 0.795, and 0.853, respectively. The RF model with optimized features exhibited the highest overall classification accuracy and kappa coefficient. The overall accuracy of the SVM, BPNN, and RF models following feature optimization was improved by 3.90%, 7.77%, and 7.45%, respectively, compared with the corresponding unoptimized classification model. The kappa coefficient also improved. The results demonstrate the ability of PlanetScope satellite imagery to extract the planting distribution of arecanut. Furthermore, the RF is proven to effectively optimize the initial feature space, composed of spectral and textural feature variables, further improving the extraction accuracy of the arecanut planting distribution. This work can act as a theoretical and technical reference for the agricultural and forestry industries.
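
The two accuracy measures reported above, overall accuracy and the Kappa coefficient, can be sketched from a confusion matrix as follows. This is a minimal illustration assuming the standard Cohen's kappa definition; the 2x2 matrix is hypothetical and is not the study's data.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are reference classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts: arecanut vs. other land cover.
confusion = [[45, 5],
             [15, 35]]
acc, kappa = overall_accuracy_and_kappa(confusion)
print(f"overall accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

Kappa discounts the agreement that would occur by chance, so it is typically lower than overall accuracy, as in the figures quoted in the abstract.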

The transferability of random forest and support vector machine for estimating daily global solar radiation using sunshine duration over different climate zones

Theoretical and Applied Climatology, 10.1007/s00704-021-03726-6, 2021

Author(s): Wei Wu, Mao-Fen Li, Xia Xu, Xiao-Ping Tang, Chao Yang, ...

Keyword(s): Support Vector Machine, Random Forest, Solar Radiation, Sunshine Duration, Global Solar Radiation, Support Vector, Climate Zones

Machine Learning-Based Prediction of Air Quality

Applied Sciences, 10.3390/app10249151, 2020, Vol 10 (24), pp. 9151

Author(s): Yun-Chia Liang, Yona Maimury, Angela Hsiang-Ling Chen, Josue Rodolfo Cuevas Juarez

Keyword(s): Machine Learning, Air Quality, Random Forest, Prediction Models, Superior Performance, Support Vector, Economic Activities, Adaptive Boosting, Series Of Experiments, Artificial Neural Network Ann

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments using datasets for three different regions, aimed at obtaining the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found that the stacking ensemble delivers consistently superior performance for R² and RMSE, while AdaBoost provides the best results for MAE.
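
The three regression metrics named above can be sketched as follows, assuming their usual definitions (coefficient of determination, root mean squared error, mean absolute error). The AQI values are invented for the example and have no connection to the EPA dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAE) for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                       # assumes y_true is not constant
    return r2, rmse, mae


# Hypothetical AQI observations and predictions.
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 79.0]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

RMSE squares the errors and so weights large misses more heavily than MAE does, which is one reason different models can come out on top for different metrics.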

BioSignal modelling for prediction of cardiac diseases using intra group selection method

Intelligent Decision Technologies, 10.3233/idt-200058, 2021, Vol 15 (1), pp. 151-160

Author(s): Hemant P. Kasturiwale, Sujata N. Kale

Keyword(s): Machine Learning, Nervous System, Random Forest, Normal Sinus Rhythm, Heart Defects, Support Vector, Autonomous Nervous System, Cardiac Diseases, Vast Number, Proposed Model

The Autonomous Nervous System (ANS) controls the nervous system, and Heart Rate Variability (HRV) can be used as a tool to diagnose heart defects. HRV can be characterized by linear and nonlinear indices, which are mostly used to measure the efficiency of the model. For the prediction of cardiac diseases, the feature selection and extraction of the machine learning model are effective. The models available to date use HRV indices to predict cardiac diseases accurately, but they shed little light on the specifics of the indices, the selection process, or the stability of the model. The proposed model is developed considering all facets: electrocardiogram (ECG) amplitude, frequency components, sampling frequency, extraction methods, and acquisition techniques. The machine learning-based model and its performance are tested using the standard BioSignal method, both on available data and on data obtained by the author. This is a unique model developed by considering a vast number of mixture sets and more than four complex cardiac classes. The statistical analysis is performed on a variety of databases, such as MIT/BIH Normal Sinus Rhythm (NSR), MIT/BIH Arrhythmia (AR), MIT/BIH Atrial Fibrillation (AF), and Peripheral Pulse Analyser, using feature compatibility techniques. The classifiers are trained for prediction with approximately 40,000 sets of parameters. The proposed model reaches an average accuracy of 97.87 percent and is sensitive and precise. The best features are chosen from the different HRV features that will be used for classification. The present model was checked under all possible subject scenarios, such as the raw database and the non-ECG signal. In this sense, robustness is defined not only by the specificity parameter, but also by other measured output parameters. Support Vector Machine (SVM), K-nearest Neighbour (KNN), and Ensemble AdaBoost (EAB), along with Random Forest (RF), are tested in a 5% higher precision band and a lower band configuration. The Random Forest produced better results, and its robustness has been established.
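
To make the notion of a linear HRV index concrete, the sketch below computes two widely used time-domain indices, SDNN and RMSSD, from a series of RR intervals. The abstract does not say which indices the authors selected, so the choice of these two is an assumption, and the RR values are invented.

```python
import math


def sdnn(rr_intervals):
    """Sample standard deviation of the RR intervals (same unit as input)."""
    mean = sum(rr_intervals) / len(rr_intervals)
    variance = sum((x - mean) ** 2 for x in rr_intervals) / (len(rr_intervals) - 1)
    return math.sqrt(variance)


def rmssd(rr_intervals):
    """Root mean square of successive RR-interval differences."""
    diffs = [b - a for a, b in zip(rr_intervals, rr_intervals[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR intervals in milliseconds.
rr_ms = [800, 810, 790, 805, 795]
print(f"SDNN = {sdnn(rr_ms):.2f} ms, RMSSD = {rmssd(rr_ms):.2f} ms")
```

Indices of this kind would form part of the feature vector passed to classifiers such as SVM, KNN, or Random Forest.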

A Methodology Based on FT-IR Data Combined with Random Forest Model to Generate Spectralprints for the Characterization of High-Quality Vinegars

Foods, 10.3390/foods10061411, 2021, Vol 10 (6), pp. 1411

Author(s): José Luis P. Calle, Marta Ferreiro-González, Ana Ruiz-Rodríguez, Gerardo F. Barbero, José Á. Álvarez, ...

Keyword(s): Random Forest, Raw Materials, Principal Component, Hierarchical Cluster, Raw Material, Support Vector, Protected Designation Of Origin, Ft Ir

Sherry wine vinegar is a Spanish gourmet product under Protected Designation of Origin (PDO). Before a vinegar can be labeled as Sherry vinegar, the product must meet certain requirements as established by its PDO, which, in this case, means that it has been produced following the traditional solera and criadera ageing system. The quality of the vinegar is determined by many factors such as the raw material, the acetification process or the aging system. For this reason, mainly producers, but also consumers, would benefit from the employment of effective analytical tools that allow precisely determining the origin and quality of vinegar. In the present study, a total of 48 Sherry vinegar samples manufactured from three different starting wines (Palomino Fino, Moscatel, and Pedro Ximénez wine) were analyzed by Fourier-transform infrared (FT-IR) spectroscopy. The spectroscopic data were combined with unsupervised exploratory techniques such as hierarchical cluster analysis (HCA) and principal component analysis (PCA), as well as other nonparametric supervised techniques, namely, support vector machine (SVM) and random forest (RF), for the characterization of the samples. The HCA and PCA results present a clear grouping trend of the vinegar samples according to their raw materials. SVM in combination with leave-one-out cross-validation (LOOCV) successfully classified 100% of the samples, according to the type of wine used for their production. The RF method allowed selecting the most important variables to develop the characteristic fingerprint (“spectralprint”) of the vinegar samples according to their starting wine. Furthermore, the RF model reached 100% accuracy for both LOOCV and out-of-bag (OOB) sets.
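
Leave-one-out cross-validation, as used above, holds out each sample in turn, trains on the rest, and scores the held-out prediction. The sketch below shows the procedure with a 1-nearest-neighbour classifier standing in for the SVM and RF models of the study (a deliberate simplification to keep the example dependency-free); the two-feature "spectra" and the class labels are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training sample closest to `query` (squared Euclidean distance)."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))
    return train_y[min(range(len(train_x)), key=sq_dist)]


def loocv_accuracy(xs, ys):
    """Fraction of samples classified correctly when each is held out once."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # every sample except the i-th
        train_y = ys[:i] + ys[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical two-feature measurements for two starting wines.
spectra = [[0.10, 0.20], [0.12, 0.19], [0.11, 0.22],
           [0.80, 0.75], [0.82, 0.78], [0.79, 0.77]]
wines = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(spectra, wines)
print(accuracy)
```

With only 48 samples, as in the study, LOOCV uses almost all of the data for training in every fold, which is why it is a common choice for small spectroscopic data sets.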
