Application of selected data mining techniques in unintentional accounting error detection

Research background: Although unintentional accounting errors leading to financial restatements may look like a less serious distortion of publicly available information, it has been shown that the impact of financial restatements on financial markets is similar to that of intentional fraudulent activities. Unintentional accounting errors leading to financial restatements therefore affect the value of company shares in the short run, which negatively impacts all shareholders. Purpose of the article: The aim of this manuscript is to predict unintentional accounting errors leading to financial restatements based on information from the financial statements of companies. The manuscript analyzes whether financial statements include sufficient information to allow detection of unintentional accounting errors. Methods: Classification and regression trees (decision trees) and random forest were used to fulfill this aim. The data sample consisted of 400 items from the financial statements of 80 selected international companies. The results of the developed prediction models were compared and explained based on their accuracy, sensitivity, specificity, precision and F1 score. Statistical relationships among variables were tested by correlation analysis. Differences between the groups of companies with and without unintentional accounting errors were tested by means of the Kruskal-Wallis test. Differences among the models were tested by Levene's test and t-tests. Findings & value added: The results of the analysis provide evidence that it is possible to detect unintentional accounting errors with high levels of accuracy based on financial ratios (rather than the Beneish variables) and by applying the random forest method (rather than the classification and regression tree method).
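The five evaluation measures named in the abstract (accuracy, sensitivity, specificity, precision and F1 score) can all be derived from the four cells of a binary confusion matrix. The following is a minimal Python sketch of those standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification measures from confusion-matrix counts.

    tp / fn: positive cases (e.g. restatements) that were flagged / missed.
    tn / fp: negative cases that were correctly passed / wrongly flagged.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the two classes are unbalanced, since a model can reach high accuracy by favoring the majority class.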

DETECTION MODELS FOR UNINTENTIONAL FINANCIAL RESTATEMENTS

Journal of Business Economics and Management ◽

10.3846/jbem.2019.10179 ◽

2019 ◽

Vol 21 (1) ◽

pp. 64-86

Author(s):

Mário Papík ◽

Lenka Papíková

Keyword(s):

Logistic Regression ◽

Stock Price ◽

Prediction Models ◽

Financial Statements ◽

Annual Reports ◽

Accounting Fraud ◽

Linear Discriminant ◽

Detection Model ◽

Financial Restatements ◽

Accounting Errors

The aim of this manuscript is to analyze and identify determinants of honest accounting errors leading to financial restatements, based on data from the SEC database and from annual reports. The motivation for this study is that accounting errors are expensive for companies, which need to change already published financial statements, and they affect company reputation and stock price. Most authors focus on the prediction of accounting fraud, while financial restatements remain in the background of research. This study first tests the existing Beneish accounting fraud detection model on a sample of 40 financial restatement companies over 10 years and then develops two new pioneering prediction models, one based on linear discriminant analysis (LDA) and another based on logistic regression. On the testing dataset, the LDA model achieved an accuracy of 70.96%, a specificity of 25.00% and a sensitivity of 79.83%, and the logistic regression model achieved an accuracy of 62.22%, a specificity of 41.66% and a sensitivity of 66.67%. The performance of both models is better than that of the existing Beneish model or other studies in this field. Thanks to easily interpretable results in equation form, the developed models can be widely used by both internal and external users of financial statements who would like to determine whether the financial statements of an analyzed company include accounting errors.
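For any binary classifier, accuracy is a weighted average of sensitivity and specificity, with the weight given by the share of positive cases in the evaluated data. The sketch below rearranges that identity to show what class balance the reported percentages imply. The function name `implied_positive_share` is hypothetical, the inputs are the rounded figures from the abstract, and which class the authors treat as positive is an assumption here, so the results are only approximate.

```python
def implied_positive_share(accuracy: float, sensitivity: float, specificity: float) -> float:
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p.

    p is the share of positive cases in the data the metrics were computed on.
    """
    if sensitivity == specificity:
        raise ValueError("share is not identifiable when sensitivity equals specificity")
    return (accuracy - specificity) / (sensitivity - specificity)


# Rounded figures as reported in the abstract (testing dataset).
lda_share = implied_positive_share(0.7096, 0.7983, 0.2500)
logit_share = implied_positive_share(0.6222, 0.6667, 0.4166)

print(f"LDA: implied positive share = {lda_share:.3f}")
print(f"Logistic regression: implied positive share = {logit_share:.3f}")
```

A positive share far from one half would indicate an unbalanced test set, in which case accuracy alone says little and the low specificity figures deserve particular attention.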

Early detection of type II Diabetes Mellitus with random forest and classification and regression tree (CART)

2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA) ◽

10.1109/icaicta.2014.7005947 ◽

2014 ◽

Cited By ~ 4

Author(s):

M. T. Mira Kania Sabariah ◽

S. T. Aini Hanifa ◽

M. T. Siti Sa'adah

Keyword(s):

Diabetes Mellitus ◽

Random Forest ◽

Early Detection ◽

Type II Diabetes ◽

Regression Tree ◽

Type II Diabetes Mellitus ◽

Classification And Regression Tree ◽

Type II ◽

Classification And Regression

Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia

Landslides ◽

10.1007/s10346-015-0614-1 ◽

2015 ◽

Vol 13 (5) ◽

pp. 839-856 ◽

Cited By ~ 242

Author(s):

Ahmed Mohamed Youssef ◽

Hamid Reza Pourghasemi ◽

Zohre Sadat Pourtaghi ◽

Mohamed M. Al-Katheeri

Keyword(s):

Saudi Arabia ◽

Random Forest ◽

Landslide Susceptibility ◽

Linear Models ◽

Regression Tree ◽

Classification And Regression Tree ◽

Landslide Susceptibility Mapping ◽

Boosted Regression Tree ◽

Classification And Regression ◽

Asir Region

A comparative study of land subsidence susceptibility mapping of Tasuj plane, Iran, using boosted regression tree, random forest and classification and regression tree methods

Environmental Earth Sciences ◽

10.1007/s12665-020-08953-0 ◽

2020 ◽

Vol 79 (10) ◽

Author(s):

Hamid Ebrahimy ◽

Bakhtiar Feizizadeh ◽

Saeed Salmani ◽

Hossein Azadi

Keyword(s):

Random Forest ◽

Comparative Study ◽

Land Subsidence ◽

Regression Tree ◽

Susceptibility Mapping ◽

Classification And Regression Tree ◽

Boosted Regression Tree ◽

Tree Methods ◽

Classification And Regression

A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility

CATENA ◽

10.1016/j.catena.2016.11.032 ◽

2017 ◽

Vol 151 ◽

pp. 147-160 ◽

Cited By ~ 255

Author(s):

Wei Chen ◽

Xiaoshen Xie ◽

Jiale Wang ◽

Biswajeet Pradhan ◽

Haoyuan Hong ◽

...

Keyword(s):

Random Forest ◽

Comparative Study ◽

Landslide Susceptibility ◽

Logistic Model ◽

Regression Tree ◽

Spatial Prediction ◽

Classification And Regression Tree ◽

Tree Models ◽

Logistic Model Tree ◽

Classification And Regression

Predicting Ozone Layer Concentration Using Multivariate Adaptive Regression Splines, Random Forest and Classification and Regression Tree

Soft Computing Applications - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-319-62524-9_11 ◽

2017 ◽

pp. 140-152 ◽

Cited By ~ 2

Author(s):

Sanjiban Sekhar Roy ◽

Chitransh Pratyush ◽

Cornel Barna

Keyword(s):

Random Forest ◽

Regression Tree ◽

Ozone Layer ◽

Multivariate Adaptive Regression Splines ◽

Classification And Regression Tree ◽

Regression Splines ◽

Adaptive Regression ◽

Classification And Regression ◽

Adaptive Regression Splines

Hardware Implementation of Random Forest Algorithm Based on Classification and Regression Tree

10.1109/iciba50161.2020.9276928 ◽

2020 ◽

Author(s):

Ziheng Teng ◽

Lijian Chu ◽

Kai Chen ◽

Guoqiang He ◽

Yuxiang Fu ◽

...

Keyword(s):

Random Forest ◽

Hardware Implementation ◽

Regression Tree ◽

Classification And Regression Tree ◽

Random Forest Algorithm ◽

Classification And Regression

Modeling Road Accident Severity with Comparisons of Logistic Regression, Decision Tree and Random Forest

Information ◽

10.3390/info11050270 ◽

2020 ◽

Vol 11 (5) ◽

pp. 270 ◽

Cited By ~ 1

Author(s):

Mu-Ming Chen ◽

Mu-Chen Chen

Keyword(s):

Logistic Regression ◽

Random Forest ◽

Sensitivity And Specificity ◽

Prediction Models ◽

Regression Tree ◽

Positive Influence ◽

Road Accident ◽

Prediction Performance ◽

Classification And Regression Tree ◽

Accident Severity

To reduce the damage caused by road accidents, researchers have applied different techniques to explore correlated factors and develop efficient prediction models. The main purpose of this study is to use one statistical and two nonparametric data mining techniques, namely, logistic regression (LR), classification and regression tree (CART), and random forest (RF), to compare their prediction capability, identify the significant variables (identified by LR) and important variables (identified by CART or RF) that are strongly correlated with road accident severity, and distinguish the variables that have significant positive influence on prediction performance. In this study, three prediction performance evaluation measures, accuracy, sensitivity and specificity, are used to find the best integrated method which consists of the most effective prediction model and the input variables that have higher positive influence on accuracy, sensitivity and specificity.

Comparing Ensemble-Based Machine Learning Classifiers Developed for Distinguishing Hypokinetic Dysarthria from Presbyphonia

Applied Sciences ◽

10.3390/app11052235 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2235

Author(s):

Haewon Byeon

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Area Under The Curve ◽

Regression Tree ◽

Prediction Performance ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Hypokinetic Dysarthria ◽

Order Of Magnitude

It is essential to understand voice characteristics in the normal aging process in order to accurately distinguish presbyphonia from neurological voice disorders. This study developed ensemble-based machine learning classifiers to distinguish hypokinetic dysarthria from presbyphonia using classification and regression tree (CART), random forest, gradient boosting algorithm (GBM), and XGBoost, and compared the prediction performance of the models. The subjects of this study were 76 elderly patients diagnosed with hypokinetic dysarthria and 174 patients with presbyphonia. The accuracy, sensitivity, and specificity of the developed models were compared to assess their prediction performance. The results showed that random forest had the best prediction performance on the test dataset (accuracy = 0.83, sensitivity = 0.90, specificity = 0.80, and area under the curve (AUC) = 0.85). The main predictors for detecting hypokinetic dysarthria were, in descending order of importance, cepstral peak prominence (CPP), jitter, shimmer, L/H ratio, L/H ratio_SD, CPP max (dB), CPP min (dB), and CPPF0. Among them, CPP was the most important predictor for identifying hypokinetic dysarthria.
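The area under the curve (AUC) reported alongside accuracy, sensitivity and specificity can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A minimal Python sketch of that pairwise definition follows; the function name `roc_auc` and the labels and scores are hypothetical illustrations, not data from the study.

```python
def roc_auc(labels: list, scores: list) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical labels (1 = positive class) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

Unlike accuracy, sensitivity and specificity, this measure does not depend on a single decision threshold, which is one reason it is often reported together with them.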
