Establishing Causality to Detect Fraud in Financial Statements

Author(s):  
Kiran Maka ◽  
S. Pazhanirajan ◽  
Sujata Mallapur

In this work, two approaches are presented to derive the important variables that an auditor should watch out for during the audit of a financial statement. To achieve this goal, machine learning modeling is leveraged. In the first approach, important features or variables are derived with an ensemble method; in the second approach, an explainable model is used to corroborate and expand the conclusions drawn from the ensemble method. A manually labeled dataset of financial statements is utilized for this purpose. Four measures, namely the random forest recommendations of the first approach, the Random Forest Explainer p-value, the Random Forest Explainer first multi-way importance plot, and the Random Forest Explainer second multi-way importance plot, are employed to derive the important features. A final list of six variables is derived from these two approaches and four measures.
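As a rough illustration of this kind of workflow (not the authors' pipeline), the sketch below ranks candidate red-flag variables with a random forest and cross-checks the ranking with permutation importance as a second, model-agnostic opinion. The file name and the label column are hypothetical placeholders.

```python
# Minimal sketch: ranking candidate red-flag variables with a random forest.
# The CSV file and the "is_fraud" column are hypothetical, not the paper's data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("labeled_financial_statements.csv")   # hypothetical labeled dataset
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Impurity-based importances: the "ensemble" ranking.
impurity_rank = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Permutation importances on held-out data: a second opinion standing in for the
# explainable-model step described in the abstract.
perm = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
perm_rank = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)

print(impurity_rank.head(6))
print(perm_rank.head(6))
```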

2021 ◽  
Author(s):  
Rushad Ravilievich Rakhimov ◽  
Oleg Valerievich Zhdaneev ◽  
Konstantin Nikolaevich Frolov ◽  
Maxim Pavlovich Babich

Abstract The ultimate objective of this paper is to describe the experience of using a machine learning model, prepared with an ensemble method, to prevent stuck pipe events during the well construction process on extended reach wells. The tasks performed include collecting, analyzing, and cleaning historical data, selecting and preparing a machine learning model, and testing it on real-time data by means of a desktop application. The idea is to display the solution at the rig floor, allowing the driller to quickly take action to prevent a stuck pipe event. Historical data mining and analysis were performed using software for remote monitoring. Preparation, labelling, and cleaning of historical and real-time data were executed using programmable scripts and big data techniques. The machine learning algorithm was developed using an ensemble method, which combines several models to improve the final result. On the field of interest, the most common type of stuck pipe event is the solids-induced pack-off. These occur due to insufficient hole cleaning of drilled cuttings and wellbore collapse caused by rock instability. Stuck pipe prevention on extended reach drilling (ERD) wells requires a holistic approach, while the final role is assigned to the driller. Because the ERD envelope is continuously being extended and the workload on both personnel and drilling equipment is increasing, the effectiveness of accident prevention is deteriorating. This leads to severe consequences: the Bottom Hole Assembly lost in hole, the need to re-drill the bore, and ultimately increased Non-Productive Time (NPT). The developed application, based on an ensemble machine learning algorithm, shows prediction accuracy above 94%. Reacting to alarms, the driller can quickly take measures to prevent downhole accidents during well construction of ERD wells.
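A minimal sketch of the general pattern described here: an ensemble classifier trained on labeled historical runs, scoring incoming real-time channels and raising a rig-floor alarm when stuck-pipe risk is high. The channel names, file name, threshold, and windowing are illustrative assumptions, not the authors' production system.

```python
# Hedged sketch: ensemble model scoring real-time drilling channels for stuck-pipe risk.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["hookload", "surface_torque", "standpipe_pressure", "rpm", "rop", "flow_rate"]

hist = pd.read_csv("historical_runs_labeled.csv")   # hypothetical cleaned, labeled history
model = GradientBoostingClassifier(random_state=0).fit(hist[FEATURES], hist["stuck_event"])

def score_window(latest: pd.DataFrame, threshold: float = 0.8) -> bool:
    """Return True if the latest window of sensor data should trigger a rig-floor alarm."""
    risk = model.predict_proba(latest[FEATURES])[:, 1].max()
    return risk >= threshold
```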


Author(s):  
Deepika Bansal ◽  
Kavita Khanna ◽  
Rita Chhikara ◽  
Rakesh Kumar Dua ◽  
Rajeev Malhotra

Objective: Dementia is a progressive neurodegenerative brain disease that results in the death of nerve cells and is emerging as a global health problem in adults aged 65 years or above. The elimination of redundant and irrelevant features from the datasets is therefore necessary for accurate detection and thus timely treatment of dementia. Methods: For this purpose, an ensemble of univariate and multivariate feature selection methods is proposed in this study. A comparison of four univariate feature selection techniques (t-Test, Wilcoxon, Entropy, and ROC) and six multivariate feature selection approaches (ReliefF, Bhattacharyya, CFSSubsetEval, ClassifierAttributeEval, CorrelationAttributeEval, OneRAttributeEval) has been performed. The ensemble of the best univariate and multivariate filter algorithms is proposed, which helps acquire a subset containing only relevant and non-redundant features. The classification is performed using the Naïve Bayes, k-NN, and Random Forest algorithms. Results: Experimental results show that t-Test and ReliefF feature selection can select 10 relevant features that give the same accuracy as when all features are considered. In addition, the accuracy obtained using k-NN with the ensemble approach is 99.96%. The statistical significance of the method has been established using Friedman's statistical test. Conclusion: The new ranking criteria computed by the ensemble method efficiently eliminate insignificant features and reduce the computational cost of the algorithm. The ensemble method has been compared with other approaches to establish the superiority of the proposed model. Discussion: The percentage gain in accuracy is reported for all three classifiers, Naïve Bayes, k-NN, and Random Forest; a remarkable difference in the gain after applying feature selection is noted for Naïve Bayes and k-NN. Among the univariate filter selection methods, the t-Test stands out when selecting only 10-feature subsets.
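The sketch below illustrates the ensemble-filter idea under stated assumptions: intersect a univariate ranking (two-sample t-test) with a multivariate ranking, then classify with k-NN. Mutual information stands in for ReliefF, which is not part of scikit-learn; X and y are assumed to be NumPy arrays, and k=10 mirrors the 10-feature subset above.

```python
# Illustrative "ensemble" filter: keep features both a univariate and a
# multivariate ranking agree on, then evaluate a k-NN classifier on them.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def ensemble_select(X, y, k=10):
    # Univariate score: absolute t-statistic between the two classes.
    t_stat, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    uni_rank = np.argsort(-np.abs(t_stat))[:2 * k]
    # Multivariate stand-in: mutual information with the class label.
    mi = mutual_info_classif(X, y, random_state=0)
    multi_rank = np.argsort(-mi)[:2 * k]
    # Keep features both filters agree on, capped at k.
    return np.array(sorted(set(uni_rank) & set(multi_rank)))[:k]

# selected = ensemble_select(X, y)
# acc = cross_val_score(KNeighborsClassifier(), X[:, selected], y, cv=5).mean()
```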


Author(s):  
Dhyan Chandra Yadav ◽  
Saurabh Pal

This paper organizes a heart-disease dataset from the UCI repository. The dataset describes the correlations of the variables with the class-level target variable, and the experiment analyzes these variables with different machine learning algorithms. The authors reviewed previous prediction work and found that some machine learning algorithms did not work properly or did not reach 100% classification accuracy because of overfitting, underfitting, noisy data, and residual errors at the base-level decision tree. This research uses Pearson correlation and chi-square feature selection algorithms to assess the correlation strength of the heart disease attributes. The main objective is to achieve the highest classification accuracy with the fewest errors, so the authors use parallel and sequential ensemble methods to reduce the above drawbacks in prediction. The parallel and sequential ensembles are built from decision-tree-based algorithms: J48, reduced error pruning, and decision stump. The paper uses the random forest ensemble method for parallel random selection in prediction and several sequential ensemble methods, namely the AdaBoost, Gradient Boosting, and XGBoost meta-classifiers. The experiment is divided into two parts, as sketched below. The first part combines J48, reduced error pruning, and decision stump into a random forest ensemble; this parallel ensemble achieves a high classification accuracy of 100% with low error. The second part combines J48, reduced error pruning, and decision stump with three sequential ensemble methods, namely AdaBoostM1, XGBoost, and Gradient Boosting. The XGBoost ensemble produces better results, with higher classification accuracy and lower error, than the AdaBoostM1 and Gradient Boosting ensembles: XGBoost reaches 98.05% classification accuracy, while the random forest ensemble reaches 100% with low error.
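A compact sketch of the parallel-versus-sequential comparison described above: chi-square feature selection, then a random forest (parallel, bagging-style ensemble) against AdaBoost, Gradient Boosting, and XGBoost (sequential boosting ensembles). The file name, the number of selected features, and the 10-fold setup are assumptions for illustration.

```python
# Sketch: chi-square feature selection followed by parallel vs. sequential ensembles.
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier   # assumes the xgboost package is installed

df = pd.read_csv("heart_uci.csv")                 # hypothetical local copy of the UCI data
X, y = df.drop(columns=["target"]), df["target"]
X_sel = SelectKBest(chi2, k=8).fit_transform(X.abs(), y)   # chi2 needs non-negative inputs

models = {
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=0),
    "AdaBoostM1":   AdaBoostClassifier(random_state=0),
    "GradBoost":    GradientBoostingClassifier(random_state=0),
    "XGBoost":      XGBClassifier(eval_metric="logloss", random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X_sel, y, cv=10).mean()
    print(f"{name}: {acc:.4f}")
```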


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xinchun Liu

Financial supervision plays an important role in the construction of a market economy, but financial data are nonstationary and nonlinear and have a low signal-to-noise ratio, so an effective financial detection method is needed. In this paper, two machine learning algorithms, decision tree and random forest, are used to examine companies' financial data. Firstly, based on the financial data of 100 sample listed companies, this paper makes an empirical study of financial statement fraud by listed companies using machine learning techniques. Preliminary results are obtained through the empirical analysis of logistic regression, gradient boosting decision tree, and random forest models, and a random forest model is then used for a secondary judgment. In this way the paper constructs an efficient, accurate, and simple comprehensive machine learning application model. The empirical results show that the comprehensive model constructed in this paper achieves an accuracy of 96.58% in judging abnormal financial data of listed companies. The paper puts forward an accurate and practical method for capital market participants to identify financial statement fraud by listed companies, and it has practical significance for investors and securities research institutions dealing with financial statement fraud.
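One plausible reading of the two-stage design, sketched under stated assumptions: a first pass with logistic regression, gradient boosting, and a random forest, whose out-of-fold fraud probabilities feed a second random forest for the final judgment. The file name and column names are placeholders, not the study's data.

```python
# Sketch of a two-stage (stacking-style) pipeline for financial-fraud judgment.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

df = pd.read_csv("listed_company_financials.csv")      # hypothetical sample of 100 firms
X, y = df.drop(columns=["fraud"]), df["fraud"]

stage_one = [
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=300, random_state=0),
]
# Out-of-fold fraud probabilities from each first-stage model become meta-features.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in stage_one
])
# A second-stage random forest makes the final call on the preliminary results.
final_model = RandomForestClassifier(n_estimators=300, random_state=0).fit(meta_features, y)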


Author(s):  
Nguyen Tien Hung ◽  
Huynh Van Sau

The study was conducted to identify fraudulent financial statements at listed companies (DNNY) on the Ho Chi Minh City Stock Exchange (HOSE) through the fraud triangle framework set out in Vietnamese Standard on Auditing (VSA) 240, and to assess the suitability of this model in the Vietnamese market. The results show that the model is based on two factors, the ratio of sales to total assets and return on assets; one opportunity factor (education level); and two attitude factors (change of independent auditor and the independent auditor's opinion). The model correctly classifies more than 78% of the surveyed sample firms and nearly 72% of firms outside the research sample. Keywords: fraud triangle, fraudulent financial reporting, VSA 240


Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function's ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the 'comparison' concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.
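A rough sketch of one way a 'comparison' setup can be realized on an unbalanced pose set: train a random forest on feature differences between two poses of the same complex, labeled by which pose is native. The GARF-derived feature construction is abstracted away, and the arrays below are placeholders; this is an interpretation for illustration, not the authors' exact scheme.

```python
# Sketch: pairwise-comparison random forest for ranking ligand poses.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
native_feats = rng.normal(size=(200, 40))     # placeholder descriptors of native poses
decoy_feats = rng.normal(size=(200, 40))      # placeholder descriptors of decoy poses

# Each training example is a comparison: native-minus-decoy (label 1) and
# decoy-minus-native (label 0), which balances the classes by construction.
X = np.vstack([native_feats - decoy_feats, decoy_feats - native_feats])
y = np.concatenate([np.ones(len(native_feats)), np.zeros(len(decoy_feats))])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

def pick_better(pose_a, pose_b) -> int:
    """Return 0 if pose_a is judged more native-like than pose_b, else 1."""
    p = rf.predict_proba((pose_a - pose_b).reshape(1, -1))[0, 1]
    return 0 if p >= 0.5 else 1
```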


2020 ◽  
Vol 25 (1) ◽  
pp. 29-44
Author(s):  
Mariati ◽  
Emmy Indrayani

A company's financial condition is reflected in its financial statements. However, there are many loopholes in financial statements that can give management and certain parties a chance to commit fraud. This study aims to detect financial statement fraud, measured using the fraud score model, among issuers included in the LQ-45 index in 2014-2016, with six independent variables: financial stability, external pressure, financial target, nature of industry, ineffective monitoring, and rationalization. The study covers 27 issuers of the LQ-45 index during 2014-2016; after removing outliers, the final sample comprises 66 observations from 25 companies. Multiple linear regression analysis is used. The results show that the financial stability (SATA), nature of industry (RECEIVABLE), ineffective monitoring (IND), and rationalization (ITRENDLB) variables are able to detect financial statement fraud, while the external pressure (DER) and financial target (ROA) variables are not. Simultaneously, all variables in this study were able to detect financial statement fraud significantly.
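A minimal sketch of this regression setup, assuming a fraud-score column as the dependent variable and the six proxies as regressors; the file name and column names are illustrative, not the study's data.

```python
# Sketch: multiple linear regression of the fraud score on the six proxy variables.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("lq45_fraud_score_panel.csv")          # hypothetical 66-observation sample
X = sm.add_constant(df[["SATA", "DER", "ROA", "RECEIVABLE", "IND", "ITRENDLB"]])
model = sm.OLS(df["F_SCORE"], X).fit()
print(model.summary())   # per-variable t-tests plus the joint F-test ("simultaneous" effect)
```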


2018 ◽  
Vol 23 (1) ◽  
pp. 72-85
Author(s):  
Lasminisih ◽  
Emmy Indrayani

A company's financial statements can be used to monitor its performance. Financial statements are also used as a means of decision making so that the company can plan ahead. The purpose of this study was to find out the effect of the Capital Adequacy Ratio (CAR), Loan to Deposit Ratio (LDR), and Return on Assets (ROA) on the percentage change in profit of banking companies. The sample comprised 27 banks listed on the Indonesia Stock Exchange, with an observation period from 2007 to 2008. The method used in this study was multiple regression. The results indicate that CAR, LDR, and ROA had significant effects on changes in bank profit, so that the performance of banking companies can be measured. Keywords: CAR, LDR, ROA, Profit


2020 ◽  
Vol 15 (2) ◽  
pp. 121-134 ◽  
Author(s):  
Eunmi Kwon ◽  
Myeongji Cho ◽  
Hayeon Kim ◽  
Hyeon S. Son

Background: The host tropism determinants of influenza virus, which cause changes in the host range and increase the likelihood of interaction with specific hosts, are critical for understanding the infection and propagation of the virus in diverse host species. Methods: Six types of protein sequences of influenza viral strains isolated from three classes of hosts (avian, human, and swine) were obtained. Random forest, naïve Bayes classification, and k-nearest neighbor algorithms were used for host classification. The Java language was used for sequence analysis programming and for identifying host-specific position markers. Results: A machine learning technique was explored to derive the physicochemical properties of amino acids used in host classification and prediction. The HA protein was found to play the most important role in determining the host tropism of the influenza virus, and the random forest method yielded the highest accuracy in host prediction. Conserved amino acids that exhibited host-specific differences were also selected and verified, and they were found to be useful position markers for host classification. Finally, ANOVA and post-hoc testing revealed that the physicochemical properties of amino acids, comprising protein sequences combined with position markers, differed significantly among hosts. Conclusion: The host tropism determinants and position markers described in this study can be used in related research to classify, identify, and predict the hosts of influenza viruses that are currently susceptible or likely to be infected in the future.
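A hedged sketch of the classification step: turn each viral protein sequence into simple compositional features and train a random forest to predict the host class. Amino acid frequencies are used here as a simplified stand-in for the physicochemical properties in the study, and the sequences and labels are placeholders.

```python
# Sketch: amino acid composition features + random forest host classification.
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq: str) -> list[float]:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

sequences = ["MKAILVVLLYTFATANA", "MKTIIALSYIFCLALG"]   # placeholder HA fragments
hosts = ["avian", "human"]                              # placeholder host labels

X = [composition(s) for s in sequences]
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, hosts)
# rf.predict([composition(new_sequence)]) -> predicted host class
```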


2018 ◽  
Author(s):  
Liyan Pan ◽  
Guangjian Liu ◽  
Xiaojian Mao ◽  
Huixian Li ◽  
Jiexin Zhang ◽  
...  

BACKGROUND Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The diagnostic method, the gonadotropin-releasing hormone (GnRH) stimulation test or GnRH analogue (GnRHa) stimulation test, is expensive and uncomfortable for patients because of the need for repeated blood sampling. OBJECTIVE We aimed to combine multiple CPP-related features and construct machine learning models to predict the response to the GnRHa stimulation test. METHODS In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of the response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured the sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models. RESULTS Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable LIME models, the abovementioned variables made high contributions to the prediction probability. CONCLUSIONS The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa stimulation test.
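The sketch below shows the general modeling-plus-explanation pattern described here: an XGBoost classifier for GnRHa-test response, explained per patient with LIME. The file name, feature names, and the number of explained features are illustrative assumptions, not the study's dataset or configuration.

```python
# Sketch: XGBoost classifier with per-patient LIME explanations.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

df = pd.read_csv("gnrha_cohort.csv")                     # hypothetical de-identified cohort
features = ["basal_LH", "basal_FSH", "IGF_1", "age", "bone_age", "BMI"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["positive_response"], stratify=df["positive_response"], random_state=0)

clf = XGBClassifier(eval_metric="logloss", random_state=0).fit(X_train.values, y_train)

explainer = LimeTabularExplainer(X_train.values, feature_names=features,
                                 class_names=["negative", "positive"], mode="classification")
exp = explainer.explain_instance(X_test.values[0], clf.predict_proba, num_features=5)
print(exp.as_list())      # local feature contributions for one girl's prediction
```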

