scholarly journals Empirical Analysis of Financial Statement Fraud of Listed Companies Based on Logistic Regression and Random Forest Algorithm

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xinchun Liu

Financial supervision plays an important role in the construction of market economy, but financial data has the characteristics of being nonstationary and nonlinear and low signal-to-noise ratio, so an effective financial detection method is needed. In this paper, two machine learning algorithms, decision tree and random forest, are used to detect the company's financial data. Firstly, based on the financial data of 100 sample listed companies, this paper makes an empirical study on the fraud of financial statements of listed companies by using machine learning technology. Through the empirical analysis of logistic regression, gradient lifting decision tree, and random forest model, the preliminary results are obtained, and then the random forest model is used for secondary judgment. This paper constructs an efficient, accurate, and simple comprehensive application model of machine learning. The empirical results show that the comprehensive application model constructed in this paper has an accuracy of 96.58% in judging the abnormal financial data of listed companies. The paper puts forward an accurate and practical method for capital market participants to identify the fraud of financial statements of listed companies and has certain practical significance for investors and securities research institutions to deal with the fraud of financial statements.

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
R. Shashikant ◽  
P. Chetankumar

Cardiac arrest is a severe heart anomaly that results in billions of annual casualties. Smoking is a specific hazard factor for cardiovascular pathology, including coronary heart disease, but data on smoking and heart death not earlier reviewed. The Heart Rate Variability (HRV) parameters used to predict cardiac arrest in smokers using machine learning technique in this paper. Machine learning is a method of computing experience based on automatic learning and enhances performances to increase prognosis. This study intends to compare the performance of logistical regression, decision tree, and random forest model to predict cardiac arrest in smokers. In this paper, a machine learning technique implemented on the dataset received from the data science research group MITU Skillogies Pune, India. To know the patient has a chance of cardiac arrest or not, developed three predictive models as 19 input feature of HRV indices and two output classes. These model evaluated based on their accuracy, precision, sensitivity, specificity, F1 score, and Area under the curve (AUC). The model of logistic regression has achieved an accuracy of 88.50%, precision of 83.11%, the sensitivity of 91.79%, the specificity of 86.03%, F1 score of 0.87, and AUC of 0.88. The decision tree model has arrived with an accuracy of 92.59%, precision of 97.29%, the sensitivity of 90.11%, the specificity of 97.38%, F1 score of 0.93, and AUC of 0.94. The model of the random forest has achieved an accuracy of 93.61%, precision of 94.59%, the sensitivity of 92.11%, the specificity of 95.03%, F1 score of 0.93 and AUC of 0.95. The random forest model achieved the best accuracy classification, followed by the decision tree, and logistic regression shows the lowest classification accuracy.


Author(s):  
Krishna Kumar Mohbey

In any industry, attrition is a big problem, whether it is about employee attrition of an organization or customer attrition of an e-commerce site. If we can accurately predict which customer or employee will leave their current company or organization, then it will save much time, effort, and cost of the employer and help them to hire or acquire substitutes in advance, and it would not create a problem in the ongoing progress of an organization. In this chapter, a comparative analysis between various machine learning approaches such as Naïve Bayes, SVM, decision tree, random forest, and logistic regression is presented. The presented result will help us in identifying the behavior of employees who can be attired over the next time. Experimental results reveal that the logistic regression approach can reach up to 86% accuracy over other machine learning approaches.


Cardiovascular diseases are one of the main causes of mortality in the world. A proper prediction mechanism system with reasonable cost can significantly reduce this death toll in the low-income countries like Bangladesh. For those countries we propose machine learning backed embedded system that can predict possible cardiac attack effectively by excluding the high cost angiogram and incorporating only twelve (12) low cost features which are age, sex, chest pain, blood pressure, cholesterol, blood sugar, ECG results, heart rate, exercise induced angina, old peak, slope, and history of heart disease. Here, two heart disease datasets of own built NICVD (National Institute of Cardiovascular Disease, Bangladesh) patients’, and UCI (University of California Irvin) are used. The overall process comprises into four phases: Comprehensive literature review, collection of stable angina patients’ data through survey questionnaires from NICVD, feature vector dimensionality is reduced manually (from 14 to 12 dimensions), and the reduced feature vector is fed to machine learning based classifiers to obtain a prediction model for the heart disease. From the experiments, it is observed that the proposed investigation using NICVD patient’s data with 12 features without incorporating angiographic disease status to Artificial Neural Network (ANN) shows better classification accuracy of 92.80% compared to the other classifiers Decision Tree (82.50%), Naïve Bayes (85%), Support Vector Machine (SVM) (75%), Logistic Regression (77.50%), and Random Forest (75%) using the 10-fold cross validation. To accommodate small scale training and test data in our experimental environment we have observed the accuracy of ANN, Decision Tree, Naïve Bayes, SVM, Logistic Regression and Random Forest using Jackknife method, which are 84.80%, 71%, 75.10%, 75%, 75.33% and 71.42% respectively. On the other hand, the classification accuracies of the corresponding classifiers are 91.7%, 76.90%, 86.50%, 76.3%, 67.0% and 67.3%, respectively for the UCI dataset with 12 attributes. Whereas the same dataset with 14 attributes including angiographic status shows the accuracies 93.5%, 76.7%, 86.50%, 76.8%, 67.7% and 69.6% for the respective classifiers


Author(s):  
M. Nirmala

Abstract: Data Mining in Educational System has increased tremendously in the past and still increasing in present era. This study focusses on the academic stand point and the performance of the student is evaluated by various parameters such as Scholastic Features, Demographic Features and Emotional Features are carried out. Various Machine learning methodologies are adopted to extract the masked knowledge from the educational data set provided, which helps in identifying the features giving more impact to the student academic performance and there by knowing the impacting features, helps us to predict deeper insights about student performance in academics. Various Machine learning workflow starting from problem definition to Model Prediction has been carried out in this study. The supervised learning methodology has been adopted and various Feature engineering methods has been adopted to make the ML model appropriate for training and evaluation. It is a prediction problem and various Classification algorithms such as Logistic Regression, Random Forest, SVM, KNN, XGBOOST, Decision Tree modelling has been done to fit the student data appropriately. Keywords: Scholastic, Demographic, Emotional, Logistic Regression, Random Forest, SVM, KNN, XGBOOST, Decision Tree.


2021 ◽  
Author(s):  
Hemalatha N ◽  
Akhil Wilson ◽  
Akhil Thankachan

Plastic pollution is one of the challenging problems in the environment. But a life without plastic we cannot imagine. This paper deals with the prediction of plastic degrading microbes using Machine Learning. Here we have used Decision Tree, Random Forest, Support vector Machine and K Nearest Neighbor algorithms in order to predict the plastic degrading microbes. Among the four classifiers, Random Forest model gave the best accuracy of 99.1%.


MATICS ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 21-27
Author(s):  
Via Ardianto Nugroho ◽  
Derry Pramono Adi ◽  
Achmad Teguh Wibowo ◽  
MY Teguh Sulistyono ◽  
Agustinus Bimo Gumelar

Pada industri jasa pelayanan peti kemas, Terminal Nilam merupakan pelanggan dari PT. BIMA, yang secara khusus bergerak dibidang jasa perbaikan dan perawatan alat berat. Terminal ini menjadi sentral tempat untuk melakukan aktifitas bongkar muat peti kemas domestik yang memiliki empat buah container crane untuk melayani dua kapal. Proses perawatan alat berat seperti container crane yang selama ini beroperasi, agaknya kurang memperhatikan data pengelompokkan atau klasifikasi jenis perawatan yang dibutuhkan oleh alat berat tersebut. Di kemudian hari, alat berat dapat menunjukkan kinerja yang tidak maksimal bahkan dapat berujung pada kecelakaan kerja. Selain itu, kelalaian perawatan container crane juga dapat menyebabkan pembengkakan biaya perawatan lanjut. Target produksi bongkar muat dapat berkurang dan juga keterlambatan jadwal kapal sandar sangat mungkin terjadi. Metode pembelajaran menggunakan mesin atau biasa disebut dengan Machine Learning (ML), dengan mudah dapat melenyapkan kemungkinan-kemungkinan tersebut. ML dalam penelitian ini, kami rancang agar bekerja dengan mengidentifikasi lalu mengelompokkan jenis perawatan container crane yang sesuai, yaitu ringan atau berat. Metode ML yang pilih untuk digunakan dalam penelitian ini yaitu Random Forest, Support Vector Machine, k-Nearest Neighbor, Naïve Bayes, Logistic Regression, J48, dan Decision Tree. Penelitian ini menunjukkan keberhasilan ML model tree dalam melakukan pembelajaran jenis data perawatan container crane (numerik dan kategoris), dengan J48 menunjukkan performa terbaik dengan nilai akurasi dan nilai ROC-AUC mencapai 99,1%. Pertimbangan klasifikasi kami lakukan dengan mengacu kepada tanggal terakhir perawatan, hour meter, breakdown, shutdown, dan sparepart.


2019 ◽  
Vol 8 (4) ◽  
pp. 1477-1483

With the fast moving technological advancement, the internet usage has been increased rapidly in all the fields. The money transactions for all the applications like online shopping, banking transactions, bill settlement in any industries, online ticket booking for travel and hotels, Fees payment for educational organization, Payment for treatment to hospitals, Payment for super market and variety of applications are using online credit card transactions. This leads to the fraud usage of other accounts and transaction that result in the loss of service and profit to the institution. With this background, this paper focuses on predicting the fraudulent credit card transaction. The Credit Card Transaction dataset from KAGGLE machine learning Repository is used for prediction analysis. The analysis of fraudulent credit card transaction is achieved in four ways. Firstly, the relationship between the variables of the dataset is identified and represented by the graphical notations. Secondly, the feature importance of the dataset is identified using Random Forest, Ada boost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting and Naive Bayes classifiers. Thirdly, the extracted feature importance if the credit card transaction dataset is fitted to Random Forest classifier, Ada boost classifier, Logistic Regression classifier, Decision Tree classifier, Extra Tree classifier, Gradient Boosting classifier and Naive Bayes classifier. Fourth, the Performance Analysis is done by analyzing the performance metrics like Accuracy, FScore, AUC Score, Precision and Recall. The implementation is done by python in Anaconda Spyder Navigator Integrated Development Environment. Experimental Results shows that the Decision Tree classifier have achieved the effective prediction with the precision of 1.0, recall of 1.0, FScore of 1.0 , AUC Score of 89.09 and Accuracy of 99.92%.


2021 ◽  
Vol 5 (1) ◽  
pp. 35
Author(s):  
Uttam Narendra Thakur ◽  
Radha Bhardwaj ◽  
Arnab Hazra

Disease diagnosis through breath analysis has attracted significant attention in recent years due to its noninvasive nature, rapid testing ability, and applicability for patients of all ages. More than 1000 volatile organic components (VOCs) exist in human breath, but only selected VOCs are associated with specific diseases. Selective identification of those disease marker VOCs using an array of multiple sensors are highly desirable in the current scenario. The use of efficient sensors and the use of suitable classification algorithms is essential for the selective and reliable detection of those disease markers in complex breath. In the current study, we fabricated a noble metal (Au, Pd and Pt) nanoparticle-functionalized MoS2 (Chalcogenides, Sigma Aldrich, St. Louis, MO, USA)-based sensor array for the selective identification of different VOCs. Four sensors, i.e., pure MoS2, Au/MoS2, Pd/MoS2, and Pt/MoS2 were tested under exposure to different VOCs, such as acetone, benzene, ethanol, xylene, 2-propenol, methanol and toluene, at 50 °C. Initially, principal component analysis (PCA) and linear discriminant analysis (LDA) were used to discriminate those seven VOCs. As compared to the PCA, LDA was able to discriminate well between the seven VOCs. Four different machine learning algorithms such as k-nearest neighbors (kNN), decision tree, random forest, and multinomial logistic regression were used to further identify those VOCs. The classification accuracy of those seven VOCs using KNN, decision tree, random forest, and multinomial logistic regression was 97.14%, 92.43%, 84.1%, and 98.97%, respectively. These results authenticated that multinomial logistic regression performed best between the four machine learning algorithms to discriminate and differentiate the multiple VOCs that generally exist in human breath.


2021 ◽  
Vol 2096 (1) ◽  
pp. 012190
Author(s):  
E V Bunyaeva ◽  
I V Kuznetsov ◽  
Y V Ponomarchuk ◽  
P S Timosh

Abstract The paper considers comparative analysis results of the machine learning methods used for the gesture recognition based on the surface single-channel electromyography (sEMG) data. The data were processed using multilayer perceptron, support vector machine, decision tree ensemble (Random Forest) and logistic regression for the chosen four gesture types. The conclusion was derived on the analysis efficiency of these methods using commonly recommended accuracy metrics.


Author(s):  
Joshua J. Levy ◽  
A. James O’Malley

AbstractBackgroundMachine learning approaches have become increasingly popular modeling techniques, relying on data-driven heuristics to arrive at its solutions. Recent comparisons between these algorithms and traditional statistical modeling techniques have largely ignored the superiority gained by the former approaches due to involvement of model-building search algorithms. This has led to alignment of statistical and machine learning approaches with different types of problems and the under-development of procedures that combine their attributes. In this context, we hoped to understand the domains of applicability for each approach and to identify areas where a marriage between the two approaches is warranted. We then sought to develop a hybrid statistical-machine learning procedure with the best attributes of each.MethodsWe present three simple examples to illustrate when to use each modeling approach and posit a general framework for combining them into an enhanced logistic regression model building procedure that aids interpretation. We study 556 benchmark machine learning datasets to uncover when machine learning techniques outperformed rudimentary logistic regression models and so are potentially well-equipped to enhance them. We illustrate a software package, InteractionTransformer, which embeds logistic regression with advanced model building capacity by using machine learning algorithms to extract candidate interaction features from a random forest model for inclusion in the model. Finally, we apply our enhanced logistic regression analysis to two real-word biomedical examples, one where predictors vary linearly with the outcome and another with extensive second-order interactions.ResultsPreliminary statistical analysis demonstrated that across 556 benchmark datasets, the random forest approach significantly outperformed the logistic regression approach. We found a statistically significant increase in predictive performance when using hybrid procedures and greater clarity in the association with the outcome of terms acquired compared to directly interpreting the random forest output.ConclusionsWhen a random forest model is closer to the true model, hybrid statistical-machine learning procedures can substantially enhance the performance of statistical procedures in an automated manner while preserving easy interpretation of the results. Such hybrid methods may help facilitate widespread adoption of machine learning techniques in the biomedical setting.


Sign in / Sign up

Export Citation Format

Share Document