IDENTIFICATION OF FRAUDULENT FINANCIAL OPERATIONS USING A MACHINE LEARNING ALGORITHM

Author(s):  
S. L. Belyakov ◽  
S. M. Karpov

This work addresses the problem of automatic detection of fraudulent financial transactions. The article describes the causes of fraudulent transactions and their typical attributes, as well as the basic principle of detection. The concepts of fraudulent and honest transactions are defined. Examples of algorithms for flagging suspicious financial transactions in antifraud systems are given. Modern approaches to monitoring and detecting fraud in remote banking systems are considered, and the positive and negative aspects of each approach are described. Particular attention is paid to the problem of optimally recognizing transaction classes in highly imbalanced data, and methods for handling imbalanced data are considered. The choice of metrics for evaluating the machine learning model is justified with regard to the skewed class distribution. As a solution, we propose an approach based on ensemble classifiers combined with balanced sampling algorithms, whose key feature is that a balanced sample is created not for the entire classifier but for each base estimator in the ensemble separately. Using data on bank credit card fraud, the ensemble algorithms random forest, adaptive boosting, and bagging of decision trees are compared and the best classifier is selected. Balanced subsets for the ensemble estimators are created by random undersampling, and the optimal classifier parameters are found by randomized search over a parameter grid. The results of an experimental comparison of the selected methods are presented, the advantages of the proposed approach are analyzed, and the boundaries of its applicability are discussed.
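
A minimal sketch of the per-estimator balancing idea described above (not the paper's code): imbalanced-learn's BalancedBaggingClassifier draws a randomly undersampled, balanced subset for each tree in the ensemble rather than rebalancing the whole training set once, and a randomized search tunes the ensemble with an imbalance-aware metric. The data and search ranges below are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from imblearn.ensemble import BalancedBaggingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))            # placeholder transaction features
y = (rng.random(5000) < 0.01).astype(int)  # ~1% "fraud" class, heavily imbalanced

# Each bagged tree is fit on its own balanced, randomly undersampled subset.
clf = BalancedBaggingClassifier(DecisionTreeClassifier(), n_estimators=50)

# Randomized search over a small grid, scored with an imbalance-aware metric.
search = RandomizedSearchCV(
    clf,
    param_distributions={"n_estimators": [25, 50, 100],
                         "max_features": [0.5, 0.8, 1.0]},
    n_iter=5, scoring="average_precision", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```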

Author(s):  
T. Munger ◽  
S. Desa

Abstract An important but insufficiently addressed issue for machine learning in engineering applications is the task of model selection for new problems. Existing approaches to model selection generally focus on optimizing the learning algorithm and its associated hyperparameters. However, in real-world engineering applications, parameters that are external to the learning algorithm, such as feature engineering, can also have a significant impact on the performance of the model. These external parameters do not fit into most existing approaches to model selection and are therefore often studied ad hoc or not at all. In this article, we develop a statistical design of experiments (DOE) approach to model selection based on the Taguchi method. The key idea is to use orthogonal arrays to plan a set of build-and-test experiments that study the external parameters in combination with the learning algorithm. Orthogonal arrays maximize the information learned from each experiment and therefore allow the experimental space to be explored far more efficiently than grid or random search. We demonstrate the application of the statistical DOE approach to a real-world model selection problem involving the prediction of service request escalation. Statistical DOE significantly reduced the number of experiments necessary to fully explore the external parameters for this problem and successfully optimized the model with respect to the objective function of minimizing total cost, in addition to standard evaluation metrics such as accuracy, f-measure, and g-mean.
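
A hedged illustration of the orthogonal-array idea (the factor names are hypothetical, not the article's): a standard Taguchi L8 array covers four two-level factors, here three external parameters plus the algorithm choice, in eight balanced build-and-test runs instead of all sixteen combinations.

```python
# Standard L8(2^7) orthogonal array (levels coded 0/1); we use its first
# four columns, one per factor. Any pair of columns is balanced.
L8 = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]

# Hypothetical factors: three "external" parameters plus the learner itself.
factors = {
    "n_gram_range": ["unigram", "bigram"],       # feature engineering
    "text_cleaning": ["basic", "aggressive"],    # feature engineering
    "class_weighting": ["none", "balanced"],     # data preparation
    "algorithm": ["logreg", "gbdt"],             # learning algorithm
}

names = list(factors)
for run, row in enumerate(L8, start=1):
    setting = {name: factors[name][row[i]] for i, name in enumerate(names)}
    print(f"run {run}: {setting}")  # each run is one build-and-test experiment
```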


2017 ◽  
Author(s):  
Arne Ehlers

This dissertation addresses the problem of visual object detection based on machine-learned classifiers. A distributed machine learning framework is developed to learn detectors for several object classes, creating cascaded ensemble classifiers with the Adaptive Boosting algorithm. Methods are proposed that enhance several components of an object detection framework. First, the thesis deals with augmenting the training data in order to improve the performance of object detectors learned from sparse training sets. Second, feature mining strategies are introduced to create feature sets customized to the object class to be detected; furthermore, a novel class of fractal features is proposed that can represent a wide variety of shapes. Third, a method is introduced that models and combines internal confidences and uncertainties of the cascaded detector using Dempster’s theory of evidence in order to increase the quality of the post-processing. ...
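
A schematic sketch of the cascade structure mentioned above (an assumed illustration, not the thesis implementation): each stage is an AdaBoost classifier, and a candidate window is rejected as soon as any stage's score falls below its threshold, so cheap early stages discard most background windows. The features, thresholds, and stage sizes are placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(2000, 64))            # placeholder window features
y_train = (rng.random(2000) < 0.1).astype(int)   # 1 = object, 0 = background

# Three stages of increasing size; a real cascade would retrain each stage
# on the negatives that earlier stages failed to reject.
stages = [AdaBoostClassifier(n_estimators=n).fit(X_train, y_train)
          for n in (5, 20, 80)]

def detect(window, thresholds=(0.3, 0.4, 0.5)):
    """Return True only if the window survives every cascade stage."""
    for stage, t in zip(stages, thresholds):
        if stage.predict_proba(window.reshape(1, -1))[0, 1] < t:
            return False  # early rejection: later stages never run
    return True

print(detect(rng.normal(size=64)))
```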


Stroke ◽  
2020 ◽  
Vol 51 (9) ◽  
Author(s):  
Hooman Kamel ◽  
Babak B. Navi ◽  
Neal S. Parikh ◽  
Alexander E. Merkler ◽  
Peter M. Okin ◽  
...  

Background and Purpose: One-fifth of ischemic strokes are embolic strokes of undetermined source (ESUS). Their theoretical causes can be classified as cardioembolic versus noncardioembolic. This distinction has important implications, but the proportions of the two categories are unknown. Methods: Using data from the Cornell Acute Stroke Academic Registry, we trained a machine-learning algorithm to distinguish cardioembolic versus noncardioembolic strokes, then applied the algorithm to ESUS cases to determine the predicted proportion with an occult cardioembolic source. A panel of neurologists adjudicated stroke etiologies using standard criteria. We trained a machine learning classifier using data on demographics, comorbidities, vitals, laboratory results, and echocardiograms. An ensemble predictive method including L1 regularization, a gradient-boosted decision tree ensemble (XGBoost), random forests, and multivariate adaptive regression splines was used. Random search and cross-validation were used to tune hyperparameters. Model performance was assessed using cross-validation among cases of known etiology. We applied the final algorithm to an independent set of ESUS cases to determine the predicted mechanism (cardioembolic or not). To assess our classifier’s validity, we correlated the predicted probability of a cardioembolic source with the eventual post-ESUS diagnosis of atrial fibrillation. Results: Among 1083 strokes with known etiologies, our classifier distinguished cardioembolic versus noncardioembolic cases with excellent accuracy (area under the curve, 0.85). Applied to 580 ESUS cases, the classifier predicted that 44% (95% credibility interval, 39%–49%) resulted from cardiac embolism. Individual ESUS patients’ predicted likelihood of cardiac embolism was associated with eventual atrial fibrillation detection (OR per 10% increase, 1.27 [95% CI, 1.03–1.57]; c-statistic, 0.68 [95% CI, 0.58–0.78]). ESUS patients with a high predicted probability of cardiac embolism were older and had more coronary and peripheral vascular disease, lower ejection fractions, larger left atria, lower blood pressures, and higher creatinine levels. Conclusions: A machine learning estimator that distinguished known cardioembolic versus noncardioembolic strokes indirectly estimated that 44% of ESUS cases were cardioembolic.
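
A hedged sketch of an ensemble along the lines described (not the study's code or data): an L1-penalized logistic model, gradient-boosted trees (a scikit-learn stand-in for XGBoost), and a random forest stacked into one predictor and tuned by random search; the MARS component and the clinical features are omitted as placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # placeholder clinical/echocardiogram features
y = rng.integers(0, 2, 1000)      # 1 = cardioembolic, 0 = noncardioembolic

stack = StackingClassifier(
    estimators=[
        ("l1", LogisticRegression(penalty="l1", solver="liblinear")),
        ("gbt", GradientBoostingClassifier()),  # stand-in for XGBoost
        ("rf", RandomForestClassifier()),
    ],
    cv=5,  # out-of-fold predictions feed the final combiner
)

# Random search over a few nested hyperparameters, scored by AUC.
search = RandomizedSearchCV(
    stack,
    param_distributions={"rf__n_estimators": [100, 300],
                         "gbt__learning_rate": [0.05, 0.1]},
    n_iter=4, scoring="roc_auc", cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_score_)
```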


2021 ◽  
Author(s):  
Luca Pasquini ◽  
Antonio Napolitano ◽  
Martina Lucignani ◽  
Emanuela Tagliente ◽  
Francesco Dellepiane ◽  
...  

BACKGROUND Radiomic models outperform clinical data for outcome prediction in high-grade gliomas (HGG). Many machine learning (ML) radiomic models have been developed, mostly employing single classifiers, with variable results. However, comparative analyses of different ML models on clinically relevant tasks are lacking in the literature. OBJECTIVE We aimed to compare well-established ML classifiers, including single and ensemble learners, on clinically relevant prediction tasks for HGG patients: overall survival (OS), isocitrate dehydrogenase (IDH) mutation, O-6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation, epidermal growth factor receptor (EGFR) amplification, and Ki-67 expression, based on radiomic features from conventional and advanced MRI. Our objective was to identify the most accurate algorithm for each task. METHODS 156 adult patients with a pathologic diagnosis of HGG were included. Three tumoral regions were manually segmented: contrast-enhancing tumor, necrosis, and non-enhancing tumor. Radiomic features were extracted with a custom version of Pyradiomics and selected with the Boruta algorithm. A grid search was applied within 4 repeats of K-fold cross-validation (K=10) to obtain the highest mean and lowest spread of accuracy. Model performance was assessed as the area under the receiver operating characteristic curve (AUC-ROC). RESULTS Ensemble classifiers showed the best performance across tasks. Extreme gradient boosting (xGB) obtained the highest accuracy for OS (74.5%); adaptive boosting (AB) performed best for IDH mutation (88%), MGMT methylation (71.7%), Ki-67 expression (86.6%), and EGFRvIII amplification (81.6%). CONCLUSIONS The best-performing features shed light on possible correlations between MRI and tumor histology.
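
An illustrative sketch of the tuning scheme described: a grid search scored by AUC-ROC inside 4 repeats of 10-fold cross-validation. The feature matrix, labels, and parameter grid below are placeholders, not the study's radiomic data.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(156, 50))   # e.g. 156 patients x selected radiomic features
y = rng.integers(0, 2, 156)      # e.g. MGMT methylated vs unmethylated

# 4 repeats of 10-fold CV: each parameter setting is scored on 40 folds,
# giving both a mean AUC and a spread.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=4, random_state=0)
grid = GridSearchCV(
    AdaBoostClassifier(),
    param_grid={"n_estimators": [50, 100, 200],
                "learning_rate": [0.5, 1.0]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # mean AUC-ROC across 40 folds
```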


Author(s):  
Yuehui Chen ◽  
Shigeyasu Kawaji

This paper is concerned with the learning and optimization of different basis function networks with respect to structure adaptation and parameter tuning. The basis function networks include the Volterra polynomial, Gaussian radial, B-spline, fuzzy, recurrent fuzzy, and local Gaussian basis function networks. Based on the creation and evolution of a type-constrained sparse tree, a unified framework is constructed in which structure adaptation and parameter adjustment of the different basis function networks are addressed by a hybrid learning algorithm combining a modified probabilistic incremental program evolution (MPIPE) and a random search algorithm. Simulation results for the identification of nonlinear systems show the feasibility and effectiveness of the proposed method.
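
A generic sketch of the random-search half of such a hybrid scheme (the MPIPE structure evolution is omitted, and the network and data are placeholder stand-ins): for a fixed structure, perturb the current parameter vector and keep a candidate whenever it lowers the identification error.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(params, x):
    # Stand-in for a basis function network whose structure is already fixed.
    return np.tanh(x @ params[:3]) * params[3]

x_data = rng.normal(size=(200, 3))
y_data = np.sin(x_data[:, 0])        # placeholder nonlinear system to identify

def loss(params):
    return np.mean((model(params, x_data) - y_data) ** 2)

params = rng.normal(size=4)
step = 0.5
for _ in range(2000):
    candidate = params + step * rng.normal(size=4)
    if loss(candidate) < loss(params):
        params = candidate           # accept only improving moves
    else:
        step *= 0.999                # slowly shrink the search radius
print(loss(params))
```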


Author(s):  
Shima Hajimirza

Patterned thin film structures can offer spectrally selective radiative properties that benefit many engineering applications, including photovoltaic energy conversion at extremely efficient scales. Inverse design of such structures can be expressed as an interesting optimization problem with a specific regime of complexity: a moderate number of optimization parameters but a highly time-consuming forward problem. For problems like this, a search technique that can learn and parameterize the multi-dimensional behavior of the objective function from past search points can be extremely useful in guiding the global search and expediting the solution. Based on this idea, we have developed a novel search algorithm for optimizing the absorption coefficient of visible light in a multi-layered silicon-based nano-scale thin film solar cell. The proposed optimization algorithm uses a machine-learning predictive tool called a regression tree in an intermediate step to learn (i.e., regress) the objective function from a previous generation of random search points. The fitted model is then used as a guide to resample a new generation of candidate solutions with a significantly higher average gain. This process can be repeated multiple times, and better solutions are obtained with high likelihood at each stage. Through numerical experiments, we demonstrate how, in only one resampling stage, the proposed technique dominates state-of-the-art global search algorithms such as gradient-based techniques and MCMC methods on the considered nano-design problem.
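
A hedged sketch of the surrogate-guided resampling idea (the objective below is a toy stand-in for the expensive thin-film electromagnetic solver): fit a regression tree to past (design, objective) pairs, score a large candidate pool cheaply with the tree, and spend expensive evaluations only on the most promising designs.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def expensive_objective(x):
    # Placeholder for the time-consuming forward solver.
    return -np.sum((x - 0.3) ** 2)

# Generation 1: plain random search points and their true objective values.
X = rng.random((200, 6))                       # e.g. 6 layer thicknesses
y = np.array([expensive_objective(x) for x in X])

# Learn (regress) the objective from the first generation.
surrogate = DecisionTreeRegressor(max_depth=6).fit(X, y)

# Generation 2: score many cheap candidates with the tree, keep the top 20,
# and call the expensive solver only on those.
candidates = rng.random((10000, 6))
top = candidates[np.argsort(surrogate.predict(candidates))[-20:]]
best = max(top, key=expensive_objective)
print(best, expensive_objective(best))
```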


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Peter Drotár ◽  
Marek Dobeš

Abstract Dysgraphia, a disorder affecting the written expression of symbols and words, negatively impacts the academic results of pupils as well as their overall well-being. The use of automated procedures can make dysgraphia testing available to larger populations, thereby facilitating early intervention for those who need it. In this paper, we employed a machine learning approach to identify handwriting deteriorated by dysgraphia. To achieve this goal, we collected a new handwriting dataset consisting of several handwriting tasks and extracted a broad range of features to capture different aspects of handwriting. These were fed to a machine learning algorithm to predict whether handwriting is affected by dysgraphia. We compared several machine learning algorithms and discovered that the best results were achieved by the adaptive boosting (AdaBoost) algorithm. The results show that machine learning can be used to detect dysgraphia with almost 80% accuracy, even when dealing with a heterogeneous set of subjects differing in age, sex and handedness.
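
A minimal sketch of the classification step described (the handwriting feature matrix and labels below are placeholders, not the collected dataset): AdaBoost over extracted handwriting features, evaluated by cross-validation.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))   # e.g. kinematic/pressure features per subject
y = rng.integers(0, 2, 120)      # 1 = handwriting affected by dysgraphia

scores = cross_val_score(AdaBoostClassifier(n_estimators=100), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean())             # cross-validated detection accuracy
```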


Sensors ◽  
2020 ◽  
Vol 20 (21) ◽  
pp. 6385
Author(s):  
Dragoș Nastasiu ◽  
Răzvan Scripcaru ◽  
Angela Digulescu ◽  
Cornel Ioana ◽  
Raymundo De Amorim ◽  
...  

In this study, we present the implementation of a neural network model capable of classifying radio frequency identification (RFID) tags based on their electromagnetic (EM) signature for authentication applications. One important application of chipless RFID addresses the counterfeiting threat faced by manufacturers. The goal is to design and implement chipless RFID tags that possess a unique and unclonable fingerprint to authenticate objects. As EM characteristics are employed, these fingerprints cannot easily be spoofed. A set of 18 tags operating in V band (65–72 GHz) was designed and measured. V band is more sensitive to dimensional variations than lower-frequency bands, making it suitable for highlighting the differences between EM signatures. Machine learning (ML) approaches are used to characterize and classify the 18 EM responses in order to validate the authentication method. The proposed supervised method reached a maximum recognition rate of 100%, surpassing in accuracy most related RFID fingerprinting work. To determine the best network configuration, we used a random search algorithm. Further tuning was conducted by comparing the results of different learning algorithms in terms of accuracy and loss.
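
An illustrative sketch of determining a network configuration by random search over the 18 tag classes (placeholder spectra and a generic multilayer perceptron, not the authors' architecture):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(540, 128))   # e.g. 30 measured EM responses per tag
y = np.repeat(np.arange(18), 30)  # 18 tag classes

# Random search over layer sizes, regularization, and learning rate.
search = RandomizedSearchCV(
    MLPClassifier(max_iter=500),
    param_distributions={
        "hidden_layer_sizes": [(32,), (64,), (64, 32)],
        "alpha": [1e-4, 1e-3, 1e-2],
        "learning_rate_init": [1e-3, 1e-2],
    },
    n_iter=8, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```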


2017 ◽  
Vol 32 (3) ◽  
pp. 1079-1099 ◽  
Author(s):  
Michael Sprenger ◽  
Sebastian Schemm ◽  
Roger Oechslin ◽  
Johannes Jenkner

Abstract The south foehn is a characteristic downslope windstorm in the valleys of the northern Alps in Europe that demands reliable forecasts because of its substantial economic and societal impacts. Traditionally, a foehn is predicted based on pressure differences and tendencies across the Alpine ridge. Here, a new objective method for foehn prediction is proposed based on a machine learning algorithm (called AdaBoost, short for adaptive boosting). Three years (2000–02) of hourly simulations of the Consortium for Small-Scale Modeling’s (COSMO) numerical weather prediction (NWP) model and corresponding foehn wind observations are used to train the algorithm to distinguish between foehn and nonfoehn events. The predictors (133 in total) are subjectively extracted from the 7-km COSMO reanalysis dataset based on the main characteristics of foehn flows. The performance of the algorithm is then assessed with a validation dataset based on a contingency table that concisely summarizes the cooccurrence of observed and predicted (non)foehn events. The main performance measures are probability of detection (88.2%), probability of false detection (2.9%), missing rate (11.8%), correct alarm ratio (66.2%), false alarm ratio (33.8%), and missed alarm ratio (0.8%). To gain insight into the prediction model, the relevance of the single predictors is determined, resulting in a predominance of pressure differences across the Alpine ridge (i.e., similar to the traditional methods) and wind speeds at the foehn stations. The predominance of pressure-related predictors is further established in a sensitivity experiment where ~2500 predictors are objectively incorporated into the prediction model using the AdaBoost algorithm. The performance is very similar to the run with the subjectively determined predictors. Finally, some practical aspects of the new foehn index are discussed (e.g., the predictability of foehn events during the four seasons). The correct alarm rate is highest in winter (86.5%), followed by spring (79.6%), and then autumn (69.2%). The lowest rates are found in summer (51.2%).
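
A short sketch of the verification measures named above, computed from a 2x2 contingency table of observed versus predicted (non)foehn hours (the arrays below are random placeholders):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_obs = rng.integers(0, 2, 1000)   # 1 = observed foehn hour
y_pred = rng.integers(0, 2, 1000)  # 1 = predicted foehn hour

tn, fp, fn, tp = confusion_matrix(y_obs, y_pred).ravel()
pod  = tp / (tp + fn)   # probability of detection
pofd = fp / (fp + tn)   # probability of false detection
miss = fn / (tp + fn)   # missing rate (= 1 - POD)
far  = fp / (tp + fp)   # false alarm ratio (correct alarm ratio = 1 - FAR)
print(pod, pofd, miss, far)
```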


Recycling ◽  
2019 ◽  
Vol 4 (4) ◽  
pp. 40 ◽  
Author(s):  
Florian Gruber ◽  
Wulf Grählert ◽  
Philipp Wollmann ◽  
Stefan Kaskel

This work contributes to the recycling of technical black plastic particles, for example from the automotive or electronics industries. These plastics cannot yet be sorted with sufficient purity (up to 99.9%), which often makes economical recycling impossible. As a solution to this problem, imaging fluorescence spectroscopy with additional illumination in the near-infrared spectral range, combined with classification by machine learning or deep learning algorithms, is investigated here. The algorithms used are linear discriminant analysis (LDA), k-nearest neighbour classification (kNN), support vector machines (SVM), ensemble models with decision trees (ENSEMBLE), and convolutional neural networks (CNNs). The CNNs in particular attempt to increase the overall classification accuracy by taking the shape of the plastic particles into account. In addition, automatic optimization of the hyperparameters of the classification algorithms by random search was investigated, with the aim of increasing the accuracy of the classification models. About 400 particles each of 14 plastics from 12 plastic classes were examined. An overall model for the classification of all 12 plastics was trained; the CNNs achieved the highest overall classification accuracy, 93.5%. In a further experiment, 41 mixtures of industrially relevant plastics, each containing at most three plastic classes, were classified; the ENSEMBLE, SVM, and CNN algorithms each achieved the same average classification accuracy of 99.0%. The target overall classification accuracy of 99.9% was achieved for 18 of the 41 mixtures. The results show that the presented method is a promising approach for sorting black technical plastic waste.
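
A compact sketch of the model comparison described (placeholder spectra stand in for the fluorescence/NIR image data, the CNN is omitted, and a random forest stands in for the ENSEMBLE model):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 60))     # e.g. 100 particles x 12 classes, 60 bands
y = np.repeat(np.arange(12), 100)   # 12 plastic classes

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "ENSEMBLE": RandomForestClassifier(),  # decision-tree ensemble stand-in
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=3).mean()
    print(f"{name}: {acc:.3f}")
```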

