Identifying Ransomware Actors in the Bitcoin Network

2021 ◽  
Author(s):  
Siddhartha Dalal ◽  
Zihe Wang ◽  
Siddhanth Sabharwal

Due to the pseudo-anonymity of the Bitcoin network, users can hide behind bitcoin addresses that can be generated in unlimited quantity, on the fly, without any formal links between them. The network is therefore used for payment transfer by actors involved in ransomware and other illegal activities. The other activity we consider is gambling, since gambling is often used for transferring illegal funds. The question addressed here is: given temporally limited graphs of Bitcoin transactions, to what extent can one identify common patterns associated with these fraudulent activities and apply them to find other ransomware actors? The problem is rather complex, given that thousands of addresses can belong to the same actor without any obvious links between them or any common pattern of behavior. The main contribution of this paper is to introduce and apply new algorithms for local clustering and supervised graph machine learning to identify malicious actors. We show that very local subgraphs of known such actors are sufficient to differentiate between ransomware, random, and gambling actors with 85% prediction accuracy on the test data set.
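The features in this abstract come from small local subgraphs around known actor addresses. As an illustration only (a generic sketch, not the authors' algorithm), the following extracts a k-hop ego subgraph from a toy transaction graph with breadth-first search and summarizes it with simple structural features; the adjacency dictionary and feature set are invented for the example:

```python
from collections import deque

def ego_subgraph(adj, seed, hops=2):
    """Collect the addresses within `hops` BFS steps of a seed address."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

def subgraph_features(adj, nodes):
    """Summary features of the induced subgraph: size, edge count, density."""
    edges = sum(1 for u in nodes for v in adj.get(u, ()) if v in nodes)
    n = len(nodes)
    density = edges / (n * (n - 1)) if n > 1 else 0.0
    return {"nodes": n, "edges": edges, "density": density}

# Toy directed transaction graph: address -> addresses it paid.
adj = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
}
nodes = ego_subgraph(adj, "A", hops=2)
print(subgraph_features(adj, nodes))
```

Feature vectors like these, one per known actor, could then feed any standard supervised classifier.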

Author(s):  
Yanxiang Yu ◽  
◽  
Chicheng Xu ◽  
Siddharth Misra ◽  
Weichang Li ◽  
...  

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well tie. However, these two logs are often missing or incomplete in many oil and gas wells. Therefore, many petrophysical and geophysical workflows include sonic log synthetization or pseudo-log generation based on multivariate regression or rock physics relations. From March 1 to May 7, 2020, the SPWLA PDDA SIG hosted a contest aiming to predict the DTC and DTS logs from seven “easy-to-acquire” conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total of 20,525 data points with half-foot resolution from three wells was collected to train regression models using machine-learning techniques. Each data point had seven features, consisting of the conventional “easy-to-acquire” logs: caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density, as well as the two sonic logs (DTC and DTS) as the target. A separate data set of 11,089 samples from a fourth well was then used as the blind test data set. The prediction performance of the models was evaluated using root mean square error (RMSE) as the metric: RMSE = sqrt( (1/(2m)) * sum_{i=1}^{m} [ (DTC_pred_i - DTC_true_i)^2 + (DTS_pred_i - DTS_true_i)^2 ] ). In the benchmark model (Yu et al., 2020), we used a Random Forest regressor with minimal preprocessing of the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest, on average, beat the performance of our benchmark model by 27% in the RMSE score. In this paper, we review these five solutions, including preprocessing techniques and different machine-learning models, including neural networks, long short-term memory (LSTM), and ensemble trees.
We found that data cleaning and clustering were critical for improving the performance in all models.
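The contest metric can be written out directly. This is a generic implementation of the stated RMSE formula over the two sonic logs; the toy log values are invented:

```python
import math

def rmse_dual(dtc_pred, dtc_true, dts_pred, dts_true):
    """Joint RMSE over the two sonic logs, per the contest metric:
    sqrt((1/(2m)) * sum of squared DTC and DTS errors)."""
    m = len(dtc_true)
    total = sum(
        (cp - ct) ** 2 + (sp - st) ** 2
        for cp, ct, sp, st in zip(dtc_pred, dtc_true, dts_pred, dts_true)
    )
    return math.sqrt(total / (2 * m))

# Toy example with three samples (us/ft values are made up).
print(rmse_dual([60, 70, 80], [62, 69, 81], [100, 110, 120], [103, 108, 118]))
```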


2020 ◽  
Author(s):  
Saeed Nosratabadi ◽  
Amir Mosavi ◽  
Puhong Duan ◽  
Pedram Ghamisi ◽  
Filip Ferdinand ◽  
...  

Abstract This paper provides the state of the art of data science in economics. Through a novel taxonomy of applications and methods, advances in data science are investigated in three individual classes: deep learning models, ensemble models, and hybrid models. Application domains include the stock market, marketing, e-commerce, corporate banking, and cryptocurrency. The PRISMA method, a systematic literature review methodology, is used to ensure the quality of the survey. The findings reveal a trend toward hybrid models, as more than 51% of the reviewed articles applied a hybrid model. It is also found that, based on the RMSE accuracy metric, hybrid models had higher prediction accuracy than other algorithms, although the trend is expected to move toward the advancement of deep learning models.


Cancers ◽  
2021 ◽  
Vol 13 (20) ◽  
pp. 5047
Author(s):  
Santiago Cepeda ◽  
Angel Pérez-Nuñez ◽  
Sergio García-García ◽  
Daniel García-Pérez ◽  
Ignacio Arrese ◽  
...  

Radiomics, in combination with artificial intelligence, has emerged as a powerful tool for the development of predictive models in neuro-oncology. Our study aims to answer a clinically relevant question: is there a radiomic profile that can identify glioblastoma (GBM) patients with short-term survival after complete tumor resection? A retrospective study of GBM patients who underwent surgery was conducted in two institutions between January 2019 and January 2020, along with cases from public databases. Cases with gross total or near total tumor resection were included. Preoperative structural multiparametric magnetic resonance imaging (mpMRI) sequences were pre-processed, and a total of 15,720 radiomic features were extracted. After feature reduction, machine learning-based classifiers were used to predict early mortality (<6 months). Additionally, a survival analysis was performed using the random survival forest (RSF) algorithm. A total of 203 patients were enrolled in this study. In the classification task, the naive Bayes classifier obtained the best results on the test data set, with an area under the curve (AUC) of 0.769 and a classification accuracy of 80%. The RSF model allowed the stratification of patients into low- and high-risk groups. On the test data set, this model obtained values of C-Index = 0.61, IBS = 0.123 and an integrated AUC at six months of 0.761. In this study, we developed a reliable predictive model of short-term survival in GBM by applying open-source and user-friendly computational means. These new tools will assist clinicians in adapting therapeutic approaches to individual patient characteristics.
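The C-index reported for the RSF model can be illustrated with a minimal pure-Python sketch of Harrell's concordance index (a generic implementation, not the study's code; the toy cohort is invented):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: the fraction of comparable patient pairs in which
    the higher-risk patient has the shorter observed survival time.
    Ties in risk score count as 0.5."""
    concordant, permissible = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if the earlier time is an observed event.
            if times[i] < times[j] and events[i]:
                permissible += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / permissible

# Toy cohort: survival in months, event indicator (1 = death), model risk score.
times = [3, 5, 8, 12]
events = [1, 1, 0, 1]
risks = [0.9, 0.2, 0.4, 0.7]
print(concordance_index(times, events, risks))
```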



2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background The use of machine learning techniques is increasing in healthcare, allowing health outcomes to be estimated and predicted from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last two years. Methods We selected a final data set from a population-based epidemiological cohort (CONSTANCES) linked with the French National Health Database (SNDS). To develop this algorithm, we adopted a supervised ML approach with the following steps: i. selection of the final data set; ii. target definition; iii. coding of variables for a given window of time; iv. splitting of the final data into training and test data sets; v. variable selection; vi. model training; vii. validation of the model with the test data set; and viii. selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Of the 3468 variables from the SNDS linked to the CONSTANCES cohort that were coded, 23 were selected to train different algorithms. The final algorithm to estimate the incidence of diabetes was a linear discriminant analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts, and hospitalization without a procedure over the last two years. This algorithm has a sensitivity of 62%, a specificity of 67%, and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we developed and applied, for the first time, a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we developed has moderate performance.
The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various ML techniques to estimate the incidence of various health conditions.
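The AUC used here to select the best algorithm can be sketched as the Mann-Whitney rank statistic; this is a generic illustration with invented labels and scores, not the study's pipeline:

```python
def auc(labels, scores):
    """AUC computed as the Mann-Whitney U statistic: the probability that a
    randomly chosen positive case outranks a randomly chosen negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy incidence example: 1 = incident case, score = predicted probability.
labels = [1, 1, 0, 0, 0]
scores = [0.8, 0.4, 0.6, 0.3, 0.1]
print(auc(labels, scores))
```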


2021 ◽  
Vol 3 (1) ◽  
pp. 49-55
Author(s):  
R. O. Tkachenko ◽  
◽  
I. V. Izonіn ◽  
V. M. Danylyk ◽  
V. Yu. Mykhalevych ◽  
...  

Improving prediction accuracy with artificial intelligence tools is an important task in various industries, in economics, and in medicine. Ensemble learning is one possible way to address this task. In particular, stacking models built from different machine learning methods, or from different parts of an existing data set, demonstrate high prediction accuracy. However, the need for proper selection of ensemble members, their optimal parameters, etc., makes the construction of such models time-consuming. This paper proposes a slightly different approach to building a simple but effective ensemble method. The authors developed a new stacking model of nonlinear SGTM neural-like structures, which is based on the use of only one type of ANN as the element base of the ensemble and on the use of the same training sample for all members of the ensemble. This approach provides a number of advantages over procedures for building ensembles from different machine learning methods, at least with respect to selecting the optimal parameters for each of them. In our case, a tuple of random hyperparameters for each individual member was used as the basis of the ensemble. That is, each combined SGTM neural-like structure with an additional RBF layer, as a separate member of the ensemble, is trained using different, randomly selected values of the RBF centers and centers of mass. This provides the necessary variety among ensemble elements. Experimental studies on the effectiveness of the developed ensemble were conducted using a real data set. The task is to predict the amount of health insurance costs based on a number of independent attributes. The optimal number of ensemble members, which provides the highest prediction accuracy, was determined experimentally. The results of the developed ensemble are compared with those of existing methods of this class.
The developed ensemble achieves the highest prediction accuracy with a satisfactory training time.
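The core idea, ensemble members that share one model family and one training sample but differ only in randomly chosen RBF centers, can be sketched as follows. A simple RBF smoother stands in for the SGTM neural-like structure here, which is purely an assumption for illustration; the data are invented:

```python
import math
import random

def make_member(X, y, n_centers, rng):
    """One ensemble member: an RBF smoother whose centers are a random
    subset of the training points (the member's random hyperparameters)."""
    idx = rng.sample(range(len(X)), n_centers)
    centers = [(X[i], y[i]) for i in idx]

    def predict(x):
        weights = [math.exp(-(x - c) ** 2) for c, _ in centers]
        total = sum(weights)
        return sum(w * t for w, (_, t) in zip(weights, centers)) / total

    return predict

def ensemble_predict(members, x):
    """Average the member predictions -- the stacking step in its simplest form."""
    return sum(m(x) for m in members) / len(members)

rng = random.Random(0)
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, 4.0, 9.0, 16.0]   # toy target: y = x^2
members = [make_member(X, y, n_centers=3, rng=rng) for _ in range(5)]
print(ensemble_predict(members, 2.0))
```

All members see the same training sample; only the random center choice differentiates them, which is the variety mechanism described above.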


2021 ◽  
Vol 73 (11) ◽  
pp. 65-66
Author(s):  
Chris Carpenter

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 203962, “Upscaling of Realistic Discrete Fracture Simulations Using Machine Learning,” by Nikolai Andrianov, SPE, Geological Survey of Denmark and Greenland, prepared for the 2021 SPE Reservoir Simulation Conference, Galveston, Texas, 4–6 October. The paper has not been peer reviewed. Upscaling of discrete fracture networks to continuum models such as the dual-porosity/dual-permeability (DP/DP) model is an industry-standard approach in modeling fractured reservoirs. In the complete paper, the author parametrizes the fine-scale fracture geometries and assesses the accuracy of several convolutional neural networks (CNNs) in learning the mapping between this parametrization and the DP/DP model closures. The accuracy of the DP/DP results with the predicted model closures was assessed by comparison with the corresponding fine-scale discrete fracture matrix (DFM) simulation of two-phase flow in a realistic fracture geometry. The DP/DP results matched the DFM reference solution well. The DP/DP model also was significantly faster than the DFM simulation. Introduction The goal of this study was to evaluate the effect of different CNN architectures on the prediction accuracy for the DP/DP model closures and on the accuracy of DP/DP simulations in comparison with fine-scale DFM simulations. As a starting point, two CNN configurations were considered that have achieved breakthrough performance in image-classification tasks. The author adapted these architectures to the problem of learning the mapping between pixelated fracture geometries and the DP/DP model closures and identified several key features of the CNN structure that are crucial for achieving high prediction accuracy. Mapping of fracture geometries requires significant effort, which limits the possibilities for creating large training data sets with realistic fracture geometries.
The author, therefore, used the synthetic random linear fractures data set to train the CNNs and the fracture geometry from the Lägerdorf outcrop for testing purposes. It was demonstrated that an optimal CNN configuration yielded DP/DP model closures such that the corresponding DP/DP results matched the two-phase DFM simulations well on a subset of the Lägerdorf data. The run times for the DP/DP model were a fraction of the time needed for the DFM simulations. The problem formulation is presented in a series of equations in the complete paper.
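The CNNs above take pixelated fracture geometries as input. As a toy illustration of the basic building block only (not the author's architecture), here is a valid-mode 2-D convolution applied to a binary fracture map; the map and kernel are invented:

```python
def conv2d(image, kernel):
    """Valid-mode 2-D convolution (cross-correlation), the basic CNN building
    block, applied to a pixelated fracture map (1 = fracture, 0 = matrix)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x4 binary fracture map with a vertical fracture in column 1.
fracture = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
# 2x2 averaging kernel: a crude local fracture-density feature.
kernel = [[0.25, 0.25], [0.25, 0.25]]
print(conv2d(fracture, kernel))
```

A real CNN stacks many such filters with learned weights and nonlinearities; the sketch only shows how a pixelated geometry becomes a feature map.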


2020 ◽  
Vol 35 ◽  
pp. 153331752092716
Author(s):  
Jin-Hyuck Park

Background: The mobile screening test system for mild cognitive impairment (mSTS-MCI) was developed and validated to address the low sensitivity and specificity of the Montreal Cognitive Assessment (MoCA), which is widely used clinically. Objective: This study aimed to evaluate the efficacy of machine learning algorithms based on the mSTS-MCI and the Korean version of the MoCA. Method: In total, 103 healthy individuals and 74 patients with MCI were randomly divided into training and test data sets. The algorithm, implemented in TensorFlow, was trained on the training data set, and its accuracy was then calculated on the test data set. The cost was calculated via logistic regression. Result: The predictive power of the algorithms was higher than that of the original tests. In particular, the algorithm based on the mSTS-MCI showed the highest positive predictive value. Conclusion: The machine learning algorithms predicting MCI showed findings comparable to those of the conventional screening tools.
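The abstract notes that the cost was calculated via logistic regression. A self-contained sketch of a one-feature logistic classifier with a cross-entropy cost, trained by plain gradient descent on an invented toy screening data set (not the mSTS-MCI data or the study's TensorFlow model):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(w, b, X, y):
    """Mean cross-entropy cost, the quantity minimized during training."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi + b)
        total += -(yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps))
    return total / len(X)

def train(X, y, lr=0.5, steps=2000):
    """Plain gradient descent on the cost above."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(w * xi + b) - yi
            gw += err * xi
            gb += err
        w -= lr * gw / len(X)
        b -= lr * gb / len(X)
    return w, b

# Toy screening scores: higher score -> impaired (label 1).
X = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
y = [0, 0, 0, 1, 1, 1]
w, b = train(X, y)
preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in X]
print(preds)
```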


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246640
Author(s):  
Tomohisa Seki ◽  
Yoshimasa Kawazoe ◽  
Kazuhiko Ohe

Risk assessment of the in-hospital mortality of patients at the time of hospitalization is necessary for determining the scale of medical resources required for a patient, depending on the patient’s severity. Because recent machine learning applications in the clinical area have been shown to enhance prediction ability, applying this technique to this issue can lead to an accurate prediction model for in-hospital mortality. In this study, we aimed to generate an accurate prediction model of in-hospital mortality using machine learning techniques. Patients 18 years of age or older admitted to the University of Tokyo Hospital between January 1, 2009 and December 26, 2017 were included in this study. The data were divided into a training/validation data set (n = 119,160) and a test data set (n = 33,970) according to the time of admission. The prediction target of the model was in-hospital mortality within 14 days. To generate the prediction model, 25 variables (age, sex, 21 laboratory test items, length of stay, and mortality) were used to predict in-hospital mortality. Logistic regression, random forests, multilayer perceptron, and gradient boosting decision trees were used to generate the prediction models. To evaluate the prediction capability of each model, the model was tested using the test data set. Mean probabilities obtained from models trained with five-fold cross-validation were used to calculate the area under the receiver operating characteristic (AUROC) curve. In the test stage using the test data set, the prediction models of in-hospital mortality within 14 days showed AUROC values of 0.936, 0.942, 0.942, and 0.938 for logistic regression, random forests, multilayer perceptron, and gradient boosting decision trees, respectively.
Machine learning-based prediction of short-term in-hospital mortality using admission laboratory data showed outstanding prediction capability and, therefore, has the potential to be useful for the risk assessment of patients at the time of hospitalization.
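The evaluation step described above averages probabilities from models trained with five-fold cross-validation. A minimal sketch of the fold split and the averaging step, with invented per-fold probabilities for four hypothetical test-set patients:

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mean_fold_probabilities(fold_probs):
    """Average the per-fold predicted probabilities for each test-set patient."""
    k = len(fold_probs)
    return [sum(ps) / k for ps in zip(*fold_probs)]

# Predicted probabilities for 4 test patients from 5 fold-trained models.
fold_probs = [
    [0.10, 0.80, 0.30, 0.95],
    [0.12, 0.75, 0.35, 0.90],
    [0.08, 0.85, 0.25, 0.92],
    [0.11, 0.78, 0.32, 0.97],
    [0.09, 0.82, 0.28, 0.91],
]
print(mean_fold_probabilities(fold_probs))
```

The averaged probabilities would then feed an AUROC computation against the observed outcomes.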


Author(s):  
Dominic M. Obare ◽  
Gladys G. Njoroge ◽  
Moses M. Muraya

Financial institutions hold large amounts of data on their borrowers, which can be used to predict the probability of a borrower defaulting on a loan. Models that have been used to predict individual loan defaults include linear discriminant analysis models and extreme value theory models. These models are parametric in nature, since they assume that the response being investigated takes a particular functional form. However, there is a possibility that the functional form used to estimate the response is very different from the actual functional form of the response. The purpose of this research was to analyze individual loan defaults in Kenya using the logistic regression model. The data used in this study were obtained from Equity Bank of Kenya for the period 2006 to 2016. A random sample of 1000 loan applicants whose loans had been approved by Equity Bank of Kenya during this period was obtained. The data covered credit history, purpose of the loan, loan amount, nature of the savings account, employment status, sex of the applicant, age of the applicant, security used when acquiring the loan, and the area of residence of the applicant (rural or urban). This study employed a quantitative research design; it deals with individual loan defaults as group characteristics of a borrower. The data were pre-processed by seeding using R software and then split into training and test data sets. The training data were used to fit the logistic regression model using a supervised machine learning approach. R statistical software was used for the analysis of the data. The test data set was used to cross-validate the developed logistic model, which was later used for the prediction of individual loan defaults. This study focused on the analysis of individual loan defaults in Kenya using the logistic regression model in machine learning.
On the training data set, the logistic regression model correctly predicted 303 defaults and 122 non-defaults, with 56 and 69 loans misclassified. The model had an accuracy of 0.7727 on the training data and 0.7333 on the test data, and a precision of 0.8440 and 0.8244 on the training and test data, respectively. The performance of the model on both the training and test data was illustrated by plotting training errors and test errors against sample size on the same axes; the plot showed that the performance of the model improves as the sample size increases. The study recommended the use of logistic regression in conjunction with a supervised machine learning approach for loan default prediction in financial institutions, and that more research be carried out on ensemble methods for loan default prediction in order to increase prediction accuracy.
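The reported training-set accuracy and precision can be reproduced directly from the stated counts; which of the two misclassification counts is false positives versus false negatives is our assumption for the sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy and precision from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    return accuracy, precision

# Training-set counts reported in the study: 303 correctly predicted defaults,
# 122 correctly predicted non-defaults, and 56 + 69 misclassified loans.
acc, prec = classification_metrics(tp=303, tn=122, fp=56, fn=69)
print(round(acc, 4), round(prec, 4))
```

Rounded to four decimals, these counts give 0.7727 accuracy and 0.8440 precision, matching the reported training-set values.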

