Bitcoin and Cybersecurity: Temporal Dissection of Blockchain Data to Unveil Changes in Entity Behavioral Patterns

2019, Vol 9 (23), pp. 5003
Author(s): Francesco Zola, Jan Lukas Bruse, Maria Eguimendia, Mikel Galar, Raul Orduna Urrutia

The Bitcoin network is not only vulnerable to cyber-attacks but also currently the most frequently used cryptocurrency for concealing illicit activities. Typically, Bitcoin activity is monitored by reducing the anonymity of its entities using machine learning-based techniques that consider the whole blockchain. This entails two issues: first, it increases the complexity of the analysis, requiring greater effort, and second, it may hide network micro-dynamics that are important for detecting short-term changes in entity behavioral patterns. The aim of this paper is to address both issues by performing a “temporal dissection” of the Bitcoin blockchain, i.e., dividing it into smaller temporal batches to achieve entity classification. The idea is that a machine learning model trained on a certain time interval (batch) should achieve good classification performance when tested on another batch if entity behavioral patterns are similar. We apply cascading machine learning principles, a type of ensemble learning based on stacking techniques, and introduce a “k-fold cross-testing” concept across batches of varying size. Results show that the blockchain batch size used for entity classification could be reduced for certain classes (Exchange, Gambling, and eWallet), as classification rates did not vary significantly with batch size, suggesting that behavioral patterns did not change significantly over time. Detection of the Mixer and Market classes, however, can be negatively affected. A deeper analysis of Mining Pool behavior showed that models trained on recent data perform better than models trained on older data, suggesting that “typical” Mining Pool behavior may be represented better by recent data. This work provides a first step towards uncovering entity behavioral changes via temporal dissection of blockchain data.
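
As a rough illustration of the cross-batch evaluation idea described above (train a model on one temporal batch, test it on the others), and not the authors' exact cascading/stacking pipeline, the following Python sketch uses a generic classifier; the feature matrix, labels, and batching step are hypothetical placeholders.

```python
# Minimal sketch of cross-batch ("k-fold cross-testing") evaluation:
# train on one temporal batch, test on every other batch.
# Features, labels, and the batching itself are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Pretend each batch holds entity-level features aggregated over one time interval.
n_batches, n_entities, n_features = 4, 500, 12
batches = [
    (rng.normal(size=(n_entities, n_features)),    # X: entity features
     rng.integers(0, 3, size=n_entities))          # y: entity class (e.g. Exchange/Gambling/eWallet)
    for _ in range(n_batches)
]

# Cross-testing: a model trained on batch i is evaluated on every batch j != i.
scores = np.zeros((n_batches, n_batches))
for i, (X_train, y_train) in enumerate(batches):
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    for j, (X_test, y_test) in enumerate(batches):
        if i != j:
            scores[i, j] = f1_score(y_test, model.predict(X_test), average="macro")

# Scores that stay stable across test batches j would suggest entity behaviour
# changed little between time intervals.
print(np.round(scores, 3))
```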

2021
Author(s): Anna Goldenberg, Bret Nestor, Jaryd Hunter, Raghu Kainkaryam, Erik Drysdale, ...

Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world deployment. Several points must be addressed to transition from statistically significant differences between infected and uninfected cohorts to COVID-19 inferences on individuals. We demonstrate the strengths and shortcomings of existing approaches on a cohort of 32,198 individuals who experienced influenza-like illness (ILI), 204 of whom reported testing positive for COVID-19. We show that, although commonly made design mistakes result in overestimation of performance, properly designed wearable-based models can be used effectively as part of the detection pipeline. For example, knowing the week of the year, combined with naive randomised test-set generation, leads to substantial overestimation of COVID-19 classification performance, at 0.73 AUROC. However, an average AUROC of only 0.55 ± 0.02 would be attainable in a simulation of real-world deployment, due to the shifting prevalence of COVID-19 and non-COVID-19 ILI used to trigger further testing. In this work we show how to train a machine learning model to differentiate ILI days from healthy days, followed by a survey to differentiate COVID-19 from influenza and unspecified ILI based on symptoms. In a forthcoming week, such models can expect a sensitivity of 0.50 (0-0.74, 95% CI) while utilising the wearable device to reduce the burden of surveys by 35%. The corresponding false positive rate is 0.22 (0.02-0.47, 95% CI). In the future, serious consideration must be given to the design, evaluation, and reporting of wearable device interventions if they are to be relied upon as part of frequent COVID-19 or other public health threat testing infrastructures.
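
A minimal sketch of the evaluation pitfall discussed above: a randomly shuffled test split lets a model exploit calendar-driven prevalence shifts, whereas a chronological (deployment-style) split does not. The synthetic data, the single wearable feature, and the split point are assumptions for illustration only, not the study's cohort or pipeline.

```python
# Sketch: random vs. chronological evaluation of a wearable-based ILI/COVID classifier.
# Synthetic data only; features and prevalence drift are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_days = 5000
week_of_year = rng.integers(0, 52, size=n_days)
resting_hr_change = rng.normal(size=n_days)          # hypothetical wearable feature
# Label prevalence drifts with calendar time, which is what random splits can exploit.
label = (rng.random(n_days) < 0.05 + 0.3 * (week_of_year > 40)).astype(int)
X = np.column_stack([week_of_year, resting_hr_change])

# Naive random split: week-of-year leaks prevalence information into the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, label, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_random = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Chronological split: train on earlier weeks, test on later ones (deployment-style).
order = np.argsort(week_of_year)
split = int(0.7 * n_days)
tr, te = order[:split], order[split:]
model = LogisticRegression(max_iter=1000).fit(X[tr], label[tr])
auc_chrono = roc_auc_score(label[te], model.predict_proba(X[te])[:, 1])

print(f"random split AUROC:        {auc_random:.2f}")
print(f"chronological split AUROC: {auc_chrono:.2f}")
```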


Electronics, 2020, Vol 9 (2), pp. 219
Author(s): Sweta Bhattacharya, Siva Rama Krishnan S, Praveen Kumar Reddy Maddikunta, Rajesh Kaluri, Saurabh Singh, ...

The enormous popularity of the internet across all spheres of human life has introduced various risks of malicious attacks in the network. Malicious activities can proliferate over the network effortlessly, which has led to the emergence of intrusion detection systems. Attack patterns are also dynamic, which necessitates efficient classification and prediction of cyber attacks. In this paper, we propose a hybrid principal component analysis (PCA)-firefly based machine learning model to classify intrusion detection system (IDS) datasets. The dataset used in the study was collected from Kaggle. The model first performs one-hot encoding to transform the IDS datasets. The hybrid PCA-firefly algorithm is then used for dimensionality reduction. The XGBoost algorithm is applied to the reduced dataset for classification. A comprehensive evaluation against state-of-the-art machine learning approaches is conducted to justify the superiority of the proposed approach. The experimental results confirm that the proposed model performs better than existing machine learning models.
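
The following sketch mirrors the general pipeline described above (one-hot encoding, dimensionality reduction, XGBoost classification), with ordinary PCA standing in for the paper's PCA-firefly hybrid; the dataset, column names, and component count are placeholders, not the Kaggle IDS data.

```python
# Sketch of the described pipeline: one-hot encoding -> dimensionality reduction -> XGBoost.
# Plain PCA is used here; the paper tunes the reduction step with a firefly algorithm.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Hypothetical IDS-style records: categorical protocol/service fields plus numeric counters.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "protocol": rng.choice(["tcp", "udp", "icmp"], size=n),
    "service": rng.choice(["http", "ftp", "dns", "ssh"], size=n),
    "duration": rng.exponential(1.0, size=n),
    "src_bytes": rng.exponential(500.0, size=n),
})
y = rng.integers(0, 2, size=n)  # 0 = normal, 1 = attack (placeholder labels)

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["protocol", "service"]),
    ("scale", StandardScaler(), ["duration", "src_bytes"]),
], sparse_threshold=0.0)  # keep output dense so PCA can consume it

pipeline = Pipeline([
    ("pre", pre),
    ("pca", PCA(n_components=5)),   # the firefly search would select/weight components here
    ("clf", XGBClassifier(n_estimators=200)),
])

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.25, random_state=0)
pipeline.fit(X_tr, y_tr)
print("held-out accuracy:", pipeline.score(X_te, y_te))
```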


2021, Vol 22 (1)
Author(s): Pegah Mavaie, Lawrence Holder, Daniel Beck, Michael K. Skinner

Background: Deep learning is an active area of artificial intelligence in bioinformatics that is useful for solving many biological problems, including predicting altered epigenetics such as DNA methylation regions. Deep learning (DL) can learn an informative representation that addresses the need to define relevant features. However, deep learning models are computationally expensive, and they require large training datasets to achieve good classification performance. Results: One approach to addressing these challenges is to use a less complex deep learning network for feature selection and machine learning (ML) for classification. In the current study, we introduce a hybrid DL-ML approach that uses a deep neural network for extracting molecular features and a non-DL classifier to predict environmentally responsive transgenerational differential DNA methylated regions (DMRs), termed epimutations, based on the extracted DL-based features. Various environmental toxicant-induced epigenetic transgenerational inheritance sperm epimutations were used to train the model on the rat genome DNA sequence, and the model was then used to predict transgenerational DMRs (epimutations) across the entire genome. Conclusion: The approach was also used to predict potential DMRs in the human genome. Experimental results show that the hybrid DL-ML approach outperforms deep learning and traditional machine learning methods.
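
A minimal sketch of the hybrid pattern described above: a small neural network is trained end-to-end, its hidden representation is then used as features, and a conventional classifier makes the final call. The architecture, the one-hot sequence encoding, the window size, and the random forest head are illustrative assumptions, not the authors' network.

```python
# Sketch of the hybrid DL-ML pattern: a neural network supplies learned features,
# and a non-DL classifier (random forest) makes the final DMR/non-DMR prediction.
# Architecture, input encoding, and sizes are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Hypothetical one-hot encoded DNA windows: (samples, 4 bases * window length).
n_samples, window = 1000, 200
X = rng.integers(0, 2, size=(n_samples, 4 * window)).astype(np.float32)
y = rng.integers(0, 2, size=n_samples)  # 1 = DMR (epimutation), 0 = background

# Feature extractor: everything except the final classification layer.
feature_net = nn.Sequential(
    nn.Linear(4 * window, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),
)
head = nn.Linear(32, 2)

# Briefly train the network end-to-end (a real run would train far longer).
model = nn.Sequential(feature_net, head)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
Xt, yt = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(Xt), yt)
    loss.backward()
    opt.step()

# Use the trained hidden representation as features for a non-DL classifier.
with torch.no_grad():
    features = feature_net(Xt).numpy()
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(features, y)
print("training accuracy of RF on DL features:", clf.score(features, y))
```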


PLoS ONE, 2021, Vol 16 (11), pp. e0259907
Author(s): Yuji Oyamada, Ryo Ozuru, Toshiyuki Masuzawa, Satoshi Miyahara, Yasuhiko Nikaido, ...

Leptospirosis is a zoonosis caused by the pathogenic bacterium Leptospira. The Microscopic Agglutination Test (MAT) is widely used as the gold standard for diagnosis of leptospirosis. In this method, diluted patient serum is mixed with serotype-determined Leptospires, and the presence or absence of aggregation is determined under a dark-field microscope to calculate the antibody titer. Problems with the current MAT method are (1) the requirement to examine many specimens per sample and (2) the need to distinguish contaminants from true aggregates to accurately identify positivity. Therefore, increasing efficiency and accuracy is key to refining MAT. It is possible to achieve efficiency and standardize accuracy at the same time by automating the decision-making process. In this study, we built an automatic identification algorithm for MAT using a machine learning method to determine agglutination in microscopic images. The model learned features from 316 positive and 230 negative MAT images created with sera of Leptospira-infected (positive) and non-infected (negative) hamsters, respectively. In addition to the acquired original images, wavelet-transformed images were also considered as features. We used a support vector machine (SVM) as the proposed decision method. We validated the trained SVMs with 210 positive and 154 negative images. When the features were obtained directly from the original or wavelet-transformed images, all negative images were misjudged as positive, and the classification performance was very poor, with a sensitivity of 1 and a specificity of 0. In contrast, when histograms of the wavelet coefficients were used as features, the performance improved greatly, with a sensitivity of 0.99 and a specificity of 0.99. We confirmed that the current algorithm correctly judges agglutination in MAT images as positive or negative, opening the further possibility of automating the MAT procedure.
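
As an illustration of the feature construction that worked best above (histograms of 2-D wavelet coefficients fed to an SVM), the following sketch uses PyWavelets and scikit-learn on synthetic grayscale images; the image generator, wavelet choice, and histogram binning are assumptions, not the study's dark-field micrographs.

```python
# Sketch: histograms of 2-D wavelet coefficients as image features for an SVM.
# Synthetic images stand in for dark-field MAT micrographs; wavelet/bins are assumptions.
import numpy as np
import pywt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def wavelet_histogram_features(image, wavelet="haar", bins=32):
    """Decompose an image and concatenate histograms of each coefficient band."""
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    feats = []
    for band in (cA, cH, cV, cD):
        hist, _ = np.histogram(band, bins=bins, density=True)
        feats.append(hist)
    return np.concatenate(feats)

def make_image(positive):
    """Hypothetical image: 'positive' samples carry extra blob-like texture."""
    img = rng.normal(0.2, 0.05, size=(64, 64))
    if positive:
        img += rng.normal(0.0, 0.4, size=(64, 64)) * (rng.random((64, 64)) < 0.1)
    return img

labels = rng.integers(0, 2, size=400)
X = np.array([wavelet_histogram_features(make_image(bool(l))) for l in labels])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("held-out accuracy:", svm.score(X_te, y_te))
```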


2020, Vol 10 (10), pp. 3430
Author(s): Xavier Larriva-Novo, Mario Vega-Barbas, Víctor A. Villagrá, Diego Rivera, Manuel Álvarez-Campana, ...

New computational and technological paradigms that currently guide developments in the information society, such as the Internet of Things, pervasive technology, and ubiquitous computing (Ubicomp), favor the appearance of new intrusion vectors that can directly affect people’s daily lives. This, together with advances in the techniques and methods used for developing new cyber-attacks, exponentially increases the number of cyber threats affecting the information society. Because of this, the development and improvement of technology that assists cybersecurity experts in preventing and detecting attacks has arisen as a fundamental pillar in the field of cybersecurity. Specifically, intrusion detection systems are now a fundamental tool in the provision of services through the internet. However, these systems have certain limitations, such as false positives and constraints on real-time analytics, which require their operation to be supervised. Therefore, it is necessary to offer architectures and systems that favor an efficient analysis of the data handled by these tools. In this sense, this paper presents a new data preprocessing model based on a novel distributed computing architecture focused on large-scale datasets such as UGR’16. In addition, the paper analyzes the use of machine learning techniques in order to improve the response and efficiency of the proposed preprocessing model. The solution developed thus achieves good results in terms of computational performance. Finally, the proposal shows the suitability of decision tree algorithms for training a machine learning model on a large dataset, when compared with a multilayer perceptron neural network.
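
A minimal sketch of the final comparison mentioned above, timing a decision tree against a multilayer perceptron on a large synthetic flow-like dataset; it does not reproduce the UGR'16 features or the distributed preprocessing architecture, and the sizes and hyperparameters are assumptions.

```python
# Sketch: decision tree vs. multilayer perceptron on a large synthetic dataset,
# comparing fit time and held-out accuracy. Features are placeholders, not UGR'16 fields.
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, d = 100_000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # simple separable structure

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [
    ("decision tree", DecisionTreeClassifier(max_depth=12, random_state=0)),
    ("MLP", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=20, random_state=0)),
]:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    print(f"{name}: fit in {elapsed:.1f}s, accuracy {model.score(X_te, y_te):.3f}")
```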


2019, Vol 20 (1)
Author(s): Javier De Velasco Oriol, Edgar E. Vallejo, Karol Estrada, José Gerardo Taméz Peña, The Alzheimer’s Disease Neuroimaging Initiative

Background: Late-Onset Alzheimer’s Disease (LOAD) is a leading form of dementia. There is no effective cure for LOAD, leaving treatment efforts to depend on preventive cognitive therapies, which stand to benefit from timely estimation of the risk of developing the disease. Fortunately, a growing number of machine learning methods that are well positioned to address this challenge are becoming available. Results: We conducted systematic comparisons of representative machine learning models for predicting LOAD from genetic variation data provided by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. Our experimental results demonstrate that the best models tested yielded an area under the ROC curve of ∼72%. Conclusions: Machine learning models are promising alternatives for estimating the genetic risk of LOAD. Systematic machine learning model selection also provides the opportunity to identify new genetic markers potentially associated with the disease.
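
A minimal sketch of the kind of systematic comparison described above: several standard classifiers evaluated with cross-validated AUROC on a genotype-like matrix. The allele-count encoding, the simulated labels, and the model list are placeholders, not the ADNI data or the authors' model set.

```python
# Sketch: systematic comparison of classifiers by cross-validated AUROC
# on a genotype-like matrix (0/1/2 allele counts). Data and models are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_snps = 600, 300
X = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)  # minor-allele counts
# Hypothetical label: a handful of "risk" SNPs shift the case probability.
risk = X[:, :5].sum(axis=1)
y = (rng.random(n_subjects) < 1 / (1 + np.exp(-(risk - risk.mean())))).astype(int)

models = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUROC {aucs.mean():.2f} +/- {aucs.std():.2f}")
```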

