Methodology for Collecting a Training Dataset for an Intrusion Detection Model

The paper discusses the training of machine-learning models for detecting computer attacks. It presents, in sequence, an analysis of publicly available training datasets and of tools for analyzing network traffic and extracting features of network sessions. The drawbacks of existing tools and the errors they may introduce into the datasets built with them are noted. The paper concludes that one must collect one's own training data, because the reliability of public datasets is not guaranteed and because pre-trained models are of limited use in networks whose characteristics differ from those of the network in which the training traffic was collected. A practical approach to generating training data for computer attack detection models is proposed. The proposed solutions were tested by evaluating the quality of model training on the collected data and the quality of attack detection in a real network infrastructure.

InsulatorGAN: A Transmission Line Insulator Detection Model Using Multi-Granularity Conditional Generative Adversarial Nets for UAV Inspection

Remote Sensing ◽

10.3390/rs13193971 ◽

2021 ◽

Vol 13 (19) ◽

pp. 3971

Author(s):

Wenxiang Chen ◽

Yingna Li ◽

Zhengang Zhao

Keyword(s):

Transmission Line ◽

Transmission Lines ◽

State Of The Art ◽

Generative Adversarial Network ◽

Detection Model ◽

Adversarial Network ◽

Monte Carlo Search ◽

Model Training ◽

Inspection Tasks

Insulator detection is one of the most significant issues in high-voltage transmission line inspection using unmanned aerial vehicles (UAVs) and has attracted attention from researchers all over the world. The state-of-the-art models in object detection perform well in insulator detection, but the precision is limited by the scale of the dataset and parameters. Recently, the Generative Adversarial Network (GAN) was found to offer excellent image generation. Therefore, we propose a novel model called InsulatorGAN based on using conditional GANs to detect insulators in transmission lines. However, due to the fixed categories in datasets such as ImageNet and Pascal VOC, the generated insulator images are of a low resolution and are not sufficiently realistic. To solve these problems, we established an insulator dataset called InsuGenSet for model training. InsulatorGAN can generate high-resolution, realistic-looking insulator-detection images that can be used for data expansion. Moreover, InsulatorGAN can be easily adapted to other power equipment inspection tasks and scenarios using one generator and multiple discriminators. To give the generated images richer details, we also introduced a penalty mechanism based on a Monte Carlo search in InsulatorGAN. In addition, we proposed a multi-scale discriminator structure based on a multi-task learning mechanism to improve the quality of the generated images. Finally, experiments on the InsuGenSet and CPLID datasets demonstrated that our model outperforms existing state-of-the-art models by advancing both the resolution and quality of the generated images as well as the position of the detection box in the images.

A test development of a data driven model to simulate chlorophyll data at Tongyeong bay in Korea

10.5194/egusphere-egu2020-13035 ◽

2020 ◽

Author(s):

Sung Dae Kim ◽

Sang Hwa Choi

Keyword(s):

Solar Radiation ◽

Periodic Variation ◽

Training Data ◽

Training Dataset ◽

Sequence Length ◽

Observation Data ◽

The Public ◽

Ocean Science ◽

Moderate Resolution Imaging Spectroradiometer ◽

Hidden Layer

A pilot machine learning (ML) program was developed to test ML techniques for simulating biochemical parameters in a coastal area of Korea. Temperature, chlorophyll, solar radiation, daylight time, humidity, and nutrient data were collected as the training dataset from the public domain and from in-house projects of KIOST (Korea Institute of Ocean Science & Technology). Daily satellite chlorophyll data from MODIS (Moderate Resolution Imaging Spectroradiometer) and GOCI (Geostationary Ocean Color Imager) were retrieved from public services. Daily SST (Sea Surface Temperature) data and ECMWF solar radiation data were retrieved from the GHRSST service and the Copernicus service. Meteorological and marine observation data were collected from KMA (Korea Meteorological Agency) and KIOST. The output of KIOST's marine biochemical numerical model was also prepared to validate the ML model. The ML program was configured using an LSTM network and TensorFlow. During data processing, some chlorophyll data were interpolated because many values were missing from the satellite dataset. ML training was conducted repeatedly under varying combinations of sequence length, learning rate, number of hidden layers, and iterations. Of the training dataset, 75% was used for training and 25% for prediction. The maximum correlation between training data and predicted data was 0.995 when model output data were used as the training dataset. When satellite data and observation data were used, correlations were around 0.55. Though the latter correlation is relatively low, the model simulated periodic variation well, with some differences at peak values. The ML model could be applied to the simulation of chlorophyll data if sufficient reliable observation data can be prepared.
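
The 75%/25% split and the correlation score mentioned in this abstract can be illustrated with a minimal sketch. It assumes the split is chronological (the abstract does not say how the data were divided) and uses a synthetic series in place of real chlorophyll data; the function names `chronological_split` and `pearson_r` are hypothetical and not taken from the authors' program.

```python
import numpy as np

def chronological_split(series, train_fraction=0.75):
    """Split a time-ordered array into a leading train part and a trailing test part."""
    series = np.asarray(series, dtype=float)
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length arrays."""
    return float(np.corrcoef(observed, predicted)[0, 1])

# Synthetic stand-in for a daily chlorophyll series (hypothetical values).
days = np.arange(400)
chlorophyll = 2.0 + np.sin(2 * np.pi * days / 365.0)
train, test = chronological_split(chlorophyll)

# Placeholder "prediction": the held-out values plus small noise,
# standing in for the output of a trained model.
rng = np.random.default_rng(0)
predicted = test + rng.normal(0.0, 0.05, size=test.shape)
r = pearson_r(test, predicted)
```

In a real run, `predicted` would come from the trained LSTM, and `r` would be the kind of correlation figure the abstract reports.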

Rapid Dynamic Naturalistic Monitoring of Bradykinesia in Parkinson’s Disease Using a Wrist-Worn Accelerometer

Sensors ◽

10.3390/s21237876 ◽

2021 ◽

Vol 21 (23) ◽

pp. 7876

Author(s):

Jeroen G. V. Habets ◽

Christian Herff ◽

Pieter L. Kubben ◽

Mark L. Kuijf ◽

Yasin Temel ◽

...

Keyword(s):

Parkinson's Disease ◽

Motor Fluctuations ◽

Training Data ◽

Training Dataset ◽

Group Model ◽

Accelerometer Data ◽

Small Individual ◽

Therapeutic Benefits ◽

Model Training

Motor fluctuations in Parkinson’s disease are characterized by unpredictability in the timing and duration of dopaminergic therapeutic benefits on symptoms, including bradykinesia and rigidity. These fluctuations significantly impair the quality of life of many Parkinson’s patients. However, current clinical evaluation tools are not designed for the continuous, naturalistic (real-world) symptom monitoring needed to optimize clinical therapy to treat fluctuations. Although commercially available wearable motor monitoring, used over multiple days, can augment neurological decision making, the feasibility of rapid and dynamic detection of motor fluctuations is unclear. So far, applied wearable monitoring algorithms are trained on group data. In this study, we investigated the influence of individual model training on short timescale classification of naturalistic bradykinesia fluctuations in Parkinson’s patients using a single-wrist accelerometer. As part of the Parkinson@Home study protocol, 20 Parkinson patients were recorded with bilateral wrist accelerometers for a one hour OFF medication session and a one hour ON medication session during unconstrained activities in their own homes. Kinematic metrics were extracted from the accelerometer data from the bodyside with the largest unilateral bradykinesia fluctuations across medication states. The kinematic accelerometer features were compared over the 1 h duration of recording, and medication-state classification analyses were performed on 1 min segments of data. Then, we analyzed the influence of individual versus group model training, data window length, and total number of training patients included in group model training, on classification. Statistically significant areas under the curves (AUCs) for medication induced bradykinesia fluctuation classification were seen in 85% of the Parkinson patients at the single minute timescale using the group models. 
Individually trained models performed at the same level as the group trained models (mean AUC both 0.70, standard deviation respectively 0.18 and 0.10) despite the small individual training dataset. AUCs of the group models improved as the length of the feature windows was increased to 300 s, and with additional training patient datasets. We were able to show that medication-induced fluctuations in bradykinesia can be classified using wrist-worn accelerometry at the time scale of a single minute. Rapid, naturalistic Parkinson motor monitoring has the clinical potential to evaluate dynamic symptomatic and therapeutic fluctuations and help tailor treatments on a fast timescale.
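
The area under the curve (AUC) reported in this abstract can be computed directly from per-segment classifier scores. The sketch below uses the rank-based definition of AUC: the probability that a randomly chosen positive segment scores higher than a randomly chosen negative one. The labels, scores, and the function name `auc_score` are illustrative assumptions, not the authors' data or code.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one segment from each class")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical 1-minute segments: 1 = ON medication, 0 = OFF medication.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]
auc = auc_score(labels, scores)
```

Here 7 of the 9 positive/negative pairs are ordered correctly, so `auc` is 7/9. The pairwise comparison is quadratic in the number of segments, which is acceptable for short recordings such as an hour of 1-minute windows.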

Quantifying identifiability to choose and audit ϵ in differentially private deep learning

Proceedings of the VLDB Endowment ◽

10.14778/3484224.3484231 ◽

2021 ◽

Vol 14 (13) ◽

pp. 3335-3347

Author(s):

Daniel Bernau ◽

Günther Eibl ◽

Philip W. Grassal ◽

Hannah Keller ◽

Florian Kerschbaum

Keyword(s):

Machine Learning ◽

Differential Privacy ◽

Training Data ◽

Training Dataset ◽

Privacy Leakage ◽

Societal Norms ◽

Machine Learning Model ◽

Model Training ◽

Parameter Values ◽

Learning Data

Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters (ϵ, δ). Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the theoretical upper bound on privacy loss (ϵ, δ) might be loose, depending on the chosen sensitivity and data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which (ϵ, δ) are only indirectly related. We transform (ϵ, δ) to a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical (ϵ, δ).
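
The link between ϵ and an adversary's posterior belief can be made concrete with the standard Bayes argument. The sketch below covers only the simplified case of pure ϵ-differential privacy (δ = 0) and a single mechanism output; it is not claimed to reproduce the bound derived in the paper, which also handles δ and composition. Under ϵ-DP the likelihood ratio between "record present" and "record absent" is at most e^ϵ, so with prior p the posterior is at most p·e^ϵ / (p·e^ϵ + 1 − p). The function name is hypothetical.

```python
import math

def posterior_belief_bound(epsilon, prior=0.5):
    """Upper bound on an adversary's posterior belief that a record is present.

    Assumes pure epsilon-DP (delta = 0) and a single mechanism output.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Bayes' rule with the likelihood ratio capped at exp(epsilon),
    # written with exp(-epsilon) so that large epsilon does not overflow.
    return 1.0 / (1.0 + (1.0 - prior) / prior * math.exp(-epsilon))
```

With a uniform prior, ϵ = 0 leaves the adversary at 0.5 (no information gained), while ϵ = ln 3 allows a posterior of up to 0.75. Reading ϵ this way is one route to relating it to identifiability requirements.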

Coarse-to-Fine Adaptive People Detection for Video Sequences by Maximizing Mutual Information

Sensors ◽

10.3390/s19010004 ◽

2018 ◽

Vol 19 (1) ◽

pp. 4

Author(s):

Álvaro García-Martín ◽

Juan SanMiguel ◽

José Martínez

Keyword(s):

Mutual Information ◽

Detection Threshold ◽

Ground Truth ◽

Training Data ◽

Training Dataset ◽

People Detection ◽

Detection Model ◽

Unseen Data ◽

Bounding Boxes ◽

Coarse To Fine

Applying people detectors to unseen data is challenging since the distributions of patterns, such as viewpoints, motion, poses, backgrounds, occlusions and people sizes, may differ significantly from those of the training dataset. In this paper, we propose a coarse-to-fine framework to adapt frame-by-frame people detectors during runtime classification, without requiring any additional manually labeled ground truth apart from the offline training of the detection model. Such adaptation makes use of the mutual information of multiple detectors, i.e., the similarities and dissimilarities of the detectors, estimated and agreed upon by pair-wise correlating their outputs. Globally, the proposed adaptation discriminates between relevant instants in a video sequence, i.e., it identifies the representative frames for adapting the system. Locally, the proposed adaptation identifies the best configuration (i.e., detection threshold) of each detector under analysis by maximizing the mutual information. The proposed coarse-to-fine approach does not require training the detectors for each new scenario and uses standard people detector outputs, i.e., bounding boxes. The experimental results demonstrate that the proposed approach outperforms state-of-the-art detectors whose optimal threshold configurations are determined in advance and fixed from offline training data.
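
The idea of choosing a detection threshold by maximizing mutual information can be sketched in a heavily simplified form. The assumptions here are substantial: each detector is reduced to one binary decision per candidate window, only two detectors are compared, and mutual information is estimated from the empirical joint distribution of those decisions. The paper works on bounding boxes and pair-wise correlation of outputs, so this illustrates the principle only; all names and data are hypothetical.

```python
import numpy as np

def mutual_information(a, b):
    """Mutual information (in bits) between two equal-length binary sequences."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    mi = 0.0
    for x in (False, True):
        for y in (False, True):
            p_xy = np.mean((a == x) & (b == y))
            p_x = np.mean(a == x)
            p_y = np.mean(b == y)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return float(mi)

def best_threshold(scores_a, decisions_b, candidates):
    """Pick the threshold for detector A whose decisions share most information with B."""
    scores_a = np.asarray(scores_a, dtype=float)
    return max(candidates,
               key=lambda t: mutual_information(scores_a >= t, decisions_b))

# Hypothetical confidence scores of detector A and decisions of detector B
# on the same four candidate windows.
scores_a = [0.9, 0.2, 0.8, 0.1]
decisions_b = [True, False, True, False]
chosen = best_threshold(scores_a, decisions_b, [0.05, 0.5, 0.95])
```

Thresholds that accept everything or reject everything carry no information about the other detector, so the middle candidate is selected. No ground-truth labels are involved, which matches the unsupervised nature of the adaptation described in the abstract.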

Semi-Supervised Deep Learning for Lunar Crater Detection Using CE-2 DOM

Remote Sensing ◽

10.3390/rs13142819 ◽

2021 ◽

Vol 13 (14) ◽

pp. 2819

Author(s):

Sudong Zang ◽

Lingli Mu ◽

Lina Xian ◽

Wei Zhang

Keyword(s):

Deep Learning ◽

Landing Site ◽

Training Data ◽

The Moon ◽

Detection Model ◽

Processing Times ◽

Crater Detection ◽

High Resolution Imagery ◽

Model Training ◽

Digital Orthophoto

Lunar craters are very important for estimating the geological age of the Moon, studying the evolution of the Moon, and selecting landing sites. Due to a lack of labeled samples, the processing times required by high-resolution imagery, the small number of suitable detection models, and the influence of solar illumination, Crater Detection Algorithms (CDAs) based on Digital Orthophoto Maps (DOMs) have not yet been well developed. In this paper, a large amount of training data is labeled manually in the Highland and Maria regions, using the Chang’E-2 (CE-2) DOM; however, the labeled data cannot cover all crater types. To solve the problem of small crater detection, a new crater detection model (Crater R-CNN) is proposed, which can effectively extract the spatial and semantic information of craters from DOM data. As incompletely labeled samples are not conducive to model training, the Two-Teachers Self-training with Noise (TTSN) method is used to train the Crater R-CNN model, thus constructing a new model, called Crater R-CNN with TTSN, which can achieve state-of-the-art performance. To evaluate the accuracy of the model, three other detection models (Mask R-CNN, no-Mask R-CNN, and Crater R-CNN) based on semi-supervised deep learning were used to detect craters in the Highland and Maria regions. The results indicate that Crater R-CNN with TTSN achieved the highest precision (91.4% and 88.5%, respectively) in the Highland and Maria regions, and also obtained the highest recall and F1 score. Compared with Mask R-CNN, no-Mask R-CNN, and Crater R-CNN, Crater R-CNN with TTSN showed strong robustness and better generalization ability for detecting craters within 1 km in different terrains, making it possible to detect small craters with high accuracy when using DOM data.
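
Precision, recall, and F1 score, the metrics quoted in this abstract, follow directly from counts of true positives, false positives, and false negatives. The sketch below shows the arithmetic; the counts in the usage line are invented for illustration and are unrelated to the paper's results, and how a detection is matched to a labeled crater is left out.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from detection counts."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 craters found correctly, 2 false alarms, 2 craters missed.
precision, recall, f1 = detection_metrics(8, 2, 2)
```

Because F1 is a harmonic mean, a model cannot score well on it by trading away recall for precision or the reverse.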

Improved Training of CAE-Based Defect Detectors Using Structural Noise

Applied Sciences ◽

10.3390/app112412062 ◽

2021 ◽

Vol 11 (24) ◽

pp. 12062

Author(s):

Reina Murakami ◽

Valentin Grave ◽

Osamu Fukuda ◽

Hiroshi Okumura ◽

Nobuhiko Yamaguchi

Keyword(s):

Test Data ◽

Gaussian Noise ◽

Visual Inspection ◽

Noisy Data ◽

Training Data ◽

Noise Factor ◽

Training Dataset ◽

Structural Noise

The appearance of products is important to companies, as it reflects the quality of their manufacturing to customers. Nowadays, visual inspection is conducted by human inspectors. This research attempts to automate this process using Convolutional AutoEncoders (CAE). Our models were trained using images of non-defective parts. Previous research on autoencoders has reported that the accuracy of image regeneration can be improved by adding noise to the training dataset, but no extensive analysis of the noise factor has been done. Therefore, our method compares the effects of two different noise patterns on the models' efficiency: Gaussian noise and noise made of a known structure. The test datasets were composed of “defective” parts. Across the experiments, the precision of the CAE generally improved when noisy data were used during the training phases. The best results were obtained with structural noise, made of defined shapes randomly corrupting the training data. Furthermore, the models were able to process test data whose positions and rotations differed slightly from those found in the training dataset. However, shortcomings appeared when “regular” spots (in the training data) and “defective” spots (in the test data) partially or totally overlapped.
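
The two noise patterns compared in this abstract can be sketched as image corruption functions. The assumptions are: images are floating-point arrays scaled to [0, 1], "structural noise" is approximated by randomly placed filled rectangles (the abstract only says "defined shapes"), and all function and parameter names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the [0, 1] range."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_structural_noise(image, n_shapes, max_size, rng):
    """Overwrite randomly placed rectangles with a random constant intensity.

    max_size must not exceed the image height or width.
    """
    noisy = image.copy()
    height, width = image.shape[:2]
    for _ in range(n_shapes):
        shape_h = int(rng.integers(1, max_size + 1))
        shape_w = int(rng.integers(1, max_size + 1))
        top = int(rng.integers(0, height - shape_h + 1))
        left = int(rng.integers(0, width - shape_w + 1))
        noisy[top:top + shape_h, left:left + shape_w] = rng.uniform(0.0, 1.0)
    return noisy

rng = np.random.default_rng(0)
clean = np.full((32, 32), 0.5)          # placeholder for a non-defective part image
gaussian = add_gaussian_noise(clean, sigma=0.1, rng=rng)
structural = add_structural_noise(clean, n_shapes=3, max_size=8, rng=rng)
```

In a typical denoising setup, the corrupted image would be the network input and the clean image the reconstruction target; whether the authors trained exactly this way is not stated in the abstract.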

Edge Learning

ACM Computing Surveys ◽

10.1145/3464419 ◽

2021 ◽

Vol 54 (7) ◽

pp. 1-36

Author(s):

Jie Zhang ◽

Zhihao Qu ◽

Chenxi Chen ◽

Haozhao Wang ◽

Yufeng Zhan ◽

...

Keyword(s):

Big Data Analytics ◽

Training Data ◽

Future Research ◽

Great Promise ◽

Communication Overhead ◽

Comprehensive Overview ◽

Training Models ◽

Comprehensive Survey ◽

Model Training ◽

Privacy Issues

Machine Learning (ML) has demonstrated great promise in various fields, e.g., self-driving and smart cities, which are fundamentally altering the way individuals and organizations live, work, and interact. Traditional centralized learning frameworks require uploading all training data from different sources to a remote data server, which incurs significant communication overhead, service latency, and privacy issues. To further extend the frontiers of the learning paradigm, a new learning concept, namely Edge Learning (EL), is emerging. It complements cloud-based methods for big data analytics by enabling distributed edge nodes to cooperatively train models and conduct inference with their locally cached data. To explore the new characteristics and potential prospects of EL, we conduct a comprehensive survey of the recent research efforts on EL. Specifically, we first introduce the background and motivation. We then discuss the challenging issues in EL from the aspects of data, computation, and communication. Furthermore, we provide an overview of the enabling technologies for EL, including model training, inference, security guarantee, privacy protection, and incentive mechanism. Finally, we discuss future research opportunities on EL. We believe that this survey will provide a comprehensive overview of EL and stimulate fruitful future research in this field.
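
One common way for edge nodes to train a model cooperatively without uploading raw data is to average locally trained parameters, weighted by each node's sample count (federated averaging). The abstract does not name a specific algorithm, so this technique is brought in purely as an illustration; the function name and the numbers are hypothetical.

```python
import numpy as np

def weighted_parameter_average(node_params, node_sample_counts):
    """Average parameter vectors from several nodes, weighted by local data size."""
    weights = np.asarray(node_sample_counts, dtype=float)
    stacked = np.stack([np.asarray(p, dtype=float) for p in node_params])
    return np.average(stacked, axis=0, weights=weights)

# Two hypothetical edge nodes holding 100 and 300 local samples.
global_params = weighted_parameter_average(
    [[1.0, 2.0], [3.0, 4.0]],
    [100, 300],
)
```

Only the parameter vectors cross the network, which is how such schemes reduce the communication overhead and privacy exposure that the abstract attributes to centralized training.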

AI-Ready Training Datasets for Earth Observation: Enabling FAIR data principles for EO training data.

10.5194/egusphere-egu21-12384 ◽

2021 ◽

Author(s):

Alastair McKinstry ◽

Oisin Boydell ◽

Quan Le ◽

Inder Preet ◽

Jennifer Hanafin ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Forest Biomass ◽

Earth Observation ◽

Training Data ◽

Training Dataset ◽

Data Provenance ◽

Data Sets ◽

Model Training ◽

Ice Detection

The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth Observation (EO) data. While the quality and quantity of EO data have increased drastically over the past decades, the availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the findability, interoperability and reusability aspects. To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training datasets self-explanatory (“AI-ready”) in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge. Key elements addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative Quality Assurance metrics, data provenance and processing history, as well as integrated feature engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include, for example, forest biomass, sea ice detection and the estimation of atmospheric parameters. An API for the easy exploitation of these datasets will be provided to allow the Training Datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practices from the ML community), so that datasets can be updated and models retrained over time. This presentation will present the first version of the AIREO training dataset specification and will showcase some elements of the best practices that were developed. The AIREO-compliant pilot datasets, which are openly accessible, will be presented, and community feedback is explicitly encouraged. [1] https://aireo.net/

The Evaluation of Acute Myeloid Leukaemia (AML) Blood Cell Detection Models Using Different YOLO Approaches

10.1101/2021.08.04.455113 ◽

2021 ◽

Author(s):

Kaung Myat Naing ◽

Veerayuth Kittichai ◽

Teerawat Tongloy ◽

Santhad Chuwongin ◽

Siridech Boonsang

Keyword(s):

Acute Myeloid Leukaemia ◽

Object Detection ◽

Myeloid Leukaemia ◽

Data Augmentation ◽

Training Data ◽

Training Dataset ◽

Cell Detection ◽

Augmentation Techniques ◽

Model Training ◽

Acute Myeloid

This study evaluates the performance of Acute Myeloid Leukaemia (AML) blast cell detection models on microscopic examination images, with the aim of faster diagnosis and disease monitoring. You Only Look Once (YOLO), a popular deep learning algorithm developed for object detection, is among the state-of-the-art algorithms for real-time object detection systems. We employ four versions of the YOLO algorithm (YOLOv3, YOLOv3-Tiny, YOLOv2 and YOLOv2-Tiny) to detect 15 classes of AML blood cells in examination images. We acquired the publicly available dataset from The Cancer Imaging Archive (TCIA), which consists of 18,365 expert-labelled single-cell images. Data augmentation techniques are additionally applied to enhance and balance the training images in the dataset. The overall results indicate that all four YOLO approaches achieve more than 92% in precision and sensitivity. In comparison, YOLOv3 performs more reliably than the other three approaches. Consistently, the AUC values for the four YOLO models are 0.969 (YOLOv3), 0.967 (YOLOv3-Tiny), 0.963 (YOLOv2), and 0.948 (YOLOv2-Tiny). Furthermore, we compare the best model's performance between two approaches: using the entire training dataset without data augmentation, and using image division with data augmentation. Remarkably, with only 33.51 percent of the training data used in model training, the prediction outcomes of the model that used image partitioning with data augmentation were similar to those obtained using the complete training dataset. This work potentially provides a useful rapid digital tool for the screening and evaluation of numerous haematological disorders.
