Robust Federated Learning via Collaborative Machine Teaching

2020 ◽  
Vol 34 (04) ◽  
pp. 4075-4082
Author(s):  
Yufei Han ◽  
Xiangliang Zhang

For federated learning systems deployed in the wild, flaws in the data hosted on local agents are widely observed. On one hand, when a large fraction (e.g., over 60%) of the training data is corrupted by systematic sensor noise and environmental perturbations, the performance of federated model training can degrade significantly. On the other hand, it is prohibitively expensive for either clients or service providers to set up manual sanitary checks to verify the quality of data instances. In our study, we address this challenge by proposing a collaborative and privacy-preserving machine teaching method. Specifically, we use a few trusted instances provided by teachers as benign examples in the teaching process. Our collaborative teaching approach jointly seeks the optimal tuning of the distributed training set, such that the model learned from the tuned training set predicts the labels of the trusted items correctly. The proposed method couples the processes of teaching and learning and thus directly produces a robust prediction model despite extremely pervasive systematic data corruption. An experimental study on real benchmark data sets demonstrates the validity of our method.
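
A rough single-machine sketch of the idea follows, assuming a logistic model and per-example training-set weights; the gradient-agreement heuristic and all names here are illustrative, not the authors' actual formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_logreg(X, y, w, lr=0.1, epochs=300):
    # Fit logistic regression where example i contributes with weight w[i].
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ theta)
        theta -= lr * X.T @ (w * (p - y)) / len(y)
    return theta

def collaborative_teaching(X, y, X_trust, y_trust, rounds=10, step=0.25):
    # Alternate between learning and tuning per-example weights so the
    # learned model predicts the teacher-provided trusted items correctly.
    w = np.ones(len(y))
    for _ in range(rounds):
        theta = train_weighted_logreg(X, y, w)
        g_trust = X_trust.T @ (sigmoid(X_trust @ theta) - y_trust)
        g_each = X * (sigmoid(X @ theta) - y)[:, None]  # per-example gradients
        align = g_each @ g_trust                        # agreement with trusted set
        w = np.clip(w - step * (align < 0), 0.0, 1.0)   # down-weight conflicting items
    return train_weighted_logreg(X, y, w), w
```

Examples whose loss gradient opposes the trusted-set gradient are progressively down-weighted, which is one simple way to realize "tuning the training set so the model fits the trusted items".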

Author(s):  
Hilal Bahlawan ◽  
Mirko Morini ◽  
Michele Pinelli ◽  
Pier Ruggero Spina ◽  
Mauro Venturini

This paper documents the set-up and validation of nonlinear autoregressive exogenous (NARX) models of a heavy-duty single-shaft gas turbine. The considered gas turbine is a General Electric PG 9351FA located in Italy. The data used for model training are time-series data sets of several different maneuvers acquired experimentally during the start-up procedure, covering cold, warm and hot start-ups. The trained NARX models are used to predict other experimental data sets, and the model outputs are compared with the corresponding measured data. The paper thus addresses the challenge of setting up robust and reliable NARX models by means of a sound selection of training data sets and a sensitivity analysis on the number of neurons. Moreover, a new performance function for the training process is defined to give greater weight to the most rapid transients. The final aim of this paper is the set-up of a powerful, easy-to-build and very accurate simulation tool with good generalization capability, which can be used for both control logic tuning and gas turbine diagnostics.
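
A minimal sketch of the transient-weighting idea, assuming a linear-in-parameters NARX regressor fitted by weighted least squares; the paper trains a neural NARX model, and the signals and weighting function below are illustrative assumptions.

```python
import numpy as np

def narx_design(u, y, nu=2, ny=2):
    # Regressor rows [y(t-1)..y(t-ny), u(t-1)..u(t-nu), 1] for one-step prediction.
    n = max(nu, ny)
    rows = [np.r_[y[t - ny:t][::-1], u[t - nu:t][::-1], 1.0]
            for t in range(n, len(y))]
    return np.array(rows), y[n:]

def transient_weights(target, alpha=5.0):
    # Weigh samples by the output's rate of change, so the fit emphasises
    # the most rapid transients (the idea behind the custom performance
    # function described above).
    dy = np.abs(np.gradient(target))
    return 1.0 + alpha * dy / (dy.max() + 1e-12)

u = np.sin(np.linspace(0, 20, 500))           # hypothetical input command
y = np.convolve(u, np.ones(10) / 10, "same")  # hypothetical measured output
X, t = narx_design(u, y)
sw = np.sqrt(transient_weights(t))
theta, *_ = np.linalg.lstsq(X * sw[:, None], t * sw, rcond=None)
print("one-step RMSE:", np.sqrt(np.mean((X @ theta - t) ** 2)))
```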


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Abstract Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
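
As a toy sketch of the splitting step, the example below uses a simple inter-arrival-gap rule in place of the paper's classification-based splitting; the 2.0 s threshold and the feature set are illustrative assumptions.

```python
import numpy as np

def split_on_gaps(times, sizes, gap=2.0):
    # Cut a full packet trace wherever the inter-arrival gap exceeds `gap`
    # seconds, yielding candidate single-page sequences.
    cuts = np.where(np.diff(times) > gap)[0] + 1
    return list(zip(np.split(times, cuts), np.split(sizes, cuts)))

def page_features(times, sizes):
    # Coarse per-segment features (signed sizes: + outgoing, - incoming)
    # that a website-fingerprinting classifier could consume.
    out, inc = sizes[sizes > 0], sizes[sizes < 0]
    return np.array([len(out), len(inc), out.sum(), -inc.sum(),
                     times[-1] - times[0]])

# Example: two page loads separated by a long pause.
times = np.array([0.0, 0.1, 0.3, 5.0, 5.2, 5.4])
sizes = np.array([600, -1500, -1500, 580, -1500, -900])
segments = split_on_gaps(times, sizes)
print([page_features(t, s) for t, s in segments])
```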


2021 ◽  
Author(s):  
John Tsotsos ◽  
Jun Luo

Abstract Learned systems in the domain of visual recognition and cognition impress in part because even though they are trained with datasets many orders of magnitude smaller than the full population of possible images, they exhibit sufficient generalization to be applicable to new and previously unseen data. Since training data sets typically represent such a small sampling of any domain, the possibility of bias in their composition is very real. But what are the limits of generalization given such bias, and up to what point might it be sufficient for a real problem task? Although many have examined issues regarding generalization from several perspectives, this question may require examining the data itself. Here, we focus on the characteristics of the training data that may play a role. Other disciplines have grappled with these problems also, most interestingly epidemiology, where experimental bias is a critical concern. The range and nature of data biases seen clinically are really quite relatable to learned vision systems. One obvious way to deal with bias is to ensure a large enough training set, but this might be infeasible for many domains. Another approach might be to perform a statistical analysis of the actual training set, to determine if all aspects of the domain are fairly captured. This too is difficult, in part because the full set of important variables might not be known, or perhaps not even knowable. Here, we try a different, simpler, approach in the tradition of the Thought Experiment, whose most famous instance is perhaps Schrödinger's Cat, to address part of these problems. There are many types of bias as will be seen, but we focus only on one, selection bias. The point of the thought experiment is not to demonstrate problems with all learned systems. Rather, this might be a simple theoretical tool to probe into bias during data collection to highlight deficiencies that might then deserve extra attention either in data collection or system development.


Geophysics ◽  
2021 ◽  
pp. 1-103
Author(s):  
Jiho Park ◽  
Jihun Choi ◽  
Soon Jee Seol ◽  
Joongmoo Byun ◽  
Young Kim

Deep learning (DL) methods have recently been introduced for seismic signal processing, and many researchers have adopted these novel techniques to construct DL models for seismic data reconstruction. The performance of DL-based methods depends heavily on what is learned from the training data. We focus on constructing a DL model that well reflects the features of the target data sets. The main goal is to integrate DL with an intuitive data analysis approach that compares similar patterns prior to the DL training stage. We have developed a sequential method consisting of two stages: (i) analyzing the training and target data sets simultaneously to determine a target-informed training set and (ii) training the DL model with this training data set to effectively interpolate the seismic data. Here, we introduce convolutional autoencoder t-distributed stochastic neighbor embedding (CAE t-SNE) analysis, which can provide insight into the results of interpolation through the analysis of both the training and target data sets prior to DL model training. The proposed method was tested with synthetic and field data. Dense seismic gathers (e.g., common-shot gathers, CSGs) were used as the labeled training data set, and relatively sparse seismic gathers (e.g., common-receiver gathers, CRGs) were reconstructed in both cases. The reconstructed results and SNRs demonstrated that the training data can be efficiently selected using CAE t-SNE analysis and that the spatial aliasing of CRGs was successfully alleviated by the DL model trained on this target-informed training data. These results imply that data analysis for selecting a target-informed training set is very important for successful DL interpolation. The proposed analysis method can also be applied to investigate the similarities between training and target data sets for other DL-based seismic data reconstruction tasks.
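
A hedged sketch of the selection stage follows: random arrays stand in for latent codes produced by the convolutional autoencoder, and the nearest-target selection rule is an illustrative assumption rather than the authors' exact criterion.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
train_codes = rng.normal(size=(300, 64))   # CAE codes of candidate CSG patches
target_codes = rng.normal(size=(100, 64))  # CAE codes of target CRG patches

# Embed training and target codes jointly so their similarity shows up in 2-D.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([train_codes, target_codes]))
train_2d, target_2d = emb[:300], emb[300:]

# Keep training patches whose embedding lies close to some target patch,
# i.e. build a target-informed training set.
d_min = np.linalg.norm(train_2d[:, None, :] - target_2d[None, :, :],
                       axis=-1).min(axis=1)
selected = np.where(d_min < np.percentile(d_min, 50))[0]
print(f"selected {selected.size} of {train_codes.shape[0]} training patches")
```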


2019 ◽  
Vol 19 (1) ◽  
pp. 58-63 ◽  
Author(s):  
R. Ciucu ◽  
F.C. Adochiei ◽  
Ioana-Raluca Adochiei ◽  
F. Argatu ◽  
G.C. Seriţan ◽  
...  

Abstract Developing Artificial Intelligence is a labor-intensive task that requires both storage and computational resources. In this paper, we present a state-of-the-art service-based infrastructure for deploying, managing and serving computational models alongside their respective data sets and virtual environments. Our architecture uses key-based values to store specific graphs and data sets in memory for fast deployment and model training, further reducing the need for manual data reduction in the drafting and retraining stages. To develop the platform, we used clustering and orchestration to set up services and containers that allow deployment within seconds. In this article, we cover high-performance computing concepts such as swarming and GPU resource management for model implementation in production environments, with emphasis on standardized development to reduce integration tasks and on performance optimization.
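
A minimal sketch of the key-based, in-memory artefact store the abstract alludes to; the class and method names are hypothetical and not the paper's actual service API.

```python
import pickle
import time

class ArtefactStore:
    """Key-based, in-memory store for models and data sets, sketching
    fast keyed deployment; a toy stand-in, not the described platform."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, artefact):
        # Serialize once so repeated deployments reuse the cached bytes.
        self._blobs[key] = (pickle.dumps(artefact), time.time())

    def get(self, key):
        blob, _created = self._blobs[key]
        return pickle.loads(blob)

store = ArtefactStore()
store.put("mnist/cnn/v3", {"weights": [0.1, 0.2], "epochs": 12})
print(store.get("mnist/cnn/v3"))
```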


2021 ◽  
Author(s):  
Alastair McKinstry ◽  
Oisin Boydell ◽  
Quan Le ◽  
Inder Preet ◽  
Jennifer Hanafin ◽  
...  

The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth Observation (EO) data. While the quality and quantity of EO data have increased drastically over the past decades, the availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the findability, interoperability and reusability aspects. To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training data sets self-explanatory ("AI-ready") in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge.

Key elements addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative quality assurance metrics, data provenance and processing history, as well as integrated feature engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include, for example, forest biomass, sea ice detection and the estimation of atmospheric parameters. An API for the easy exploitation of these datasets will be provided to allow the training datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practices from the ML community) and to support updating and repeated model training over time.

This presentation will present the first version of the AIREO training dataset specification and will showcase some elements of the best practices that were developed. The AIREO-compliant pilot datasets, which are openly accessible, will be presented, and community feedback is explicitly encouraged.

[1] https://aireo.net/
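
For illustration only, a hypothetical STAC-flavoured record for one item of an AI-ready training dataset; every `aireo:*` field name below is invented here and is not the published AIREO schema.

```python
# Hypothetical STAC-style item for a training data set; the "aireo:*"
# property names are illustrative assumptions, not the official spec.
aireo_item = {
    "type": "Feature",
    "id": "sea-ice-tds-0001",
    "properties": {
        "datetime": "2020-03-15T10:30:00Z",
        "aireo:task": "sea ice detection",
        "aireo:label_type": "raster mask",
        "aireo:provenance": "Sentinel-1 GRD scene, manually annotated",
        "aireo:quality": {"label_agreement": 0.93},
    },
    "assets": {
        "imagery": {"href": "s3://example-bucket/sea-ice/0001/patch.tif"},
        "labels": {"href": "s3://example-bucket/sea-ice/0001/mask.tif"},
    },
}
```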


2003 ◽  
Vol 19 ◽  
pp. 315-354 ◽  
Author(s):  
G. M. Weiss ◽  
F. Provost

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a budget-sensitive progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.
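
A rough sketch of a budget-sensitive progressive sampling loop in this spirit; the greedy marginal-gain rule and the `fitness` callback are simplifying assumptions, not the article's exact algorithm.

```python
import numpy as np

def progressive_sample(pools, budget, fitness, batch=50, seed=0):
    # pools[c]: list of unused examples of class c; fitness(chosen) returns
    # a held-out score for a classifier trained on the chosen examples.
    rng = np.random.default_rng(seed)
    chosen = {c: [] for c in pools}
    gain = {c: np.inf for c in pools}   # optimistic init: try every class once
    score = 0.0
    while sum(map(len, chosen.values())) + batch <= budget:
        live = [c for c in pools if pools[c]]
        if not live:
            break
        c = max(live, key=lambda k: gain[k])      # buy from most promising class
        take = min(batch, len(pools[c]))
        idx = set(rng.choice(len(pools[c]), take, replace=False).tolist())
        chosen[c] += [x for i, x in enumerate(pools[c]) if i in idx]
        pools[c] = [x for i, x in enumerate(pools[c]) if i not in idx]
        new_score = fitness(chosen)
        gain[c], score = new_score - score, new_score  # measured marginal gain
    return chosen
```

The loop adapts the class distribution to whatever mixture actually improves held-out performance, rather than fixing the natural or balanced distribution in advance.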


2021 ◽  
Vol 14 (2) ◽  
pp. 120-128
Author(s):  
Mohammed Ehsan Safi ◽  
Eyad I. Abbas

In personal image recognition algorithms, two effective factors govern the system's evaluation: the recognition rate and the size of the database. Unfortunately, the recognition rate is proportional to the size of the training sets; increasing them therefore increases processing time and aggravates memory limitation problems. This paper's main goal is to present a robust algorithm with minimal data sets and a high recognition rate. Images of ten persons were chosen as a database: nine images per individual as the full version of the training data set, and one image per person outside the training set as a test pattern before the database reduction procedure. The proposed algorithm integrates Principal Component Analysis (PCA) as a feature extraction technique with the minimum mean of clusters and Euclidean distance to achieve personal recognition. After indexing the training set for each person, the clustering of the differences is determined. The person is recognized by the minimum mean index, and this process is repeated with each reduction. The experimental results show that the recognition rate is 100% despite reducing the training sets to 44%, while it decreases to 70% when the reduction reaches 89%. The clear conclusion is that the results of the proposed system support reducing the training sets while still obtaining a high recognition rate, subject to application requirements.
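
A condensed sketch of the recognition rule, assuming PCA features and a nearest-class-mean decision under Euclidean distance; the iterative reduction bookkeeping is omitted, and the function names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit(train_imgs, labels, n_components=20):
    # Project training faces onto PCA features and keep one mean per person.
    X = np.array([im.ravel() for im in train_imgs], dtype=float)
    pca = PCA(n_components=n_components).fit(X)
    Z = pca.transform(X)
    labels = np.array(labels)
    means = {p: Z[labels == p].mean(axis=0) for p in np.unique(labels)}
    return pca, means

def predict(pca, means, img):
    # Recognize the person whose cluster mean is nearest in Euclidean distance.
    z = pca.transform(img.ravel()[None, :].astype(float))[0]
    return min(means, key=lambda p: np.linalg.norm(z - means[p]))
```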


Author(s):  
Hamid Asgari ◽  
Mauro Venturini ◽  
XiaoQi Chen ◽  
Raazesh Sainudiin

This study deals with the modeling and simulation of the transient behavior of an Industrial Power Plant Gas Turbine (IPGT). The data used for model setup and validation were taken experimentally during the start-up procedure of a single-shaft heavy-duty gas turbine. Two different models, one physics-based and one black-box, are developed and compared; they are implemented using the Matlab tools Simulink and the Neural Network Toolbox, respectively. The Simulink model was constructed from the thermodynamic and energy balance equations in the Matlab environment. The nonlinear autoregressive with exogenous inputs (NARX) model was set up using the same data sets and subsequently applied to each of the data sets separately. The results showed that both the Simulink and NARX models are capable of satisfactory prediction, especially considering that the data used for model training and validation were experimental data taken during normal gas turbine operation using its standard instrumentation.


2006 ◽  
Vol 3 (2) ◽  
pp. 285-297 ◽  
Author(s):  
R. G. Kamp ◽  
H. H. G. Savenije

Abstract. Artificial Neural Networks have proven to be good modelling tools in hydrology for rainfall-runoff modelling and hydraulic flow modelling. Representative data sets are necessary for the training phase, in which the ANN learns the model's input-output relations. Good and representative training data are not always available, however. In this publication, Genetic Algorithms are used to optimise training data sets. The approach is tested with an existing hydrological model in The Netherlands. The optimised training set resulted in significantly better training results.
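
As a hedged sketch of the idea, a toy genetic algorithm that evolves binary masks over the candidate data; `fitness(mask)` would train the ANN on the masked subset and return a validation score. All names and GA settings here are illustrative assumptions, not the publication's configuration.

```python
import numpy as np

def ga_select(n_candidates, fitness, n_pick=100, pop=30, gens=40, seed=0):
    # Evolve boolean masks selecting ~n_pick training samples; keep the best
    # half each generation and refill with uniform-crossover children.
    rng = np.random.default_rng(seed)

    def random_mask():
        m = np.zeros(n_candidates, bool)
        m[rng.choice(n_candidates, n_pick, replace=False)] = True
        return m

    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = [population[i] for i in np.argsort(scores)[-pop // 2:]]
        children = []
        while len(parents) + len(children) < pop:
            a, b = rng.choice(len(parents), 2, replace=False)
            child = np.where(rng.random(n_candidates) < 0.5,
                             parents[a], parents[b])
            flip = rng.choice(n_candidates, 2)   # light mutation
            child[flip] = ~child[flip]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```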

