Robust Federated Learning via Collaborative Machine Teaching

2020 ◽  
Vol 34 (04) ◽  
pp. 4075-4082
Author(s):  
Yufei Han ◽  
Xiangliang Zhang

For federated learning systems deployed in the wild, flaws in the data hosted on local agents are widely observed. On one hand, when a large fraction (e.g., over 60%) of the training data is corrupted by systematic sensor noise and environmental perturbations, the performance of federated model training can degrade significantly. On the other hand, it is prohibitively expensive for either clients or service providers to set up manual sanitary checks to verify the quality of data instances. In our study, we address this challenge by proposing a collaborative and privacy-preserving machine teaching method. Specifically, we use a few trusted instances provided by teachers as benign examples in the teaching process. Our collaborative teaching approach jointly seeks the optimal tuning of the distributed training set, such that the model learned from the tuned training set predicts the labels of the trusted items correctly. The proposed method couples the processes of teaching and learning and thus directly produces a robust prediction model despite extremely pervasive systematic data corruption. An experimental study on real benchmark data sets demonstrates the validity of our method.
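
A rough single-machine sketch of the idea follows, assuming a logistic model and per-example training-set weights; the gradient-agreement heuristic and all names here are illustrative, not the authors' actual formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_logreg(X, y, w, lr=0.1, epochs=300):
    # Fit logistic regression where example i contributes with weight w[i].
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ theta)
        theta -= lr * X.T @ (w * (p - y)) / len(y)
    return theta

def collaborative_teaching(X, y, X_trust, y_trust, rounds=10, step=0.25):
    # Alternate between learning and tuning per-example weights so the
    # learned model predicts the teacher-provided trusted items correctly.
    w = np.ones(len(y))
    for _ in range(rounds):
        theta = train_weighted_logreg(X, y, w)
        g_trust = X_trust.T @ (sigmoid(X_trust @ theta) - y_trust)
        g_each = X * (sigmoid(X @ theta) - y)[:, None]  # per-example gradients
        align = g_each @ g_trust                        # agreement with trusted set
        w = np.clip(w - step * (align < 0), 0.0, 1.0)   # down-weight conflicting items
    return train_weighted_logreg(X, y, w), w
```

Examples whose loss gradient opposes the trusted-set gradient are progressively down-weighted, which is one simple way to realize "tuning the training set so the model fits the trusted items".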

Author(s):  
Hilal Bahlawan ◽  
Mirko Morini ◽  
Michele Pinelli ◽  
Pier Ruggero Spina ◽  
Mauro Venturini

This paper documents the set-up and validation of nonlinear autoregressive exogenous (NARX) models of a heavy-duty single-shaft gas turbine. The considered gas turbine is a General Electric PG 9351FA located in Italy. The data used for model training are time-series data sets of several different maneuvers acquired experimentally during the start-up procedure, covering cold, warm and hot start-ups. The trained NARX models are used to predict other experimental data sets, and the model outputs are compared with the corresponding measured data. The paper thus addresses the challenge of setting up robust and reliable NARX models by means of a sound selection of training data sets and a sensitivity analysis on the number of neurons. Moreover, a new performance function for the training process is defined to give greater weight to the most rapid transients. The final aim of this paper is the set-up of a powerful, easy-to-build and very accurate simulation tool with good generalization capability, which can be used for both control logic tuning and gas turbine diagnostics.
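
A minimal sketch of the transient-weighting idea, assuming a linear-in-parameters NARX regressor fitted by weighted least squares; the paper trains a neural NARX model, and the signals and weighting function below are illustrative assumptions.

```python
import numpy as np

def narx_design(u, y, nu=2, ny=2):
    # Regressor rows [y(t-1)..y(t-ny), u(t-1)..u(t-nu), 1] for one-step prediction.
    n = max(nu, ny)
    rows = [np.r_[y[t - ny:t][::-1], u[t - nu:t][::-1], 1.0]
            for t in range(n, len(y))]
    return np.array(rows), y[n:]

def transient_weights(target, alpha=5.0):
    # Weigh samples by the output's rate of change, so the fit emphasises
    # the most rapid transients (the idea behind the custom performance
    # function described above).
    dy = np.abs(np.gradient(target))
    return 1.0 + alpha * dy / (dy.max() + 1e-12)

u = np.sin(np.linspace(0, 20, 500))           # hypothetical input command
y = np.convolve(u, np.ones(10) / 10, "same")  # hypothetical measured output
X, t = narx_design(u, y)
sw = np.sqrt(transient_weights(t))
theta, *_ = np.linalg.lstsq(X * sw[:, None], t * sw, rcond=None)
print("one-step RMSE:", np.sqrt(np.mean((X @ theta - t) ** 2)))
```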


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Abstract Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
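
As a toy sketch of the splitting step, the example below uses a simple inter-arrival-gap rule in place of the paper's classification-based splitting; the 2.0 s threshold and the feature set are illustrative assumptions.

```python
import numpy as np

def split_on_gaps(times, sizes, gap=2.0):
    # Cut a full packet trace wherever the inter-arrival gap exceeds `gap`
    # seconds, yielding candidate single-page sequences.
    cuts = np.where(np.diff(times) > gap)[0] + 1
    return list(zip(np.split(times, cuts), np.split(sizes, cuts)))

def page_features(times, sizes):
    # Coarse per-segment features (signed sizes: + outgoing, - incoming)
    # that a website-fingerprinting classifier could consume.
    out, inc = sizes[sizes > 0], sizes[sizes < 0]
    return np.array([len(out), len(inc), out.sum(), -inc.sum(),
                     times[-1] - times[0]])

# Example: two page loads separated by a long pause.
times = np.array([0.0, 0.1, 0.3, 5.0, 5.2, 5.4])
sizes = np.array([600, -1500, -1500, 580, -1500, -900])
segments = split_on_gaps(times, sizes)
print([page_features(t, s) for t, s in segments])
```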


2021 ◽  
Author(s):  
John Tsotsos ◽  
Jun Luo

Abstract Learned systems in the domain of visual recognition and cognition impress in part because even though they are trained with datasets many orders of magnitude smaller than the full population of possible images, they exhibit sufficient generalization to be applicable to new and previously unseen data. Since training data sets typically represent such a small sampling of any domain, the possibility of bias in their composition is very real. But what are the limits of generalization given such bias, and up to what point might it be sufficient for a real problem task? Although many have examined issues regarding generalization from several perspectives, this question may require examining the data itself. Here, we focus on the characteristics of the training data that may play a role. Other disciplines have grappled with these problems also, most interestingly epidemiology, where experimental bias is a critical concern. The range and nature of data biases seen clinically are really quite relatable to learned vision systems. One obvious way to deal with bias is to ensure a large enough training set, but this might be infeasible for many domains. Another approach might be to perform a statistical analysis of the actual training set, to determine if all aspects of the domain are fairly captured. This too is difficult, in part because the full set of important variables might not be known, or perhaps not even knowable. Here, we try a different, simpler, approach in the tradition of the Thought Experiment, whose most famous instance is perhaps Schrödinger's Cat, to address part of these problems. There are many types of bias as will be seen, but we focus only on one, selection bias. The point of the thought experiment is not to demonstrate problems with all learned systems. Rather, this might be a simple theoretical tool to probe into bias during data collection to highlight deficiencies that might then deserve extra attention either in data collection or system development.


Geophysics ◽  
2021 ◽  
pp. 1-103
Author(s):  
Jiho Park ◽  
Jihun Choi ◽  
Soon Jee Seol ◽  
Joongmoo Byun ◽  
Young Kim

Deep learning (DL) methods have recently been introduced for seismic signal processing, and many researchers have adopted these novel techniques to construct DL models for seismic data reconstruction. The performance of DL-based methods depends heavily on what is learned from the training data. We focus on constructing a DL model that well reflects the features of the target data sets. The main goal is to integrate DL with an intuitive data analysis approach that compares similar patterns prior to the DL training stage. We have developed a sequential method consisting of two stages: (i) analyzing the training and target data sets simultaneously to determine a target-informed training set and (ii) training the DL model with this training data set to effectively interpolate the seismic data. Here, we introduce convolutional autoencoder t-distributed stochastic neighbor embedding (CAE t-SNE) analysis, which can provide insight into the results of interpolation through the analysis of both the training and target data sets prior to DL model training. The proposed method was tested with synthetic and field data. Dense seismic gathers (e.g., common-shot gathers, CSGs) were used as the labeled training data set, and relatively sparse seismic gathers (e.g., common-receiver gathers, CRGs) were reconstructed in both cases. The reconstructed results and SNRs demonstrated that the training data can be efficiently selected using CAE t-SNE analysis and that the spatial aliasing of CRGs was successfully alleviated by the DL model trained on this target-informed training data. These results imply that data analysis for selecting a target-informed training set is very important for successful DL interpolation. The proposed analysis method can also be applied to investigate the similarities between training and target data sets for other DL-based seismic data reconstruction tasks.
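
A hedged sketch of the selection stage follows: random arrays stand in for latent codes produced by the convolutional autoencoder, and the nearest-target selection rule is an illustrative assumption rather than the authors' exact criterion.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
train_codes = rng.normal(size=(300, 64))   # CAE codes of candidate CSG patches
target_codes = rng.normal(size=(100, 64))  # CAE codes of target CRG patches

# Embed training and target codes jointly so their similarity shows up in 2-D.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([train_codes, target_codes]))
train_2d, target_2d = emb[:300], emb[300:]

# Keep training patches whose embedding lies close to some target patch,
# i.e. build a target-informed training set.
d_min = np.linalg.norm(train_2d[:, None, :] - target_2d[None, :, :],
                       axis=-1).min(axis=1)
selected = np.where(d_min < np.percentile(d_min, 50))[0]
print(f"selected {selected.size} of {train_codes.shape[0]} training patches")
```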


2019 ◽  
Vol 19 (1) ◽  
pp. 58-63 ◽  
Author(s):  
R. Ciucu ◽  
F.C. Adochiei ◽  
Ioana-Raluca Adochiei ◽  
F. Argatu ◽  
G.C. Seriţan ◽  
...  

Abstract Developing Artificial Intelligence is a labor-intensive task that requires both storage and computational resources. In this paper, we present a state-of-the-art service-based infrastructure for deploying, managing and serving computational models alongside their respective data sets and virtual environments. Our architecture uses key-based values to store specific graphs and data sets in memory for fast deployment and model training, further reducing the need for manual data reduction in the drafting and retraining stages. To develop the platform, we used clustering and orchestration to set up services and containers that allow deployment within seconds. In this article, we cover high-performance computing concepts such as swarming and GPU resource management for model implementation in production environments, with emphasis on standardized development to reduce integration tasks and on performance optimization.
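
A minimal sketch of the key-based, in-memory artefact store the abstract alludes to; the class and method names are hypothetical and not the paper's actual service API.

```python
import pickle
import time

class ArtefactStore:
    """Key-based, in-memory store for models and data sets, sketching
    fast keyed deployment; a toy stand-in, not the described platform."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, artefact):
        # Serialize once so repeated deployments reuse the cached bytes.
        self._blobs[key] = (pickle.dumps(artefact), time.time())

    def get(self, key):
        blob, _created = self._blobs[key]
        return pickle.loads(blob)

store = ArtefactStore()
store.put("mnist/cnn/v3", {"weights": [0.1, 0.2], "epochs": 12})
print(store.get("mnist/cnn/v3"))
```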


2021 ◽  
Author(s):  
Alastair McKinstry ◽  
Oisin Boydell ◽  
Quan Le ◽  
Inder Preet ◽  
Jennifer Hanafin ◽  
...  

The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth Observation (EO) data. While the quality and quantity of EO data have increased drastically over the past decades, the availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the findability, interoperability and reusability aspects. To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training data sets self-explanatory ("AI-ready") in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge.

Key elements addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative quality assurance metrics, data provenance and processing history, as well as integrated feature engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include, for example, forest biomass, sea ice detection and the estimation of atmospheric parameters. An API for the easy exploitation of these datasets will be provided to allow the training datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practices from the ML community) and to support updating and repeated model training over time.

This presentation will present the first version of the AIREO training dataset specification and will showcase some elements of the best practices that were developed. The AIREO-compliant pilot datasets, which are openly accessible, will be presented, and community feedback is explicitly encouraged.

[1] https://aireo.net/
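
For illustration only, a hypothetical STAC-flavoured record for one item of an AI-ready training dataset; every `aireo:*` field name below is invented here and is not the published AIREO schema.

```python
# Hypothetical STAC-style item for a training data set; the "aireo:*"
# property names are illustrative assumptions, not the official spec.
aireo_item = {
    "type": "Feature",
    "id": "sea-ice-tds-0001",
    "properties": {
        "datetime": "2020-03-15T10:30:00Z",
        "aireo:task": "sea ice detection",
        "aireo:label_type": "raster mask",
        "aireo:provenance": "Sentinel-1 GRD scene, manually annotated",
        "aireo:quality": {"label_agreement": 0.93},
    },
    "assets": {
        "imagery": {"href": "s3://example-bucket/sea-ice/0001/patch.tif"},
        "labels": {"href": "s3://example-bucket/sea-ice/0001/mask.tif"},
    },
}
```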


2003 ◽  
Vol 19 ◽  
pp. 315-354 ◽  
Author(s):  
G. M. Weiss ◽  
F. Provost

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a budget-sensitive progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.
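
A rough sketch of a budget-sensitive progressive sampling loop in this spirit; the greedy marginal-gain rule and the `fitness` callback are simplifying assumptions, not the article's exact algorithm.

```python
import numpy as np

def progressive_sample(pools, budget, fitness, batch=50, seed=0):
    # pools[c]: list of unused examples of class c; fitness(chosen) returns
    # a held-out score for a classifier trained on the chosen examples.
    rng = np.random.default_rng(seed)
    chosen = {c: [] for c in pools}
    gain = {c: np.inf for c in pools}   # optimistic init: try every class once
    score = 0.0
    while sum(map(len, chosen.values())) + batch <= budget:
        live = [c for c in pools if pools[c]]
        if not live:
            break
        c = max(live, key=lambda k: gain[k])      # buy from most promising class
        take = min(batch, len(pools[c]))
        idx = set(rng.choice(len(pools[c]), take, replace=False).tolist())
        chosen[c] += [x for i, x in enumerate(pools[c]) if i in idx]
        pools[c] = [x for i, x in enumerate(pools[c]) if i not in idx]
        new_score = fitness(chosen)
        gain[c], score = new_score - score, new_score  # measured marginal gain
    return chosen
```

The loop adapts the class distribution to whatever mixture actually improves held-out performance, rather than fixing the natural or balanced distribution in advance.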


2021 ◽  
Vol 14 (2) ◽  
pp. 120-128
Author(s):  
Mohammed Ehsan Safi ◽  
Eyad I. Abbas

In personal image recognition algorithms, two effective factors govern the system's evaluation: the recognition rate and the size of the database. Unfortunately, the recognition rate is proportional to the size of the training sets; increasing them therefore increases processing time and aggravates memory limitation problems. This paper's main goal is to present a robust algorithm with minimal data sets and a high recognition rate. Images of ten persons were chosen as a database: nine images per individual as the full version of the training data set, and one image per person outside the training set as a test pattern before the database reduction procedure. The proposed algorithm integrates Principal Component Analysis (PCA) as a feature extraction technique with the minimum mean of clusters and Euclidean distance to achieve personal recognition. After indexing the training set for each person, the clustering of the differences is determined. The person is recognized by the minimum mean index, and this process is repeated with each reduction. The experimental results show that the recognition rate is 100% despite reducing the training sets to 44%, while it decreases to 70% when the reduction reaches 89%. The clear conclusion is that the results of the proposed system support reducing the training sets while still obtaining a high recognition rate, subject to application requirements.
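
A condensed sketch of the recognition rule, assuming PCA features and a nearest-class-mean decision under Euclidean distance; the iterative reduction bookkeeping is omitted, and the function names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit(train_imgs, labels, n_components=20):
    # Project training faces onto PCA features and keep one mean per person.
    X = np.array([im.ravel() for im in train_imgs], dtype=float)
    pca = PCA(n_components=n_components).fit(X)
    Z = pca.transform(X)
    labels = np.array(labels)
    means = {p: Z[labels == p].mean(axis=0) for p in np.unique(labels)}
    return pca, means

def predict(pca, means, img):
    # Recognize the person whose cluster mean is nearest in Euclidean distance.
    z = pca.transform(img.ravel()[None, :].astype(float))[0]
    return min(means, key=lambda p: np.linalg.norm(z - means[p]))
```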


Author(s):  
Hamid Asgari ◽  
Mauro Venturini ◽  
XiaoQi Chen ◽  
Raazesh Sainudiin

This study deals with the modeling and simulation of the transient behavior of an Industrial Power Plant Gas Turbine (IPGT). The data used for model setup and validation were taken experimentally during the start-up procedure of a single-shaft heavy-duty gas turbine. Two different models, one physics-based and one black-box, are developed and compared; they are implemented using the Matlab tools Simulink and the Neural Network Toolbox, respectively. The Simulink model was constructed from the thermodynamic and energy balance equations in the Matlab environment. The nonlinear autoregressive with exogenous inputs (NARX) model was set up using the same data sets and subsequently applied to each of the data sets separately. The results showed that both the Simulink and NARX models are capable of satisfactory prediction, especially considering that the data used for model training and validation were experimental data taken during normal gas turbine operation using its standard instrumentation.


2006 ◽  
Vol 3 (2) ◽  
pp. 285-297 ◽  
Author(s):  
R. G. Kamp ◽  
H. H. G. Savenije

Abstract. Artificial Neural Networks have proven to be good modelling tools in hydrology for rainfall-runoff modelling and hydraulic flow modelling. Representative data sets are necessary for the training phase, in which the ANN learns the model's input-output relations. Good and representative training data are not always available, however. In this publication, Genetic Algorithms are used to optimise training data sets. The approach is tested with an existing hydrological model in The Netherlands. The optimised training set resulted in significantly better training results.
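
As a hedged sketch of the idea, a toy genetic algorithm that evolves binary masks over the candidate data; `fitness(mask)` would train the ANN on the masked subset and return a validation score. All names and GA settings here are illustrative assumptions, not the publication's configuration.

```python
import numpy as np

def ga_select(n_candidates, fitness, n_pick=100, pop=30, gens=40, seed=0):
    # Evolve boolean masks selecting ~n_pick training samples; keep the best
    # half each generation and refill with uniform-crossover children.
    rng = np.random.default_rng(seed)

    def random_mask():
        m = np.zeros(n_candidates, bool)
        m[rng.choice(n_candidates, n_pick, replace=False)] = True
        return m

    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = [population[i] for i in np.argsort(scores)[-pop // 2:]]
        children = []
        while len(parents) + len(children) < pop:
            a, b = rng.choice(len(parents), 2, replace=False)
            child = np.where(rng.random(n_candidates) < 0.5,
                             parents[a], parents[b])
            flip = rng.choice(n_candidates, 2)   # light mutation
            child[flip] = ~child[flip]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```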

