ActiveThief: Model Extraction Using Active Learning and Unannotated Public Data

2020 ◽  
Vol 34 (01) ◽  
pp. 865-872
Author(s):  
Soham Pal ◽  
Yash Gupta ◽  
Aditya Shukla ◽  
Aditya Kanade ◽  
Shirish Shevade ◽  
...  

Machine learning models are increasingly being deployed in practice. Machine Learning as a Service (MLaaS) providers expose such models to queries by third-party developers through application programming interfaces (APIs). Prior work has developed model extraction attacks, in which an attacker extracts an approximation of an MLaaS model by making black-box queries to it. We design ActiveThief, a model extraction framework for deep neural networks that uses active learning techniques and unannotated public datasets to perform model extraction. It does not require strong domain knowledge or access to annotated data on the part of the attacker. We demonstrate that (1) it is possible to use ActiveThief to extract deep classifiers trained on a variety of datasets from image and text domains, while querying the model with as few as 10-30% of samples from public datasets, (2) the resulting model exhibits a higher transferability success rate of adversarial examples than prior work, and (3) the attack evades detection by the state-of-the-art model extraction detection method, PRADA.
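
As an informal illustration of how active learning can drive such an attack, the sketch below pairs uncertainty sampling with black-box queries to a victim model. The `victim_predict` function, the unannotated `pool`, and the logistic-regression substitute are stand-ins of our choosing, not the paper's actual architecture or selection strategy.

```python
# Minimal sketch of active-learning-driven model extraction (illustrative,
# not the ActiveThief algorithm itself).
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract(victim_predict, pool, n_rounds=10, batch=100, seed_size=100):
    rng = np.random.default_rng(0)
    idx = rng.choice(len(pool), seed_size, replace=False)
    X = pool[idx]                       # public samples queried so far
    y = victim_predict(X)               # labels obtained from the victim API
    remaining = np.setdiff1d(np.arange(len(pool)), idx)
    substitute = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        substitute.fit(X, y)
        # Uncertainty sampling: query the pool samples whose current
        # substitute prediction has the highest entropy.
        probs = substitute.predict_proba(pool[remaining])
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        pick = remaining[np.argsort(entropy)[-batch:]]
        X = np.vstack([X, pool[pick]])
        y = np.concatenate([y, victim_predict(pool[pick])])
        remaining = np.setdiff1d(remaining, pick)
    substitute.fit(X, y)
    return substitute
```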

2021 ◽  
Author(s):  
Lucas Pereira ◽  
Nuno Velosa ◽  
Manuel Pereira

There is a consensus in the Non-Intrusive Load Monitoring research community on the importance of public datasets in furthering load disaggregation research. Nevertheless, despite the considerable efforts to release public data, few steps have been taken towards homogenizing how these datasets are made available to the community and providing seamless access to available and future datasets. In this paper, we present the Energy Monitoring and Disaggregation Data Format (EMD-DF64). EMD-DF64 is a data model, file format, and application programming interface developed to provide a single interface to create, manage, and access electric energy datasets. The proposed format is an extension of the well-known Sony Wave64 format and supports the storage of audio data and metadata annotations in the same file, thus reducing the number of artifacts that dataset users have to maintain.
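
To illustrate the underlying idea of reusing an audio container for energy data, the sketch below writes two waveform channels (voltage, current) into a single Wave64 file via the `soundfile` library, which wraps libsndfile and supports the W64 format. The sampling rate and signals are made up, and the real EMD-DF64 API and its metadata layout are not reproduced here.

```python
# Illustrative only: high-rate electrical waveforms in a Wave64 container.
import numpy as np
import soundfile as sf

fs = 12_800                                   # assumed sampling rate (Hz)
t = np.arange(fs) / fs
voltage = 230 * np.sqrt(2) * np.sin(2 * np.pi * 50 * t)
current = 5 * np.sqrt(2) * np.sin(2 * np.pi * 50 * t - 0.3)

# Two channels (voltage, current) stored in one W64 file, like stereo audio.
sf.write("meter.w64", np.column_stack([voltage, current]),
         samplerate=fs, format="W64", subtype="FLOAT")

data, rate = sf.read("meter.w64")             # read both channels back
print(data.shape, rate)                       # (12800, 2) 12800
```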


Author(s):  
K L Malanchev ◽  
M V Pruzhinskaya ◽  
V S Korolev ◽  
P D Aleo ◽  
M V Kornilov ◽  
...  

Abstract We present results from applying the SNAD anomaly detection pipeline to the third public data release of the Zwicky Transient Facility (ZTF DR3). The pipeline is composed of three stages: feature extraction, outlier search with machine learning algorithms, and anomaly identification with follow-up by human experts. Our analysis concentrates on three ZTF fields, comprising more than 2.25 million objects. A set of four machine learning algorithms was used to identify 277 outliers, which were subsequently scrutinised by an expert. Of these, 188 (68%) were found to be bogus light curves, including effects from the image subtraction pipeline as well as overlap between a star and a known asteroid; 66 (24%) were previously reported sources; and 23 (8%) correspond to non-catalogued objects, with the latter two cases being of potential scientific interest (e.g. one spectroscopically confirmed RS Canum Venaticorum star, four supernova candidates, one red dwarf flare). Moreover, using results from the expert analysis, we were able to identify a simple two-dimensional relation which can be used to help filter potentially bogus light curves in future studies. We provide a complete list of objects with potential scientific application so they can be further scrutinised by the community. These results confirm the importance of combining automatic machine learning algorithms with domain knowledge in the construction of recommendation systems for astronomy. Our code is publicly available.
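
A minimal sketch of what the outlier-search stage can look like, using a single isolation forest over a per-object feature table; the actual pipeline combines four algorithms and routes every candidate to expert vetting, and the feature matrix below is random stand-in data.

```python
# Outlier search over light-curve features (illustrative stand-in data).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(size=(10_000, 8))       # one row per ZTF object

X = StandardScaler().fit_transform(features)
forest = IsolationForest(n_estimators=300, contamination=0.001,
                         random_state=0)
labels = forest.fit_predict(X)                # -1 marks candidate anomalies
candidates = np.flatnonzero(labels == -1)     # hand these to human experts
print(len(candidates), "objects flagged for expert inspection")
```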


10.29007/thws ◽  
2019 ◽  
Author(s):  
Lukas Hahn ◽  
Lutz Roese-Koerner ◽  
Peet Cremer ◽  
Urs Zimmermann ◽  
Ori Maoz ◽  
...  

Active Learning is concerned with the question of how to identify the most useful samples for a Machine Learning algorithm to be trained with. When applied correctly, it can be a very powerful tool to counteract the immense data requirements of Artificial Neural Networks. However, we find that it is often applied without sufficient care and domain knowledge. As a consequence, unrealistic hopes are raised, and transferring experimental results from one dataset to another becomes unnecessarily hard. In this work we analyse the robustness of different Active Learning methods with respect to classifier capacity, exchangeability and type, as well as hyperparameters and falsely labelled data. Experiments reveal possible biases towards the architecture used for sample selection, resulting in suboptimal performance for other classifiers. We further propose the new "Sum of Squared Logits" method based on the Simpson diversity index and investigate the effect of using the confusion matrix for balancing in sample selection.
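
The abstract does not spell out the new method, so the sketch below shows one plausible reading: score each unlabeled sample by the Simpson diversity index of its predicted class distribution (the sum of squared components, here applied to softmax outputs) and query the samples with the lowest scores. The paper's exact definition, e.g. raw logits versus probabilities, may differ.

```python
# Simpson-index-style acquisition score: sum_i p_i^2 is high for peaked
# (confident) predictions and low for flat (uncertain) ones.
import numpy as np

def simpson_scores(probs):
    """probs: (n_samples, n_classes) softmax outputs."""
    return np.square(probs).sum(axis=1)

def select_batch(probs, k):
    # Query the k samples whose predictions are least peaked.
    return np.argsort(simpson_scores(probs))[:k]

probs = np.array([[0.98, 0.01, 0.01],   # confident -> high Simpson index
                  [0.40, 0.35, 0.25]])  # uncertain -> low Simpson index
print(simpson_scores(probs))            # [0.9606 0.345 ]
print(select_batch(probs, 1))           # [1]
```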


2022 ◽  
Vol 3 (2) ◽  
pp. 1-28
Author(s):  
Besat Kassaie ◽  
Elizabeth L. Irving ◽  
Frank Wm. Tompa

The standard approach to expert-in-the-loop machine learning is active learning, where, repeatedly, an expert is asked to annotate one or more records and the machine finds a classifier that respects all annotations made until that point. We propose an alternative approach, IQRef, in which the expert iteratively designs a classifier and the machine helps them determine how well it is performing and, importantly, when to stop, by reporting statistics on a fixed, hold-out sample of annotated records. We justify our approach based on prior work giving a theoretical model of how to re-use hold-out data. We compare the two approaches in the context of identifying a cohort of electronic health records (EHRs) and examine their strengths and weaknesses through a case study arising from an optometric research problem. We conclude that the two approaches are complementary, and we recommend that they be employed in conjunction to address the problem of cohort identification in health research.
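
A minimal sketch of the machine's side of such a loop: evaluate an expert-written rule on a fixed, pre-annotated hold-out set and report summary statistics. The record format, keyword rule, and metric choices are illustrative assumptions; a real deployment would also limit how often the hold-out is queried, following the hold-out-reuse theory the paper builds on.

```python
# Report hold-out statistics for an expert-designed classifier.
from sklearn.metrics import precision_score, recall_score, f1_score

def report(expert_rule, holdout_records, holdout_labels):
    preds = [expert_rule(r) for r in holdout_records]
    return {
        "precision": precision_score(holdout_labels, preds),
        "recall": recall_score(holdout_labels, preds),
        "f1": f1_score(holdout_labels, preds),
    }

# Example: a keyword rule for selecting EHRs into a cohort.
records = ["exotropia noted at exam", "routine refraction",
           "exotropia follow-up"]
labels = [1, 0, 1]
rule = lambda text: int("exotropia" in text)
print(report(rule, records, labels))
# {'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
```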


Author(s):  
Xueluan Gong ◽  
Yanjiao Chen ◽  
Wenbin Yang ◽  
Guanghao Mei ◽  
Qian Wang

Cloud service providers, including Google, Amazon, and Alibaba, have now launched machine-learning-as-a-service (MLaaS) platforms, allowing clients to access sophisticated cloud-based machine learning models via APIs. Unfortunately, the commercial value of these models makes them alluring targets for theft, and their strategic position as part of the IT infrastructure of many companies makes them an enticing springboard for conducting further adversarial attacks. In this paper, we put forth a novel and effective attack strategy, dubbed InverseNet, that steals the functionality of black-box cloud-based models with only a small number of queries. The crux of the innovation is that, unlike existing model extraction attacks that rely on public datasets or adversarial samples, InverseNet constructs inversed training samples to increase the similarity between the extracted substitute model and the victim model. Further, only a small number of data samples with high confidence scores (rather than an entire dataset) are used to reconstruct the inversed dataset, which substantially reduces the attack cost. Extensive experiments conducted on three simulated victim models and Alibaba Cloud's commercially available API demonstrate that InverseNet yields a model with significantly greater functional similarity to the victim model than the current state-of-the-art attacks, at a substantially lower query budget.
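
To illustrate what an "inversed" training sample is, the sketch below synthesizes an input that a model classifies with high confidence in a chosen class by gradient ascent. InverseNet itself works against a black-box victim; the white-box `model` here is a local substitute standing in for clarity, and all settings are assumptions.

```python
# Gradient-based inversion of a class-representative input (illustrative).
import torch
import torch.nn.functional as F

def invert(model, target_class, shape=(1, 3, 32, 32), steps=200, lr=0.1):
    x = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Minimizing cross-entropy to the target class pushes x towards
        # a high-confidence example of that class.
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        loss.backward()
        opt.step()
    return x.detach()                    # synthesized "inversed" sample

# e.g. substitute = torchvision.models.resnet18(num_classes=10)
#      sample = invert(substitute, target_class=3)
```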


2018 ◽  
Vol 9 (1) ◽  
pp. 24-31
Author(s):  
Rudianto Rudianto ◽  
Eko Budi Setiawan

The availability of application programming interfaces (APIs) for third-party applications on Android devices makes it possible for one Android device to monitor another. This capability is used to create an application that helps parents supervise their children through the children's Android devices. In this study, features were added for classifying image content on the monitored device with respect to negative content, using the Clarifai API. The result of this research is a system that can report the image files stored on the target smartphone and delete them, receive browser history reports and open the reported pages directly from the application, and report the child's location and contact the child directly through the application. The application works well on Android Lollipop (API level 22). Index Terms— Application Programming Interface (API), Monitoring, Negative Content, Children, Parent.
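
A hedged sketch of how an image could be screened through Clarifai's public v2 REST API from Python; the model ID, API key, and the suggested score threshold are placeholders rather than the values used in the study.

```python
# Query a Clarifai v2 model with a local image and return concept scores.
import base64
import requests

API_KEY = "YOUR_CLARIFAI_API_KEY"          # placeholder
MODEL_ID = "moderation-model-id"           # placeholder moderation model

def classify_image(path):
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        f"https://api.clarifai.com/v2/models/{MODEL_ID}/outputs",
        headers={"Authorization": f"Key {API_KEY}"},
        json={"inputs": [{"data": {"image": {"base64": img_b64}}}]},
    )
    resp.raise_for_status()
    concepts = resp.json()["outputs"][0]["data"]["concepts"]
    return {c["name"]: c["value"] for c in concepts}

# e.g. flag the file if any negative-content concept scores above 0.85.
```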


2021 ◽  
Author(s):  
Vu-Linh Nguyen ◽  
Mohammad Hossein Shaker ◽  
Eyke Hüllermeier

Abstract Various strategies for active learning have been proposed in the machine learning literature. In uncertainty sampling, which is among the most popular approaches, the active learner sequentially queries the label of those instances for which its current prediction is maximally uncertain. The predictions, as well as the measures used to quantify the degree of uncertainty, such as entropy, are traditionally of a probabilistic nature. Yet, alternative approaches to capturing uncertainty in machine learning, along with corresponding uncertainty measures, have been proposed in recent years. In particular, some of these measures seek to distinguish different sources of uncertainty and to separate different types, such as the reducible (epistemic) and the irreducible (aleatoric) part of the total uncertainty in a prediction. The goal of this paper is to elaborate on the usefulness of such measures for uncertainty sampling and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.
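
As a concrete example of such measures, the sketch below approximates the distribution over predictors with an ensemble and computes one common decomposition: total uncertainty as the entropy of the mean prediction, aleatoric as the mean per-member entropy, and epistemic as their difference (the mutual information). This is one decomposition among several discussed in this literature, not necessarily the paper's exact instantiation.

```python
# Ensemble-based decomposition of total, aleatoric, epistemic uncertainty.
import numpy as np

def uncertainties(member_probs):
    """member_probs: (n_members, n_samples, n_classes) predictions."""
    eps = 1e-12
    mean = member_probs.mean(axis=0)
    total = -(mean * np.log(mean + eps)).sum(axis=-1)
    aleatoric = -(member_probs * np.log(member_probs + eps)) \
        .sum(axis=-1).mean(axis=0)
    epistemic = total - aleatoric        # mutual information
    return total, aleatoric, epistemic

# Uncertainty sampling then queries the instances maximizing the chosen
# measure, e.g.: idx = np.argsort(epistemic)[-batch_size:]
```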


Data ◽  
2021 ◽  
Vol 6 (8) ◽  
pp. 87
Author(s):  
Sara Ferreira ◽  
Mário Antunes ◽  
Manuel E. Correia

Deepfake and manipulated digital photos and videos are being increasingly used in a myriad of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, in which tampered multimedia content has been the primordial disseminating vehicle. Digital forensic analysis tools are widely used in criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need to employ efficient digital forensics techniques grounded on state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods to improve the automatic detection of manipulated multimedia content. However, the implementation of such methods has not yet been massively incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial for benchmarking ML models and evaluating their suitability for real-world digital forensics applications. An example is the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos, which are part of state-of-the-art existing datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository, and the total number of photos and video frames is 40,588 and 12,400, respectively. The dataset was validated and benchmarked with deep learning Convolutional Neural Network (CNN) and Support Vector Machine (SVM) methods, although a plethora of other existing methods can be applied. Overall, the results show a better F1-score for CNN than for SVM, for both photo and video processing. CNN achieved an F1-score of 0.9968 and 0.8415 for photos and videos, respectively. Regarding SVM, the results obtained with 5-fold cross-validation are 0.9953 and 0.7955 for photos and videos, respectively. A set of methods written in Python is available to researchers, namely to preprocess the original photo and video files, extract the features, and build the training and testing sets. Additional methods are available to convert the original PKL files into CSV and TXT, giving ML researchers more flexibility to use the dataset with existing ML frameworks and tools.
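
A minimal sketch of what DFT-based feature extraction from an image can look like, azimuthally averaging the log-magnitude spectrum into a fixed-length vector; the paper's exact feature definition and vector length are not quoted here, so `n_features` and the helper name are assumptions.

```python
# Spectral signature of a grayscale image via the 2-D DFT (illustrative).
import numpy as np

def dft_features(gray_image, n_features=50):
    """gray_image: 2-D float array. Returns a fixed-length signature."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray_image))
    magnitude = 20 * np.log(np.abs(spectrum) + 1e-8)
    h, w = magnitude.shape
    cy, cx = h // 2, w // 2
    # Azimuthally average the magnitude over rings of growing radius.
    y, x = np.indices(magnitude.shape)
    r = np.hypot(y - cy, x - cx).astype(int)
    counts = np.bincount(r.ravel())
    radial = np.bincount(r.ravel(), magnitude.ravel()) / np.maximum(counts, 1)
    return radial[:n_features]           # one feature row (label added later)

# e.g. features = dft_features(
#     np.asarray(Image.open("frame.png").convert("L"), dtype=float))
```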


2021 ◽  
Vol 10 (3) ◽  
pp. 154
Author(s):  
Robert Jeansoulin

Providing long-term data about the evolution of railway networks in Europe may help us understand how European Union (EU) member states behave in the long term, and how they can comply with present EU recommendations. This paper proposes a methodology for collecting data about railway stations at the maximal extent of the French railway network, a century ago. The expected outcome is a geocoded dataset of French railway stations (gares), which: (a) links gares to each other, and (b) links gares with French communes, the basic administrative level for statistical information. Present stations are well documented in public data, but thousands of past stations are sparsely recorded, not geocoded, and often ignored, except in volunteer geographic information (VGI), either collaboratively through Wikipedia or individually. VGI is very valuable in keeping track of that heritage, and remote sensing, including aerial photography, is often the last chance to obtain precise locations. The approach is a series of steps: (1) meta-analysis of the public datasets; (2) a three-step fusion (measure, decision, combination) between public datasets; (3) computer-assisted geocoding for gares where fusion fails; (4) integration of additional gares gathered from VGI; (5) automated quality control, indicating where quality is questionable. These five families of methods form a comprehensive computer-assisted reconstruction process (CARP), which constitutes the core of this paper. The outcome is a reliable dataset, in GeoJSON format under an open license, encompassing (by January 2021) more than 10,700 items linked to about 7,500 of the 35,500 communes of France: that is 60% more than recorded before. This work demonstrates that: (a) it is possible to reconstruct transport data from the past at a national scale; (b) the value of remote sensing and of VGI is considerable in completing public sources from an historical perspective; (c) data quality can be monitored all along the process; and (d) the geocoded outcome is ready for a large variety of further studies with statistical data (demography, density, space coverage, CO2 simulation, environmental policies, etc.).
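
As an illustration of the measure-decision-combination idea in step (2), the sketch below scores a candidate match between two station records by name similarity and spatial distance, then accepts it, rejects it, or defers it to the assisted geocoding of step (3). Field names, coordinates, and thresholds are illustrative assumptions.

```python
# Measure-decision step for fusing station records from two datasets.
from difflib import SequenceMatcher
from math import hypot

def match_score(a, b):
    name_sim = SequenceMatcher(None, a["name"].lower(),
                               b["name"].lower()).ratio()
    dist_km = hypot(a["x_km"] - b["x_km"], a["y_km"] - b["y_km"])
    return name_sim, dist_km

def decide(a, b, sim_min=0.85, dist_max=2.0):
    sim, dist = match_score(a, b)
    if sim >= sim_min and dist <= dist_max:
        return "combine"                 # same gare: merge attributes
    if sim < 0.5:
        return "reject"                  # clearly different stations
    return "defer"                       # route to assisted geocoding

a = {"name": "Meung sur Loire", "x_km": 601.2, "y_km": 2319.8}
b = {"name": "Meung-sur-Loire", "x_km": 601.4, "y_km": 2319.9}
print(decide(a, b))                      # combine
```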


2021 ◽  
Author(s):  
Tom Young ◽  
Tristan Johnston-Wood ◽  
Volker L. Deringer ◽  
Fernanda Duarte

Predictive molecular simulations require fast, accurate and reactive interatomic potentials. Machine learning offers a promising approach to construct such potentials by fitting energies and forces to high-level quantum-mechanical data, but...

