Olympus: Sensor Privacy through Utility Aware Obfuscation

Abstract Personal data garnered from various sensors are often offloaded by applications to the cloud for analytics. This leads to a potential risk of disclosing private user information. We observe that the analytics run on the cloud are often limited to a machine learning model such as predicting a user’s activity using an activity classifier. We present Olympus, a privacy framework that limits the risk of disclosing private user information by obfuscating sensor data while minimally affecting the functionality the data are intended for. Olympus achieves privacy by designing a utility aware obfuscation mechanism, where privacy and utility requirements are modeled as adversarial networks. By rigorous and comprehensive evaluation on a real world app and on benchmark datasets, we show that Olympus successfully limits the disclosure of private information without significantly affecting functionality of the application.

Download Full-text

A study of real-world micrograph data quality and machine learning model robustness

npj Computational Materials ◽

10.1038/s41524-021-00616-3 ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Xiaoting Zhong ◽

Brian Gallagher ◽

Keenan Eves ◽

Emily Robertson ◽

T. Nathan Mundhenk ◽

...

Keyword(s):

Machine Learning ◽

Data Quality ◽

Real World ◽

Image Feature ◽

Molecular Solids ◽

Pixel Intensity ◽

Machine Learning Model ◽

Model Predictions ◽

Intensity Normalization ◽

Model Robustness

AbstractMachine-learning (ML) techniques hold the potential of enabling efficient quantitative micrograph analysis, but the robustness of ML models with respect to real-world micrograph quality variations has not been carefully evaluated. We collected thousands of scanning electron microscopy (SEM) micrographs for molecular solid materials, in which image pixel intensities vary due to both the microstructure content and microscope instrument conditions. We then built ML models to predict the ultimate compressive strength (UCS) of consolidated molecular solids, by encoding micrographs with different image feature descriptors and training a random forest regressor, and by training an end-to-end deep-learning (DL) model. Results show that instrument-induced pixel intensity signals can affect ML model predictions in a consistently negative way. As a remedy, we explored intensity normalization techniques. It is seen that intensity normalization helps to improve micrograph data quality and ML model robustness, but microscope-induced intensity variations can be difficult to eliminate.

Download Full-text

Generating Pseudo-Data to Enhance the Performance of Classification-Based Engineering Design: A Preliminary Investigation

Volume 6: Design, Systems, and Complexity ◽

10.1115/imece2020-24634 ◽

2020 ◽

Author(s):

Xianping Du ◽

Onur Bilgen ◽

Hongyi Xu

Keyword(s):

Machine Learning ◽

Engineering Design ◽

Real World ◽

Surrogate Model ◽

Preliminary Investigation ◽

Learning Model ◽

Classification Model ◽

Design Decision ◽

Large Dataset ◽

Machine Learning Model

Abstract Machine learning for classification has been used widely in engineering design, for example, feasible domain recognition and hidden pattern discovery. Training an accurate machine learning model requires a large dataset; however, high computational or experimental costs are major issues in obtaining a large dataset for real-world problems. One possible solution is to generate a large pseudo dataset with surrogate models, which is established with a smaller set of real training data. However, it is not well understood whether the pseudo dataset can benefit the classification model by providing more information or deteriorates the machine learning performance due to the prediction errors and uncertainties introduced by the surrogate model. This paper presents a preliminary investigation towards this research question. A classification-and-regressiontree model is employed to recognize the design subspaces to support design decision-making. It is implemented on the geometric design of a vehicle energy-absorbing structure based on finite element simulations. Based on a small set of real-world data obtained by simulations, a surrogate model based on Gaussian process regression is employed to generate pseudo datasets for training. The results showed that the tree-based method could help recognize feasible design domains efficiently. Furthermore, the additional information provided by the surrogate model enhances the accuracy of classification. One important conclusion is that the accuracy of the surrogate model determines the quality of the pseudo dataset and hence, the improvements in the machine learning model.

Download Full-text

Glean

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447703 ◽

2021 ◽

Vol 14 (6) ◽

pp. 997-1005

Author(s):

Sandeep Tata ◽

Navneet Potti ◽

James B. Wendt ◽

Lauro Beltrão Costa ◽

Marc Najork ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Real World ◽

Empirical Studies ◽

Ground Truth ◽

Training Data ◽

Ground Truth Data ◽

Document Type ◽

Machine Learning Model ◽

Structured Information

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges : 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

Download Full-text

Lifelong Spectral Clustering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6045 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5867-5874

Author(s):

Gan Sun ◽

Yang Cong ◽

Qianqian Wang ◽

Jun Li ◽

Yun Fu

Keyword(s):

Machine Learning ◽

Real World ◽

Spectral Clustering ◽

State Of The Art ◽

Clustering Algorithms ◽

Orthogonal Basis ◽

Learning Framework ◽

The Past ◽

Benchmark Datasets ◽

Over Time

In the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering tasks with a fixed task set, which cannot incorporate with a new spectral clustering task without accessing to previously learned tasks. In this paper, we aim to explore the problem of spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) orthogonal basis library: capturing latent cluster centers among the clusters in each pair of tasks; 2) feature embedding library: embedding the feature manifold information shared among multiple related tasks. As a new spectral clustering task arrives, L2SC firstly transfers knowledge from both basis library and feature library to obtain encoding matrix, and further redefines the library base over time to maximize performance across all the clustering tasks. Meanwhile, a general online update formulation is derived to alternatively update the basis library and feature library. Finally, the empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model can effectively improve the clustering performance when comparing with other state-of-the-art spectral clustering algorithms.

Download Full-text

Developing a machine learning environmental allergy prediction model from real world data through a novel decentralized mobile study platform.

10.1101/2020.09.21.20199224 ◽

2020 ◽

Author(s):

Chethan Sarabu ◽

Sandra Steyaert ◽

Nirav Shah

Keyword(s):

Machine Learning ◽

Real World ◽

Predictive Models ◽

Clinical Care ◽

Sensor Data ◽

Real World Data ◽

World Data ◽

Demographic Groups ◽

Prediction And Prevention ◽

Wide Range

Environmental allergies cause significant morbidity across a wide range of demographic groups. This morbidity could be mitigated through individualized predictive models capable of guiding personalized preventive measures. We developed a predictive model by integrating smartphone sensor data with symptom diaries maintained by patients. The machine learning model was found to be highly predictive, with an accuracy of 0.801. Such models based on real-world data can guide clinical care for patients and providers, reduce the economic burden of uncontrolled allergies, and set the stage for subsequent research pursuing allergy prediction and prevention. Moreover, this study offers proof-of-principle regarding the feasibility of building clinically useful predictive models from 'messy,' participant derived real-world data.

Download Full-text

Dear Watch, Should I get a COVID Test? Designing deployable machine learning for wearables

10.21203/rs.3.rs-505984/v1 ◽

2021 ◽

Author(s):

Anna Goldenberg ◽

Bret Nestor ◽

Jaryd Hunter ◽

Raghu Kainkaryam ◽

Erik Drysdale ◽

...

Keyword(s):

Public Health ◽

Machine Learning ◽

Real World ◽

False Positive Rate ◽

Wearable Devices ◽

Classification Performance ◽

Wearable Device ◽

Screening Tools ◽

Machine Learning Model ◽

Positive Rate

Abstract Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world deployment. Several points must be addressed to transition from statistically significant differences between infected and uninfected cohorts to COVID-19 inferences on individuals. We demonstrate the strengths and shortcomings of existing approaches on a cohort of 32,198 individuals who experience influenza like illness (ILI), 204 of which report testing positive for COVID-19. We show that, despite commonly made design mistakes resulting in overestimation of performance, when properly designed wearables can be effectively used as a part of the detection pipeline. For example, knowing the week of year, combined with naive randomised test set generation leads to substantial overestimation of COVID-19 classification performance at 0.73 AUROC. However, an average AUROC of only 0.55 +/- 0.02 would be attainable in a simulation of real-world deployment, due to the shifting prevalence of COVID-19 and non-COVID-19 ILI to trigger further testing. In this work we show how to train a machine learning model to differentiate ILI days from healthy days, followed by a survey to differentiate COVID-19 from influenza and unspecified ILI based on symptoms. In a forthcoming week, models can expect a sensitivity of 0.50 (0-0.74, 95% CI), while utilising the wearable device to reduce the burden of surveys by 35%. The corresponding false positive rate is 0.22 (0.02-0.47, 95% CI). In the future, serious consideration must be given to the design, evaluation, and reporting of wearable device interventions if they are to be relied upon as part of frequent COVID-19 or other public health threat testing infrastructures.

Download Full-text

Quantitative Toxicity Prediction via Ensembling of Heterogeneous Predictors

10.21203/rs.2.19338/v1 ◽

2019 ◽

Author(s):

Abdul Karim ◽

Vahid Riahi ◽

Avinash Mishra ◽

Abdollah Dehzangi ◽

M. A. Hakim Newton ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Prediction Models ◽

Individual Performance ◽

Learning Model ◽

Data Representation ◽

Toxicity Prediction ◽

Machine Learning Model ◽

Machine Learning Approach ◽

Benchmark Datasets

Abstract Representing molecules in the form of only one type of features and using those features to predict their activities is one of the most important approaches for machine-learning-based chemical-activity-prediction. For molecular activities like quantitative toxicity prediction, the performance depends on the type of features extracted and the machine learning approach used. For such cases, using one type of features and machine learning model restricts the prediction performance to specific representation and model used. In this paper, we study quantitative toxicity prediction and propose a machine learning model for the same. Our model uses an ensemble of heterogeneous predictors instead of typically using homogeneous predictors. The predictors that we use vary either on the type of features used or on the deep learning architecture employed. Each of these predictors presumably has its own strengths and weaknesses in terms of toxicity prediction. Our motivation is to make a combined model that utilizes different types of features and architectures to obtain better collective performance that could go beyond the performance of each individual predictor. We use six predictors in our model and test the model on four standard quantitative toxicity benchmark datasets. Experimental results show that our model outperforms the state-of-the-art toxicity prediction models in 8 out of 12 accuracy measures. Our experiments show that ensembling heterogeneous predictor improves the performance over single predictors and homogeneous ensembling of single predictors.The results show that each data representation or deep learning based predictor has its own strengths and weaknesses, thus employing a model ensembling multiple heterogeneous predictors could go beyond individual performance of each data representation or each predictor type.

Download Full-text

A Novel PCA-Firefly Based XGBoost Classification Model for Intrusion Detection in Networks Using GPU

Electronics ◽

10.3390/electronics9020219 ◽

2020 ◽

Vol 9 (2) ◽

pp. 219 ◽

Cited By ~ 37

Author(s):

Sweta Bhattacharya ◽

Siva Rama Krishnan S ◽

Praveen Kumar Reddy Maddikunta ◽

Rajesh Kaluri ◽

Saurabh Singh ◽

...

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

Comprehensive Evaluation ◽

Detection System ◽

Human Life ◽

Principal Component ◽

Cyber Attacks ◽

Classification Model ◽

Learning Approaches ◽

Machine Learning Model

The enormous popularity of the internet across all spheres of human life has introduced various risks of malicious attacks in the network. The activities performed over the network could be effortlessly proliferated, which has led to the emergence of intrusion detection systems. The patterns of the attacks are also dynamic, which necessitates efficient classification and prediction of cyber attacks. In this paper we propose a hybrid principal component analysis (PCA)-firefly based machine learning model to classify intrusion detection system (IDS) datasets. The dataset used in the study is collected from Kaggle. The model first performs One-Hot encoding for the transformation of the IDS datasets. The hybrid PCA-firefly algorithm is then used for dimensionality reduction. The XGBoost algorithm is implemented on the reduced dataset for classification. A comprehensive evaluation of the model is conducted with the state of the art machine learning approaches to justify the superiority of our proposed approach. The experimental results confirm the fact that the proposed model performs better than the existing machine learning models.

Download Full-text

Fuzzy Machine Learning Model in Real-World Physical Domains; A State-of-the-Art Approach

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488521500446 ◽

2021 ◽

Author(s):

Sunita Dhote ◽

Rupesh Pais ◽

Chandan Vichoray ◽

S. Baskar

Keyword(s):

Machine Learning ◽

Real World ◽

State Of The Art ◽

Learning Model ◽

Machine Learning Model

Download Full-text

Usage of generative adversarial networks for edge prediction in ontology graphs

Informacionno-technologicheskij vestnik ◽

10.21499/2409-1650-2020-26-4-96-103 ◽

2020 ◽

pp. 96-103

Author(s):

О.П. Мосалов

Keyword(s):

Machine Learning ◽

Learning Model ◽

Computational Experiments ◽

Generative Adversarial Networks ◽

Adversarial Networks ◽

Machine Learning Model ◽

Parameter Values

Рассматривается модель машинного обучения для предсказания существования рёбер в графе онтологии, основанная на использовании генеративно-состязательной сети. Проведены вычислительные эксперименты для различных наборов значений гиперпараметров модели. Показано, что модель решает поставленную задачу. Сформулированы направления дальнейшего развития данного подхода. A machine learning model for edge existence prediction in an ontology graph, based on generative adversarial networks, is considered. Computational experiments for different sets of hyper parameter values are fulfilled. It is shown that the model solves the task. Further steps on this approach research are formulated.

Download Full-text