Attacking Data Transforming Learners at Training Time

Author(s):  
Scott Alfeld ◽  
Ara Vartanian ◽  
Lucas Newman-Johnson ◽  
Benjamin I.P. Rubinstein

While machine learning systems are known to be vulnerable to data-manipulation attacks at both training and deployment time, little is known about how to adapt attacks when the defender transforms data prior to model estimation. We consider the setting where the defender, Bob, first transforms the data and then learns a model from the result; Alice, the attacker, perturbs Bob’s input data before he transforms it. We develop a general-purpose “plug and play” framework for gradient-based attacks based on matrix differentials, focusing on ordinary least-squares linear regression. This allows learning algorithms and data transformations to be paired and composed arbitrarily: attacks can be adapted to compositional learning maps through the chain rule, analogous to backpropagation on neural network parameters. Best-response attacks can be computed through matrix multiplications from a library of attack matrices for transformations and learners. Our treatment of linear regression extends state-of-the-art attacks at training time by permitting the attacker to affect both features and targets optimally and simultaneously. We explore several transformations broadly used across machine learning, with a driving motivation for our work being autoregressive modeling. There, Bob transforms a univariate time series into a matrix of observations and a vector of target values, which can then be fed into standard learners. Under this learning reduction, a perturbation from Alice to a single value of the time series affects the features of several data points along with target values.
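
The autoregressive reduction described in this abstract can be illustrated with a minimal sketch. It is not the authors' framework or code: the helper names `ar_embed` and `ols`, the toy series, and the choice of order 3 are all assumptions made for illustration. The sketch only shows how a change to one time-series value spreads across several feature cells and one target after the transformation.

```python
import numpy as np

def ar_embed(series, order):
    """Hypothetical helper: turn a univariate series into (X, y) for an AR(order) fit.

    Row t of X holds series[t], ..., series[t + order - 1]; y[t] is series[t + order].
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - order
    X = np.column_stack([series[i:i + n] for i in range(order)])
    y = series[order:]
    return X, y

def ols(X, y):
    """Ordinary least squares via NumPy's least-squares solver."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Toy series (an assumption, not data from the paper).
series = np.arange(10, dtype=float)
order = 3
X, y = ar_embed(series, order)

# Alice perturbs a single value of the raw series before Bob transforms it.
perturbed = series.copy()
perturbed[5] += 1.0
Xp, yp = ar_embed(perturbed, order)

# One raw value shows up in several feature cells and in one target.
changed_features = np.argwhere(X != Xp)    # (row, column) pairs that differ
changed_targets = np.flatnonzero(y != yp)  # target indices that differ

print("feature cells changed:", changed_features.tolist())
print("targets changed:", changed_targets.tolist())
print("coefficient shift:", ols(Xp, yp) - ols(X, y))
```

With order 3, the perturbed value lands in three feature cells (one per lag column) and one target, which is why an attack on the raw series has to account for the transformation rather than treat features and targets as independent.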

2021 ◽  
Vol 9 ◽  
Author(s):  
Geetha Mani ◽  
Joshi Kumar Viswanadhapalli ◽  
Albert Alexander Stonie ◽  
...  

Air is one of the most fundamental constituents for the sustenance of life on earth. Meteorological and traffic factors, consumption of non-renewable energy sources, and industrial parameters are steadily increasing air pollution. These factors affect the welfare and prosperity of life on earth; therefore, the air quality in our environment needs to be monitored continuously. The Air Quality Index (AQI), which indicates air quality, is influenced by several individual factors such as the accumulation of NO2, CO, O3, PM2.5, SO2, and PM10. This research paper aims to predict and forecast the AQI with Machine Learning (ML) techniques, namely linear regression and time series analysis. First, a Multi Linear Regression (MLR) model, a supervised machine learning method, is developed to predict AQI. NO2, ozone (O3), PM2.5, and SO2 sensor outputs collected from the Central Pollution Control Board (CPCB) for the Chennai region, India, are fed as input features, and the optimized AQI calculated from the sensor outputs is set as the target to train the regression model. The obtained model parameters are validated with new and unseen sensor output. Key Performance Indices (KPI) such as the coefficient of determination, root mean square error, and mean absolute error were calculated to validate the model accuracy. The k-fold cross-validation score of the MLR on test data was around 92%. Second, the Auto-Regressive Integrated Moving Average (ARIMA) time series model is applied to forecast the AQI. The obtained model parameters were validated with unseen, timestamped data. The forecasted AQI values for the next 15 days lie within the 95% confidence interval. The model accuracy on test data was more than 80%.
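
The three performance indices named in this abstract have standard definitions, sketched below. The function name `regression_kpis` and the four sample values are illustrative assumptions; they are not taken from the paper or its CPCB data.

```python
import numpy as np

def regression_kpis(y_true, y_pred):
    """Hypothetical helper: coefficient of determination, RMSE and MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(resid ** 2)),
        "mae": np.mean(np.abs(resid)),
    }

# Made-up values, chosen only so the arithmetic is easy to follow.
kpis = regression_kpis([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(kpis)
```

Here the single residual of -2 gives a residual sum of squares of 4 against a total of 5, so the coefficient of determination is 0.2, the RMSE is 1.0, and the MAE is 0.5.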


2019 ◽  
Vol 13 (1) ◽  
pp. 37-58
Author(s):  
Ilma Yuni Rosita ◽  
Lilis Imamah Ichdayati ◽  
Rizki Adi Puspita Sari

This study aims to analyze the factors that affect the volume of Indonesian cocoa exports to Malaysia. Multiple linear regression and ordinary least squares (OLS) were employed to analyze time-series data from 2005 until 2013. The analysis shows that the factors that significantly affect the volume of Indonesian cocoa exports to Malaysia, at a significance level (α) of five percent, are the real price of Indonesian cocoa exports to Malaysia and the real price of cocoa beans on the international market.


2021 ◽  
Author(s):  
Krishnapriya Subramanian

The objective of this thesis is to analyse psychometric data using statistical and machine learning methods. Psychological data are analysed to predict illness and injury of athletes. Regression, one of the statistical processes for estimating the relationship among variables, is used as the basis of this thesis. We apply linear regression, time series, and logistic regression to predict illness and well-being. Our linear regression simulation results are mainly used to understand the data well. By reviewing the results of linear regression, a time series model is developed which predicts sickness one day ahead. The predicted values of this time series model are continuous. However, logistic regression can be used to provide a probabilistic approach to predicting future levels as a categorical value. Hence we have developed a binomial logistic regression model for the case where the observed variable is dichotomous. Our simulation results show that this prediction model performs well. Our empirical studies also show that our method can act as an early warning system for athletes.


2019 ◽  
Vol 11 (2) ◽  
pp. 161-182
Author(s):  
Ilma Yuni Rosita ◽  
Lilis Imamah Ichdayati ◽  
Rizki Adi Puspita Sari

This study aims to analyze the factors that affect the volume of Indonesian cocoa exports to Malaysia. Multiple linear regression and ordinary least squares (OLS) were employed to analyze time-series data from 2005 until 2013. The analysis shows that the factors that significantly affect the volume of Indonesian cocoa exports to Malaysia, at a significance level (α) of five percent, are the real price of Indonesian cocoa exports to Malaysia and the real price of cocoa beans on the international market.


2021 ◽  
Vol 13 (5) ◽  
pp. 934
Author(s):  
Floris Calkoen ◽  
Arjen Luijendijk ◽  
Cristian Rodriguez Rivero ◽  
Etienne Kras ◽  
Fedor Baart

Forecasting shoreline evolution for sandy coasts is important for sustainable coastal management, given the present-day increasing anthropogenic pressures and a changing future climate. Here, we evaluate eight different time-series forecasting methods for predicting future shorelines from historic satellite-derived shorelines. Analyzing more than 37,000 transects around the globe, we find that traditional forecast methods, together with some of the evaluated probabilistic Machine Learning (ML) time-series forecast algorithms, outperform Ordinary Least Squares (OLS) predictions for the majority of the sites. When forecasting seven years ahead, we find that these algorithms generate better predictions than OLS for 54% of the transect sites, producing forecasts with, on average, 29% smaller Mean Squared Error (MSE). Importantly, this advantage is shown to exist over all considered forecast horizons, i.e., from 1 up to 11 years. Although the ML algorithms do not produce significantly better predictions than traditional time-series forecast methods, some proved to be significantly more efficient in terms of computation time. We further provide insight into how these ML algorithms can be improved so that they can be expected to outperform not only OLS regression, but also the traditional time-series forecast methods. These forecasting algorithms can be used by coastal engineers, managers, and scientists to generate future shoreline predictions at a global level and derive conclusions from them.


2018 ◽  
Vol 7 (3.12) ◽  
pp. 960
Author(s):  
Anila. M ◽  
G Pradeepini

The most commonly used prediction technique is Ordinary Least Squares Regression (OLS Regression). It has been applied in many fields such as statistics, finance, medicine, psychology, and economics. Many people who use this technique, especially data scientists, know that it is often applied without enough training, and that it should be checked why and when it can or cannot be applied. It is not an easy task to find or explain why least squares regression [1] has faced so much criticism when it is fitted and applied. In this paper, we first cover the fundamentals of linear regression and OLS regression, along with the popularity of the LS method; we then present our analysis of the difficulties and pitfalls that arise when the OLS method is applied; and finally we describe some techniques for overcoming these problems.


2020 ◽  
Author(s):  
Laura Martínez Ferrer ◽  
Maria Piles ◽  
Gustau Camps-Valls

Providing accurate and spatially resolved predictions of crop yield is of utmost importance due to the rapid increase in the demand for biofuels and food in the foreseeable future. Satellite-based remote sensing over agricultural areas allows monitoring crop development through key bio-geophysical variables such as the Enhanced Vegetation Index (EVI), sensitive to canopy greenness, the Vegetation Optical Depth (VOD), sensitive to biomass water-uptake dynamics, and Soil Moisture (SM), which provides direct information on plant-available water. The aim of this work is to implement an automatic system for county-based crop yield estimation using time series from multisource satellite observations, meteorological data, and available in situ surveys as supporting information. The spatio-temporal resolution of satellite and meteorological observations is fully exploited and synergistically combined for crop yield prediction using machine learning models. Linear and non-linear regression methods are used: least squares, LASSO, random forests, kernel machines, and Gaussian processes. Here we are interested not only in the prediction skill, but also in understanding the relative relevance of the covariates. For this, we first study the importance of each feature separately and then propose a global model for operational monitoring of crop status using the most relevant agro-ecological drivers.

We selected the Continental U.S. and a four-year time series dataset to perform the research study. Results reveal that the three satellite variables are complementary and that their combination with maximum temperature and precipitation from meteorological stations provides the best estimations. Interestingly, adding information about crop planted area also improved the predictions. A non-linear regression model based on Gaussian processes led to the best results for all considered crops (soybean, corn, and wheat), with high accuracy (low bias and correlation coefficients ranging from 0.75 to 0.92). The feature ranking allowed understanding the main drivers for crop monitoring and the underlying factors behind a prediction loss or gain.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Or Sheffet

Linear regression is one of the most prevalent techniques in machine learning; however, it is also common to use linear regression for its explanatory capabilities rather than label prediction. Ordinary Least Squares (OLS) is often used in statistics to establish a correlation between an attribute (e.g. gender) and a label (e.g. income) in the presence of other (potentially correlated) features. OLS assumes a particular model that randomly generates the data, and derives t-values, representing the likelihood of each real value to be the true correlation. Using t-values, OLS can release a confidence interval, which is an interval on the reals that is likely to contain the true correlation; and when this interval does not intersect the origin, we can reject the null hypothesis, as it is likely that the true correlation is non-zero. Our work aims at achieving similar guarantees on data under differentially private estimators. First, we show that for well-spread data, the Gaussian Johnson-Lindenstrauss Transform (JLT) gives a very good approximation of t-values; secondly, when JLT approximates Ridge regression (linear regression with l2-regularization) we derive, under certain conditions, confidence intervals using the projected data; lastly, we derive, under different conditions, confidence intervals for the "Analyze Gauss" algorithm of Dwork et al. (STOC 2014).
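
The non-private baseline this abstract starts from, OLS t-values and the confidence interval built from them, can be sketched as follows. This is the textbook computation only and is not the paper's differentially private estimator. The function name `ols_t_values`, the synthetic data, and the use of the normal quantile 1.96 in place of the exact Student-t quantile are assumptions made for illustration.

```python
import numpy as np

def ols_t_values(X, y):
    """Hypothetical helper: OLS coefficients, standard errors and t-values."""
    n, d = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)            # unbiased noise-variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))     # standard error per coefficient
    return beta, se, beta / se

# Synthetic data (an assumption): feature 0 drives the label, feature 1 does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + rng.normal(size=200)

beta, se, t = ols_t_values(X, y)

# Approximate 95% intervals, using 1.96 as a large-sample stand-in for the t quantile.
lower, upper = beta - 1.96 * se, beta + 1.96 * se
print("t-values:", t)
print("intervals:", list(zip(lower, upper)))
```

For the informative feature the interval stays clear of zero, which is the situation in which the null hypothesis is rejected. The interval for the second feature is the case the test is meant to guard against.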


Author(s):  
Agbassou Guenoupkati ◽  
Adekunlé Akim Salami ◽  
Mawugno Koffi Kodjo ◽  
Kossi Napo

Time series forecasting in the energy sector is important to power utilities for decision making, to ensure the sustainability and quality of electricity supply and the stability of the power grid. Unfortunately, exogenous factors such as weather conditions and electricity prices complicate the task for linear regression models, which are becoming unsuitable. A robust predictor would be an invaluable asset for electricity companies. To overcome this difficulty, Artificial Intelligence departs from these prediction methods through Machine Learning algorithms, which have performed well over the last decades in predicting time series on several levels. This work proposes the deployment of three univariate Machine Learning models (Support Vector Regression, Multi-Layer Perceptron, and the Long Short-Term Memory Recurrent Neural Network) to predict the electricity production of the Benin Electricity Community. Performance metrics were used to validate these methods against the Autoregressive Integrated Moving Average and Multiple Regression models. Overall, the results show that the Machine Learning models outperform the linear regression methods. Consequently, Machine Learning methods offer a perspective for short-term forecasting of electric power generation from Benin Electricity Community sources.

