Privacy-Preserving Distributed Linear Regression on High-Dimensional Data

Abstract We propose privacy-preserving protocols for computing linear regression models, in the setting where the training dataset is vertically distributed among several parties. Our main contribution is a hybrid multi-party computation protocol that combines Yao’s garbled circuits with tailored protocols for computing inner products. Like many machine learning tasks, building a linear regression model involves solving a system of linear equations. We conduct a comprehensive evaluation and comparison of different techniques for securely performing this task, including a new Conjugate Gradient Descent (CGD) algorithm. This algorithm is suitable for secure computation because it uses an efficient fixed-point representation of real numbers while maintaining accuracy and convergence rates comparable to what can be obtained with a classical solution using floating point numbers. Our technique improves on Nikolaenko et al.’s method for privacy-preserving ridge regression (S&P 2013), and can be used as a building block in other analyses. We implement a complete system and demonstrate that our approach is highly scalable, solving data analysis problems with one million records and one hundred features in less than one hour of total running time.
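
The abstract above reduces regression to solving a system of linear equations and names conjugate gradient descent as one solver. The following is a minimal plaintext sketch of that idea only: it applies textbook conjugate gradient to the ridge-regression normal equations in ordinary floating point with NumPy. It does not reproduce the paper's fixed-point variant, garbled circuits, or inner-product protocols, and the function names, the regularization value, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for a symmetric positive definite A (textbook CG)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs_old = r @ r
    for _ in range(iters):
        if rs_old < tol:     # squared residual norm small enough
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def ridge_via_cg(X, y, lam=1.0):
    """Ridge regression: solve (X^T X + lam I) w = X^T y with CG."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    return conjugate_gradient(A, b)

# Hypothetical synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=200)

w_cg = ridge_via_cg(X, y, lam=0.1)
w_direct = np.linalg.solve(X.T @ X + 0.1 * np.eye(5), X.T @ y)
```

In a vertically partitioned setting, the columns of X would be held by different parties, so the cross-party blocks of X^T X are the inner products that a secure protocol would have to compute; the sketch simply forms them in the clear.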


Soybean Yield Estimation and Its Components: A Linear Regression Approach

Agriculture ◽

10.3390/agriculture10080348 ◽

2020 ◽

Vol 10 (8) ◽

pp. 348

Author(s):

Marcelo Chan Fu Wei ◽

José Paulo Molin

Keyword(s):

Linear Regression ◽

Mean Squared Error ◽

Limiting Factors ◽

Training Dataset ◽

Coefficient Of Determination ◽

Validation Dataset ◽

Learning Approaches ◽

Linear Regression Models ◽

Yield Estimation ◽

Soybean Yield

Soybean yield estimation is based either on yield monitors or on agro-meteorological and satellite imagery data, but both approaches present several limiting factors for on-farm decision-making. Given that machine learning approaches have been widely applied to estimate soybean yield, and that data on soybean yield and its components (number of grains (NG) and thousand grains weight (TGW)) are available, there is an opportunity to study their relationships. The objective was to explore the relationships between soybean yield and its components, generate equations to estimate yield, and evaluate their prediction accuracy. The training dataset was composed of soybean yield and its components' data from 2010 to 2019. Linear regression models based on NG, TGW and yield were fitted on the training dataset and applied to a validation dataset composed of 58 on-field collected samples. Globally, TGW and NG presented weak (r = 0.50) and strong (r = 0.92) linear relationships with yield, respectively. When the fitted models were applied to the validation dataset, the model based on NG presented the highest accuracy, with a coefficient of determination (R²) of 0.70, a mean absolute error (MAE) of 639.99 kg ha⁻¹ and a root mean squared error (RMSE) of 726.67 kg ha⁻¹.
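
The workflow described above (fit a linear model on a training set, then score it on a validation set with R², MAE and RMSE) can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names are hypothetical, the numbers are made up and are not the study's data, and R² is computed here as 1 − SS_res/SS_tot, which may differ from the definition used in the paper.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def evaluate(y_true, y_pred):
    """Return R^2, MAE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mae = np.mean(np.abs(resid))
    rmse = np.sqrt(np.mean(resid ** 2))
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"r2": 1.0 - ss_res / ss_tot, "mae": mae, "rmse": rmse}

# Made-up training data: predictor (e.g. number of grains) vs. yield.
rng = np.random.default_rng(1)
x_train = rng.uniform(1000, 4000, size=100)
y_train = 1.2 * x_train + 300 + rng.normal(0, 200, size=100)

# Made-up validation data scored with the fitted model.
x_val = rng.uniform(1000, 4000, size=58)
y_val = 1.2 * x_val + 300 + rng.normal(0, 200, size=58)

slope, intercept = fit_simple_linear(x_train, y_train)
metrics = evaluate(y_val, slope * x_val + intercept)
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE, which is consistent with the reported pair of values (726.67 versus 639.99 kg ha⁻¹).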


An adaptive shortest-solution guided decimation approach to sparse high-dimensional linear regression

10.21203/rs.3.rs-598251/v1 ◽

2021 ◽

Author(s):

Xue Yu ◽

Yifan Sun ◽

Hai-Jun Zhou

Keyword(s):

Linear Regression ◽

Message Passing ◽

Greedy Algorithms ◽

Linear Equations ◽

Regression Coefficients ◽

High Dimensional ◽

Linear Regression Models ◽

Approximate Message Passing ◽

Highly Correlated ◽

Solution Accuracy

Abstract The high-dimensional linear regression model is the most popular statistical model for high-dimensional data, but obtaining a sparse set of regression coefficients is a challenging task. In this paper, we propose a simple heuristic algorithm to construct sparse high-dimensional linear regression models, which is adapted from the shortest-solution guided decimation algorithm and is referred to as ASSD. This algorithm constructs the support of regression coefficients under the guidance of the least-squares solution of the recursively decimated linear equations, and it applies an early-stopping criterion and a second-stage thresholding procedure to refine this support. Our extensive numerical results demonstrate that ASSD outperforms LASSO, vector approximate message passing, and two other representative greedy algorithms in solution accuracy and robustness. ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications.


CONVERGENCE RATES OF ESTIMATORS IN PARTIAL LINEAR REGRESSION MODELS WITH MA(∞) ERROR PROCESS

Communications in Statistics - Theory and Methods ◽

10.1081/sta-120017224 ◽

2002 ◽

Vol 31 (12) ◽

pp. 2251-2273 ◽

Cited By ~ 14

Author(s):

Xiaoqian Sun ◽

Jinhong You ◽

Gemai Chen ◽

Xian Zhou

Keyword(s):

Linear Regression ◽

Regression Models ◽

Convergence Rates ◽

Linear Regression Models ◽

Error Process ◽

Partial Linear


Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure

International Journal for Population Data Science ◽

10.23889/ijpds.v3i4.638 ◽

2018 ◽

Vol 3 (4) ◽

Author(s):

Toan Ong ◽

Ibrahim Lazrig ◽

Indrajit Ray ◽

Indrakshi Ray ◽

Michael Kahn

Keyword(s):

Parallel Processing ◽

Record Linkage ◽

High Capacity ◽

Privacy Preserving ◽

Secure Computation ◽

Third Party ◽

Major Drawback ◽

Garbled Circuits ◽

Chunk Size ◽

Synthetic Datasets

Introduction: Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage, but BFs can be compromised. Yao's garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of using BFs and GCs together is poor efficiency. Objectives and Approach: We evaluated the feasibility of BFs+GCs using high-capacity compute engines and a novel parallel processing framework implemented in Google Cloud Compute Engines (GCCE). In Yao's two-party secure computation protocol, one party serves as the generator and the other party serves as the evaluator. To link data in parallel, records from both parties are divided into chunks. Linkage between every two chunks in the same block is processed by a thread. The number of threads for linkage depends on the available computing resources. We tested the parallelized process in various scenarios with variations in hardware and software configurations. Results: Two synthetic datasets with 10K records were linked using BFs+GCs on 12 different software and hardware configurations, which varied by number of CPU cores (4 to 32), memory size (15 GB to 28.8 GB), number of threads (6 to 41), and chunk size (50 to 200 records). The minimum configuration (4 cores; 15 GB memory) took 8,062.4 s to complete, whereas the maximum configuration (32 cores; 28.8 GB memory) took 1,454.1 s. Increasing the number of threads or changing the chunk size without providing more CPU cores and memory did not improve efficiency. Efficiency improved on average by 39.81% when the number of cores and the memory on both sides were doubled. CPU utilization is maximized (near 100% on both sides) when the computing power of the generator is double that of the evaluator. Conclusion/Implications: The PPRL runtime of BFs+GCs was greatly improved using parallel processing in a cloud-based infrastructure. A cluster of GCCEs could be leveraged to reduce the runtime of data linkage operations even further. Scalable cloud-based infrastructures can overcome the trade-off between security and efficiency, allowing computationally complex methods to be implemented.
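
To make the building blocks in this abstract concrete, here is a plaintext sketch of Bloom-filter encoding, a similarity score between two filters, and the chunking of records into pairwise work units. It is an illustration under assumptions, not the authors' system: it computes the similarity in the clear rather than inside a garbled circuit, the Dice coefficient and character bigrams are assumed choices that the abstract does not specify, and the filter length `m`, hash count `k`, and all function names are hypothetical.

```python
import hashlib
from itertools import product

def bigrams(value):
    """Character bigrams of a padded, lower-cased string."""
    s = f"_{value.lower()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value, m=1024, k=4):
    """Return the set of bit positions set in an m-bit Bloom filter."""
    bits = set()
    for gram in bigrams(value):
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits

def dice(a, b):
    """Dice coefficient between two Bloom filters given as bit-position sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def chunks(records, size):
    """Split a list of records into consecutive chunks of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]

# Every pair of chunks (one from each party) is an independent work unit
# that could be handed to its own thread.
party_a = ["smith", "jones", "taylor", "brown"]
party_b = ["smyth", "jonas", "tailor", "braun"]
work_units = list(product(chunks(party_a, 2), chunks(party_b, 2)))

sim_close = dice(bloom_encode("smith"), bloom_encode("smyth"))
sim_far = dice(bloom_encode("smith"), bloom_encode("jones"))
```

Splitting both inputs into chunks yields a grid of chunk pairs, so the number of independent work units grows with the product of the chunk counts, which is why the abstract ties the useful number of threads to the available cores and memory.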


FAKTOR DETERMINAN PENYALURAN KREDIT BANK PERSERO

Journal of Business Economics ◽

10.35760/eb.2018.v23i1.1812 ◽

2018 ◽

Vol 23 (1) ◽

pp. 60-71

Author(s):

Wigiyanti Masodah

Keyword(s):

Interest Rate ◽

Linear Regression ◽

Interest Rates ◽

Regression Models ◽

Linear Regression Models ◽

Negative Impacts ◽

Main Activity ◽

The Impact ◽

The Given ◽

Multiple Linear Regression Models

Offering credit is the main activity of a bank. A bank weighs several considerations when it offers credit, including interest rates, inflation, and non-performing loans (NPL). This study aims to determine the impact of interest rates, inflation and NPL on credit disbursed. The objects of this study are state-owned banks. The analysis uses multiple linear regression models. The results show that interest rates and NPL had a negative impact on credit disbursed, whereas inflation did not have a significant effect. Keywords: Interest Rate, Inflation, NPL, offered Credit.


A Note on the Estimation of Linear Regression Models with Heteroskedastic Measurement Errors

SSRN Electronic Journal ◽

10.2139/ssrn.295567 ◽

2002 ◽

Cited By ~ 2

Author(s):

Daniel G. Sullivan

Keyword(s):

Linear Regression ◽

Measurement Errors ◽

Regression Models ◽

Linear Regression Models


Linear Regression Models for Interval-Valued Data using Log-transformation

Anais do 14. Congresso Brasileiro de Inteligência Computacional ◽

10.21528/cbic2019-3 ◽

2020 ◽

Author(s):

Nykolas Mayko Maia Barbosa ◽

João Paulo Pordeus Gomes ◽

César Lincoln Cavalcante Mattos ◽

Diêgo Farias Oliveira

Keyword(s):

Linear Regression ◽

Regression Models ◽

Linear Regression Models ◽

Log Transformation ◽

Interval Valued


Folate Nutritional Status among Psoriasis Patients not Exposed to Antifolate Drug

Current Nutrition & Food Science ◽

10.2174/1573401314666180702100301 ◽

2020 ◽

Vol 16 (4) ◽

pp. 543-553

Author(s):

Luciana Y. Tomita ◽

Andréia C. da Costa ◽

Solange Andreoni ◽

Luiza K.M. Oyafuso ◽

Vânia D’Almeida ◽

...

Keyword(s):

Folic Acid ◽

Linear Regression ◽

Multiple Linear Regression Model ◽

Folate Intake ◽

Serum Folate ◽

Linear Regression Models ◽

Cross Sectional ◽

Folic Acid Fortification ◽

Plasma Folate ◽

Psoriasis Severity

Background: Folic acid fortification programs have been established to prevent neural tube defects. However, concern has been raised about patients using anti-folate drugs, such as patients with psoriasis, a common, chronic, autoimmune inflammatory skin disease associated with obesity and smoking. Objective: To investigate dietary and circulating folate, vitamin B12 (B12) and homocysteine (hcy) in psoriatic subjects exposed to the national mandatory folic acid fortification program. Methods: Cross-sectional study using the Food Frequency Questionnaire, plasma folate, B12 and hcy, and psoriasis severity assessed with the Psoriasis Area and Severity Index score. Medians, interquartile ranges (IQRs) and linear regression models were used to investigate factors associated with plasma folate, B12 and hcy. Results: 82 (73%) participants had mild psoriasis, 18 (16%) moderate and 12 (11%) severe psoriasis; 58% were female, 61% non-white, 31% former smokers, and 20% current smokers. Median age (IQR) was 51 (40, 60) years. Only 32% reached the Estimated Average Requirement of folate intake. Folate and B12 deficiencies were observed in 9% and 6% of the blood samples, respectively, but hyperhomocysteinaemia in 21%. Severity of psoriasis was negatively correlated with folate and B12 concentrations. In a multiple linear regression model, folate intake was a positive predictor, accounting for 14% of serum folate, while psoriasis severity, smoking habits and saturated fatty acid were negative predictors, explaining 29% of circulating folate. Conclusion: Only one third reached the recommended dietary folate intake, but deficiencies of folate and B12 were uncommon. Psoriasis severity was negatively correlated with circulating folate and B12. Stopping smoking and a folate-rich diet may be important targets for managing psoriasis.


THE PREDICTIVE CONTENT OF DISAGGREGATED NORMAL INCOME: An Empirical Study in the JSX

Gadjah Mada International Journal of Business ◽

10.22146/gamaijb.5633 ◽

2003 ◽

Vol 5 (3) ◽

pp. 363 ◽

Cited By ~ 1

Author(s):

Slamet Sugiri

Keyword(s):

Linear Regression ◽

Regression Models ◽

Stock Exchange ◽

Manufacturing Firms ◽

Single Step ◽

Model Parameters ◽

Linear Regression Models ◽

Operating Income ◽

Predictive Content ◽

Multiple Step

The main objective of this study is to examine the hypothesis that normal income disaggregated into operating income and nonoperating income has greater predictive content than aggregated normal income in predicting future cash flow. To test the hypothesis, linear regression models are developed. The model parameters are estimated based on fifty-five manufacturing firms listed on the Jakarta Stock Exchange (JSX) up to the end of 1997. The study finds that the empirical evidence supports the hypothesis. This evidence supports arguments that, in reporting income from continuing operations, the multiple-step approach is preferred to the single-step one.


Análise da relação entre O3, NO e NO2 usando técnicas de regressão linear múltipla (Analysis of the Relationship between O3, NO and NO2 Using Multiple Linear Regression Techniques)

GEOgraphia ◽

10.22409/geographia.v20i43.1065 ◽

2018 ◽

Vol 20 (43) ◽

pp. 124

Author(s):

Amaury De Souza ◽

Priscilla V Ikefuti ◽

Ana Paula Garcia ◽

Debora A.S Santos ◽

Soetania Oliveira

Keyword(s):

Time Series ◽

Linear Regression ◽

Nitrogen Dioxide ◽

Multiple Linear Regression ◽

Quality Parameters ◽

Environmental Pollutant ◽

Environmental Research ◽

Linear Regression Models ◽

Mato Grosso ◽

The Impact

Abstract: Analysis and prediction of air quality parameters are important topics of current atmospheric and environmental research because of the impact of air pollution on human health. This study examines the transformation of nitrogen dioxide (NO2) into ozone (O3) in the urban environment, using the time series diagram. Environmental pollutant concentrations and meteorological variables were used to predict the O3 concentration in the atmosphere. The use of multiple linear regression models was tested as a tool to predict O3 concentration. The results indicate that the temperature value and the presence of NO2 influence the O3 concentration in Campo Grande, capital of the State of Mato Grosso do Sul. Keywords: Ozone. Nitrogen dioxide. Time series. Regressions.
