Regional regression models of percentile flows for the contiguous US: Expert versus data-driven independent variable selection

2016 ◽  
Author(s):  
Geoffrey Fouad ◽  
André Skupin ◽  
Christina L. Tague

Abstract. Percentile flows are statistics derived from the flow duration curve (FDC) that describe the flow equaled or exceeded for a given percent of time. These statistics provide important information for managing rivers, but are often unavailable since most basins are ungauged. A common approach for predicting percentile flows is to deploy regional regression models based on gauged percentile flows and related independent variables derived from physical and climatic data. The first step of this process identifies groups of basins through a cluster analysis of the independent variables, followed by the development of a regression model for each group. This entire process hinges on the independent variables selected to summarize the physical and climatic state of basins. Distributed physical and climatic datasets now exist for the contiguous United States (US). However, it remains unclear how to best represent these data for the development of regional regression models. The study presented here developed regional regression models for the contiguous US, and evaluated the effect of different approaches for selecting the initial set of independent variables on the predictive performance of the regional regression models. An expert assessment of the dominant controls on the FDC was used to identify a small set of independent variables likely related to percentile flows. A data-driven approach was also applied to evaluate two larger sets of variables that consist of either (1) the averages of data for each basin or (2) both the averages and statistical distribution of basin data distributed in space and time. The small set of variables from the expert assessment of the FDC and two larger sets of variables for the data-driven approach were each applied for a regional regression procedure. Differences in predictive performance were evaluated using 184 validation basins withheld from regression model development. 
The small set of independent variables selected through expert assessment produced similar, if not better, performance than the two larger sets of variables. This parsimonious set consisted only of mean annual precipitation, potential evapotranspiration, and baseflow index. The additional variables in the two larger sets added little to no predictive information. Regional regression models based on the parsimonious set of variables were developed using 734 calibration basins, and were converted into a tool for predicting 13 percentile flows in the contiguous US. Supplementary Material for this paper includes an R graphical user interface for predicting the percentile flows of basins within the range of conditions used to calibrate the regression models. The equations and performance statistics of the models are also supplied in tabular form.
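As a rough illustration of the percentile flow definition in the abstract above, the following Python sketch reads a percentile flow off an empirical flow duration curve. The function name `percentile_flow`, the synthetic flow record, and the rank / n exceedance convention are assumptions made for illustration; the paper's own procedure and its R tool may use a different plotting position.

```python
import math


def percentile_flow(flows, exceedance_pct):
    """Return the flow equaled or exceeded `exceedance_pct` percent of the time.

    Assumption: the exceedance probability of the r-th largest flow is r / n.
    """
    if not flows:
        raise ValueError("flows must not be empty")
    if not 0 < exceedance_pct <= 100:
        raise ValueError("exceedance_pct must be in (0, 100]")
    ranked = sorted(flows, reverse=True)  # largest flow first
    n = len(ranked)
    # Smallest rank r such that r / n reaches the requested exceedance fraction.
    rank = max(1, math.ceil(exceedance_pct * n / 100))
    return ranked[rank - 1]


# Hypothetical record: 100 daily flows of 1, 2, ..., 100 (arbitrary units).
daily_flows = list(range(1, 101))
q10 = percentile_flow(daily_flows, 10)  # high flow, exceeded 10% of the time
q90 = percentile_flow(daily_flows, 90)  # low flow, exceeded 90% of the time
```

In a regional regression setting, values such as `q10` and `q90` computed at gauged basins would serve as the dependent variables that the basin-level predictors are regressed against.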

Author(s):  
Imran Shah ◽  
Tia Tate ◽  
Grace Patlewicz

Abstract. Motivation: Generalized Read-Across (GenRA) is a data-driven approach to estimate physico-chemical, biological or eco-toxicological properties of chemicals by inference from analogues. GenRA attempts to mimic a human expert’s manual read-across reasoning for filling data gaps about new chemicals from known chemicals with an interpretable and automated approach based on nearest neighbors. A key objective of GenRA is to systematically explore different choices of input data selection and neighborhood definition in order to objectively evaluate the predictive performance of automated read-across estimates of chemical properties. Results: We have implemented genra-py as a Python package that can be freely used for chemical safety analysis and risk assessment applications. Automated read-across prediction in genra-py conforms to the scikit-learn machine learning library's estimator design pattern, making it easy to use and integrate in computational pipelines. We demonstrate the data-driven application of genra-py to two key human health risk assessment problems: hazard identification and point of departure estimation. Availability and implementation: The package is available from github.com/i-shah/genra-py.
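The nearest-neighbor idea described in the abstract above can be sketched in a few lines of plain Python. This is not the genra-py API: the function names `jaccard` and `read_across`, the use of Jaccard similarity on set-valued fingerprints, and the similarity-weighted average are all assumptions chosen to illustrate one plausible form of analogue-based estimation.

```python
def jaccard(a, b):
    """Jaccard similarity between two fingerprints given as sets of feature ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def read_across(target_fp, analogues, k=2):
    """Estimate a property of the target from its k most similar analogues.

    `analogues` is a list of (fingerprint, known_value) pairs. The estimate is
    the similarity-weighted mean of the known values (an assumed scheme).
    """
    scored = sorted(
        ((jaccard(target_fp, fp), value) for fp, value in analogues),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_similarity = sum(sim for sim, _ in scored)
    if total_similarity == 0:
        raise ValueError("no analogue shares any feature with the target")
    return sum(sim * value for sim, value in scored) / total_similarity


# Hypothetical chemicals: fingerprints are sets of made-up feature ids.
known = [
    ({1, 2, 3, 4}, 1.0),  # identical to the target, active
    ({1, 2}, 0.0),        # shares half of the target's features, inactive
    ({9}, 1.0),           # unrelated chemical
]
estimate = read_across({1, 2, 3, 4}, known, k=2)
```

Varying `k` and the fingerprint type is the kind of neighborhood and input-data choice the abstract says GenRA is meant to explore systematically.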


2021 ◽  
Author(s):  
Hyemin Han

Research has examined the association between people’s compliance with measures to prevent the spread of COVID-19 and personality traits. However, previous studies were conducted with relatively small datasets and employed frequentist analyses that do not allow data-driven model exploration. To address these limitations, a large-scale international dataset, the COVIDiSTRESS Global Survey dataset, was explored with Bayesian generalized linear models, which enable identification of the best regression model. The best regression models predicting participants’ compliance from Big Five traits were explored. The findings demonstrated that, first, all Big Five traits except extroversion were positively associated with compliance with general measures and distancing. Second, neuroticism, extroversion, and agreeableness were positively associated with the perceived cost of complying with the measures, while conscientiousness showed a negative association. The findings and implications of the present study are discussed.


2019 ◽  
Vol 31 (1) ◽  
Author(s):  
Vyacheslav Lyubchich ◽  
Tatiana V. Lebedeva ◽  
Jeremy M. Testa

2020 ◽  
Vol 9 (3) ◽  
pp. 678-695
Author(s):  
Zuhur Alatawi

A business committed to corporate social responsibility (CSR) activities can establish a favourable reputation in the market; this reputation can then be used to mislead market participants into relying on the organisation's financial reporting. This study investigated the relationship between CSR and earnings quality for firms listed on the FTSE 350. It also explored the impact of CSR on management's motivation to improve earnings quality or to manage earnings. LSDV and OLS regressions were applied to data collected from 217 firms listed on the FTSE 350. The regression models used earnings quality as the dependent variable and CSR, SIZE, GROWTH, LEVERAGE and ROA as independent variables. Correlation coefficients were also calculated, but they could not reveal the nature of the relationship between the variables, so regression models were applied. The results revealed no relationship between earnings quality and CSR under the LSDV regression model. The same was observed for the OLS model; however, there was a relatively significant relationship between earnings quality and LEVERAGE. Similar findings were recorded for earnings quality and GROWTH.


Depression is a leading cause of mental illness and results in significant impairment of daily life. It is also observed to be a significant cause of suicidal ideation. Music varies the intensity of emotional experience by engaging neurotransmitters and brain anatomy, including the brain’s dopaminergic projections. The popularity of regression models for data analysis in both research and industry has driven the development of an array of prediction models. A regression model relies on independent variables to predict the dependent variable. This paper outlines the development of a regression model that estimates a person's depression score from the music they listen to. The model predicts the depression score from data obtained from a wide range of individuals and the genres of music they have listened to. We generate a report based on the depression score, which a doctor can then use to decide on the necessary treatment for the patient. In our research, we obtained variance and R² scores of over 0.95.


Author(s):  
Samuel Oladimeji Sowol ◽  
Abdullahi Adinoyi Ibrahim ◽  
Daouda Sangare ◽  
Ismaila Omeiza Ibrahim ◽  
Francis I. Johnson

In response to the global COVID-19 pandemic, this work aims to understand the early time evolution and spread of the disease outbreak with a data-driven approach. To this end, we applied the Susceptible-Infective-Recovered/Removed (SIR) epidemiological model to the disease. Additionally, we used a machine learning linear regression model on historical COVID-19 data to predict the early stage of the disease. Both the mathematical SIR model and the machine learning regression model for time series forecasting of the COVID-19 data, without and with lags and trends, were able to capture the early spread of the disease. Consequently, we suggest that more advanced epidemiological models and more sophisticated machine learning regression models, applied to the COVID-19 data, could help to understand and predict the long-time evolution of the disease spread.
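For readers unfamiliar with the SIR model mentioned above, here is a minimal Python sketch of its standard equations, dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI, integrated with a simple forward Euler step. The function name `simulate_sir`, the parameter values, and the choice of integrator are assumptions for illustration and are not taken from the paper.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler; return daily (S, I, R)."""
    n = s0 + i0 + r0  # total population, constant in the basic SIR model
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical outbreak: 1000 people, 10 initially infective,
# transmission rate 0.3 per day, recovery rate 0.1 per day.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_infective = max(i for _, i, _ in trajectory)
```

Fitting `beta` and `gamma` to reported case counts, rather than fixing them as here, is what would connect such a simulation to observed early-outbreak data.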


Author(s):  
Miguel Angel Luque-Fernandez ◽  
Daniel Redondo-Sánchez ◽  
Camille Maringe

Receiver operating characteristic (ROC) analysis is used for comparing predictive models in both model selection and model evaluation. ROC analysis is often applied in clinical medicine and social science to assess the tradeoff between model sensitivity and specificity. After fitting a binary logistic or probit regression model with a set of independent variables, the predictive performance of this set of variables can be assessed by the area under the curve (AUC) from an ROC curve. An important aspect of predictive modeling (regardless of model type) is the ability of a model to generalize to new cases. Evaluating the predictive performance (AUC) of a set of independent variables using all cases from the original analysis sample often results in an overly optimistic estimate of predictive performance. One can use K-fold cross-validation to generate a more realistic estimate of predictive performance in situations with a small number of observations. AUC is estimated iteratively for k samples (the “test” samples) that are independent of the sample used to predict the dependent variable (the “training” sample). cvauroc implements k-fold cross-validation for the AUC for a binary outcome after fitting a logit or probit regression model, averaging the AUCs corresponding to each fold, and bootstrapping the cross-validated AUC to obtain statistical inference and 95% confidence intervals. Furthermore, cvauroc optionally provides the cross-validated fitted probabilities for the dependent variable or outcome, contained in a new variable named _fit; the sensitivity and specificity for each of the levels of the predicted outcome, contained in two new variables named _sen and _spe; and the plot of the mean cross-validated AUC and k-fold ROC curves.
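The cross-validated AUC described above can be sketched in plain Python. This is not the cvauroc command itself: the function names, the single-predictor logistic regression fitted by gradient descent, and the stratified fold assignment are assumptions for illustration, and the bootstrap confidence interval is omitted.

```python
import math
import random


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit_logit(xs, ys, lr=0.1, epochs=300):
    """One-predictor logistic regression fitted by batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def cv_auc(xs, ys, k=5, seed=0):
    """Mean AUC over k stratified folds, each scored by a model fitted on the rest."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(ys) if y == 1]
    neg = [i for i, y in enumerate(ys) if y == 0]
    if len(pos) < k or len(neg) < k:
        raise ValueError("need at least k cases of each class")
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [pos[f::k] + neg[f::k] for f in range(k)]
    fold_aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        w, b = fit_logit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores = [w * xs[i] + b for i in test_idx]  # linear predictor ranks cases
        fold_aucs.append(auc([ys[i] for i in test_idx], scores))
    return sum(fold_aucs) / k, fold_aucs


# Hypothetical, perfectly separable data: negatives below -1, positives above 1.
xs = [-1.0 - 0.05 * j for j in range(10)] + [1.0 + 0.05 * j for j in range(10)]
ys = [0] * 10 + [1] * 10
mean_auc, fold_aucs = cv_auc(xs, ys, k=5)
```

Because each fold is scored by a model that never saw its cases, the averaged AUC avoids the optimism of evaluating on the original analysis sample.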


Energies ◽  
2020 ◽  
Vol 13 (24) ◽  
pp. 6654
Author(s):  
Stefano Villa ◽  
Claudio Sassanelli

Buildings are among the main contributors to the world’s growing energy consumption, accounting for up to 45%. Considerable effort has been directed at improving energy savings and reducing environmental impacts in order to meet the objectives set by policymakers in recent years. Meanwhile, new approaches using Machine Learning regression models have emerged in modeling and simulation research. This research develops and proposes an innovative data-driven black box predictive model for dynamically estimating the interior temperature of a building. The approach was selected in two steps. First, the extant literature on candidate methods was reviewed, narrowing the field of investigation to non-recursive multi-step approaches. Second, the results obtained on a pilot case using various Machine Learning regression models in the multi-step approach were assessed, leading to the choice of the Support Vector Regression model. The prediction mean absolute error on the pilot case is 0.1 ± 0.2 °C when the offset from the prediction instant is 15 min and grows slowly for later instants, up to 0.3 ± 0.8 °C for a prediction horizon of 8 h. Finally, the advantages and limitations of the new data-driven multi-step approach based on the Support Vector Regression model are discussed. Because it relies only on data related to external weather, interior temperature and calendar, the proposed approach is potentially applicable to any type of building, without requiring specific geometrical or physical characteristics as input.

