A DATA-DRIVEN NONPARAMETRIC SPECIFICATION TEST FOR DYNAMIC REGRESSION MODELS

Accurate estimation of the degree of battery aging is essential to ensure safe operation of electric vehicles. In this paper, using real-world vehicles and their operational data, a battery aging estimation method is proposed based on a dual-polarization equivalent circuit (DPEC) model and multiple data-driven models. The DPEC model and the forgetting factor recursive least-squares method are used to determine the battery system’s ohmic internal resistance, with outliers being filtered using boxplots. Furthermore, eight common data-driven models are used to describe the relationship between battery degradation and the factors influencing this degradation, and these models are analyzed and compared in terms of both estimation accuracy and computational requirements. The results show that the gradient descent tree regression, XGBoost regression, and light GBM regression models are more accurate than the other methods, with root mean square errors of less than 6.9 mΩ. The AdaBoost and random forest regression models are regarded as alternative groups because of their relative instability. The linear regression, support vector machine regression, and k-nearest neighbor regression models are not recommended because of poor accuracy or excessively high computational requirements. This work can serve as a reference for subsequent battery degradation studies based on real-time operational data.
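
The two processing steps named in this abstract, recursive least squares with a forgetting factor and boxplot-based outlier filtering, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a simplified terminal-voltage model v = ocv - r0 * i instead of the full DPEC model, and the function names (`ffrls`, `boxplot_filter`), the forgetting factor of 0.98, and the initial covariance scale are hypothetical choices.

```python
import statistics


def ffrls(samples, lam=0.98, delta=1000.0):
    """Forgetting-factor recursive least squares for y = theta . phi.

    samples: iterable of (phi, y) pairs, phi being a list of regressors.
    lam:     forgetting factor in (0, 1]; smaller values forget faster (assumed value).
    delta:   scale of the initial covariance matrix (assumed value).
    """
    samples = list(samples)
    n = len(samples[0][0])
    theta = [0.0] * n
    P = [[delta if r == c else 0.0 for c in range(n)] for r in range(n)]
    for phi, y in samples:
        # P is symmetric, so P @ phi also serves as (phi^T @ P)^T below.
        p_phi = [sum(P[r][c] * phi[c] for c in range(n)) for r in range(n)]
        denom = lam + sum(phi[r] * p_phi[r] for r in range(n))
        gain = [v / denom for v in p_phi]
        err = y - sum(theta[r] * phi[r] for r in range(n))
        theta = [theta[r] + gain[r] * err for r in range(n)]
        P = [[(P[r][c] - gain[r] * p_phi[c]) / lam for c in range(n)]
             for r in range(n)]
    return theta


def boxplot_filter(values, whisker=1.5):
    """Keep values inside the boxplot whiskers [Q1 - w*IQR, Q3 + w*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    # Synthetic, noiseless data under the assumed model v = ocv - r0 * i.
    ocv_true, r0_true = 3.7, 0.05
    currents = [float(1 + k % 5) for k in range(50)]
    data = [([1.0, -i], ocv_true - r0_true * i) for i in currents]
    ocv_est, r0_est = ffrls(data)
    print("estimated ocv, r0:", ocv_est, r0_est)
    # Resistance estimates in milliohms with one obvious outlier.
    print(boxplot_filter([50, 51, 52, 49, 50, 120]))
```

A forgetting factor below 1 discounts older samples, which is what lets the estimate track a slowly drifting resistance; the whisker multiplier of 1.5 is the conventional boxplot default.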

Specification Test for Poisson Regression Models

International Economic Review ◽

10.2307/2526689 ◽

1986 ◽

Vol 27 (3) ◽

pp. 689 ◽

Cited By ~ 57

Author(s):

Lung-Fei Lee

Keyword(s):

Poisson Regression ◽

Regression Models ◽

Specification Test

NeMoR: a New Method Based on Data-Driven for Neonatal Mortality Rate Forecasting

10.1101/2021.04.22.21255916 ◽

2021 ◽

Author(s):

Carlos Eduardo Beluzo ◽

Luciana Correia Alves ◽

Natália Martins Arruda ◽

Cátia Sepetauskas ◽

Everton Silva ◽

...

Keyword(s):

Public Health ◽

Machine Learning ◽

Neonatal Mortality ◽

Regression Models ◽

Mortality Rates ◽

Data Driven ◽

Health Policies ◽

Neonatal Mortality Rate ◽

Policy Makers ◽

Public Health Policies

Abstract. Reduction in child mortality is one of the United Nations Sustainable Development Goals for 2030. In Brazil, despite the reduction in child mortality over recent decades, neonatal mortality remains a persistent problem and is associated with the quality of prenatal and childbirth care and with social-environmental factors. In a proper health system, the effect of some of these factors could be minimized by an appropriate number of newborn intensive care units, health care units, and neonatal incubators, and even by the level of instruction of mothers, which can lead to proper care throughout the prenatal period. With the intent of providing knowledge resources for planning public health policies focused on neonatal mortality reduction, we propose a new data-driven machine learning method for neonatal mortality rate (NMR) forecasting called NeMoR, which predicts neonatal mortality rates 4 months ahead using NeoDeathForecast, a monthly time series dataset composed of these factors and of the neonatal mortality rate history (2006-2016), with 57,816 samples covering all 438 Brazilian administrative health regions. To build the model, Extra-Tree, XGBoost Regressor, Gradient Boosting Regressor, and Lasso machine learning regression models were evaluated, and a hyperparameter search was performed as a fine-tuning step. The method was validated using São Paulo city data, mainly because of data quality. In the best configuration, the method predicted neonatal mortality rates with a Mean Square Error lower than 0.18. In addition, the forecasts give policy makers a way to anticipate trends in neonatal mortality rate curves, an important resource for planning public health policies.
Highlights: a new data-driven approach for neonatal mortality rate forecasting, which gives policy makers a way to anticipate trends in neonatal mortality rate curves and so makes better planning of health policies focused on NMR reduction possible; a method for NMR forecasting with an MSE lower than 0.18; an extensive evaluation of different machine learning (ML) regression models, together with a hyperparameter search, which forms the last stage of NeMoR; a new time series database for NMR prediction problems; a new feature projection space for NMR forecasting problems, which considerably reduces errors in NMR prediction.

Regional regression models of percentile flows for the contiguous US: Expert versus data-driven independent variable selection

10.5194/hess-2016-639 ◽

2016 ◽

Author(s):

Geoffrey Fouad ◽

André Skupin ◽

Christina L. Tague

Keyword(s):

Regression Model ◽

Regression Models ◽

Predictive Performance ◽

Data Driven ◽

Mean Annual Precipitation ◽

Expert Assessment ◽

Independent Variables ◽

Regional Regression ◽

Data Driven Approach ◽

Small Set

Abstract. Percentile flows are statistics derived from the flow duration curve (FDC) that describe the flow equaled or exceeded for a given percent of time. These statistics provide important information for managing rivers, but are often unavailable since most basins are ungauged. A common approach for predicting percentile flows is to deploy regional regression models based on gauged percentile flows and related independent variables derived from physical and climatic data. The first step of this process identifies groups of basins through a cluster analysis of the independent variables, followed by the development of a regression model for each group. This entire process hinges on the independent variables selected to summarize the physical and climatic state of basins. Distributed physical and climatic datasets now exist for the contiguous United States (US). However, it remains unclear how to best represent these data for the development of regional regression models. The study presented here developed regional regression models for the contiguous US, and evaluated the effect of different approaches for selecting the initial set of independent variables on the predictive performance of the regional regression models. An expert assessment of the dominant controls on the FDC was used to identify a small set of independent variables likely related to percentile flows. A data-driven approach was also applied to evaluate two larger sets of variables that consist of either (1) the averages of data for each basin or (2) both the averages and statistical distribution of basin data distributed in space and time. The small set of variables from the expert assessment of the FDC and two larger sets of variables for the data-driven approach were each applied for a regional regression procedure. Differences in predictive performance were evaluated using 184 validation basins withheld from regression model development. 
The small set of independent variables selected through expert assessment produced similar, if not better, performance than the two larger sets of variables. A parsimonious set of variables only consisted of mean annual precipitation, potential evapotranspiration, and baseflow index. Additional variables in the two larger sets of variables added little to no predictive information. Regional regression models based on the parsimonious set of variables were developed using 734 calibration basins, and were converted into a tool for predicting 13 percentile flows in the contiguous US. Supplementary Material for this paper includes an R graphical user interface for predicting the percentile flows of basins within the range of conditions used to calibrate the regression models. The equations and performance statistics of the models are also supplied in tabular form.
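
The definition of a percentile flow given in this abstract (the flow equaled or exceeded for a given percent of time) can be made concrete with a short sketch. It assumes a plain empirical estimator on a gauged record with no interpolation or plotting-position formula; the abstract does not say which estimator the study used, and the function name `percentile_flow` is hypothetical.

```python
import math


def percentile_flow(flows, percent):
    """Empirical percentile flow from a flow duration curve.

    Returns the largest observed flow that is equaled or exceeded
    for at least `percent` percent of the record.
    """
    if not flows:
        raise ValueError("flows must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # The flow duration curve is the record sorted from highest to lowest flow.
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    # ranked[k] is equaled or exceeded by at least k + 1 of the n observations,
    # so take the first k with (k + 1) / n >= percent / 100.
    k = max(math.ceil(percent * n / 100) - 1, 0)
    return ranked[k]


if __name__ == "__main__":
    daily_flows = [3, 7, 1, 10, 5, 2, 9, 4, 8, 6]  # made-up values
    for p in (10, 50, 90):
        print(f"Q{p}:", percentile_flow(daily_flows, p))
```

Under this convention a low percentage such as Q10 is a high flow and a high percentage such as Q90 is a low flow, which is why percentile flows serve as both high-flow and low-flow statistics.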

Remaining Useful Life Prediction of the Concrete Piston Based on Probability Statistics and Data Driven

Applied Sciences ◽

10.3390/app11188482 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8482

Author(s):

Jie Li ◽

Yuejin Tan ◽

Bingfeng Ge ◽

Hua Zhao ◽

Xin Lu

Keyword(s):

Inventory Management ◽

Regression Models ◽

Life Prediction ◽

Remaining Useful Life ◽

Data Driven ◽

Distribution Fitting ◽

Useful Life ◽

Concrete Pump Truck ◽

Concrete Pump ◽

Actual Life

This paper proposes a method for predicting the remaining useful life (RUL) of the concrete piston of a concrete pump truck based on probability statistics and data-driven approaches. First, the average useful life of the concrete piston is determined by probability distribution fitting using actual life data. Second, based on condition monitoring data from the concrete pump truck, a life coefficient of the concrete piston is proposed to represent the influence of the loading condition on the actual useful life of individual concrete pistons, and different regression models are established to predict the RUL of the concrete pistons. Finally, according to the prediction results for the concrete piston at different life stages, a replacement warning point is established to support inventory management and replacement planning for the concrete piston.

Assessment of groundwater nitrate contamination hazard in a semi-arid region by using integrated parametric IPNOA and data-driven logistic regression models

Environmental Monitoring and Assessment ◽

10.1007/s10661-018-7013-8 ◽

2018 ◽

Vol 190 (11) ◽

Cited By ~ 17

Author(s):

Hossein Mojaddadi Rizeei ◽

Omer Saud Azeez ◽

Biswajeet Pradhan ◽

Hayder Hassan Khamees

Keyword(s):

Logistic Regression ◽

Arid Region ◽

Regression Models ◽

Nitrate Contamination ◽

Data Driven ◽

Logistic Regression Models ◽

Semi Arid Region ◽

Groundwater Nitrate Contamination ◽

Semi Arid

Alternative Algorithm for Automatically Driving Best-Fit Building Energy Baseline Models Using a Data—Driven Grid Search

Sustainability ◽

10.3390/su11246976 ◽

2019 ◽

Vol 11 (24) ◽

pp. 6976

Author(s):

Suwon Song ◽

Chun Gun Park

Keyword(s):

Change Point ◽

Regression Models ◽

Building Energy ◽

Measured Data ◽

Data Driven ◽

Grid Search ◽

Point Model ◽

Change Point Model ◽

Best Fit ◽

Optimal Change

Change-point regression models are often used to develop building energy baselines that can be used to predict energy use and determine energy savings during a given performance period. However, the reliability of building energy baselines can depend on how well the change-point model fits the data measured during the baseline period. This research proposes the use of segmented linear regression models with one or two change points for automatically driving best-fit building energy baseline models, along with an algorithm using a data-driven grid search to find the optimal change point(s) within a given data boundary for the proposed models. The algorithm was programmed and tested with actual measured data (e.g., daily gas and electricity use) for case-study buildings. Graphical and statistical analysis was also performed to validate its reliability within acceptable deviations of an overall coefficient of variation of the root mean squared error (i.e., CV(RMSE)) of 1%, as compared to the results derived from the ASHRAE Inverse Model Toolkit (IMT) that was developed as a public domain program to manually derive the change-point model with user specified parameters. Consequently, it is expected that the algorithm can be applied for automatically deriving best-fit building energy baseline models with optimal change point(s) from measured data.
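
The idea of a grid search for the optimal change point can be sketched as below. This is an illustration under stated assumptions, not the authors' algorithm: it fits a single-change-point heating model, energy = b0 + b1 * max(0, cp - temperature), by ordinary least squares at each candidate change point and keeps the candidate with the smallest sum of squared errors. The function names, the model form, and the synthetic data are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, sse), or None when x has no variation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse


def grid_search_change_point(temps, energy, candidates):
    """Pick the candidate change point with the smallest SSE."""
    best = None
    for cp in candidates:
        # Heating-type model: energy rises as temperature drops below cp.
        xs = [max(0.0, cp - t) for t in temps]
        fit = fit_line(xs, energy)
        if fit is None:
            continue  # candidate lies outside the data; nothing to fit
        b0, b1, sse = fit
        if best is None or sse < best["sse"]:
            rmse = math.sqrt(sse / len(energy))
            best = {"cp": cp, "b0": b0, "b1": b1, "sse": sse,
                    "cv_rmse": rmse / (sum(energy) / len(energy))}
    return best


if __name__ == "__main__":
    # Synthetic daily data: base load 10, slope 2 below a 15-degree change point.
    temps = [float(t) for t in range(0, 31)]
    energy = [10.0 + 2.0 * max(0.0, 15.0 - t) for t in temps]
    print(grid_search_change_point(temps, energy, range(5, 26)))
```

The CV(RMSE) here divides by n rather than by the residual degrees of freedom, a simplification for the sketch; a finer candidate grid trades run time for resolution in the located change point.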

Data-driven rate-optimal specification testing in regression models

The Annals of Statistics ◽

10.1214/009053604000001200 ◽

2005 ◽

Vol 33 (2) ◽

pp. 840-870 ◽

Cited By ~ 44

Author(s):

Emmanuel Guerre ◽

Pascal Lavergne

Keyword(s):

Regression Models ◽

Data Driven ◽

Specification Testing

That’s Cool. Computational Sociolinguistic Methods for Investigating Individual Lexico-grammatical Variation

Frontiers in Artificial Intelligence ◽

10.3389/frai.2020.547531 ◽

2021 ◽

Vol 3 ◽

Author(s):

Hans-Jörg Schmid ◽

Quirin Würschinger ◽

Sebastian Fischer ◽

Helmut Küchenhoff

Keyword(s):

Individual Variation ◽

Regression Models ◽

Semantic Information ◽

Language Change ◽

Data Driven ◽

Social Variables ◽

Ideal Position ◽

Social Variation ◽

Grammatical Variation

The present study deals with variation in the use of lexico-grammatical patterns and emphasizes the need to embrace individual variation. Targeting the pattern that’s adj (as in that’s right, that’s nice or that’s okay) as a case study, we use a tailor-made Python script to systematically retrieve grammatical and semantic information about all instances of this construction in BNC2014 as well as sociolinguistic information enabling us to study social and individual lexico-grammatical variation among speakers who have used this pattern. The dataset amounts to 4,394 tokens produced by 445 speakers using 159 adjective types in 931 conversations. Using detailed descriptive statistics and mixed-effects regression models, we show that while the choice of some adjectives is partly determined by social variables, situational and especially individual variation is rampant overall. Adopting a cognitive-linguistic perspective and relying on the notion of entrenchment, we interpret these findings as reflecting individual speakers' routines. We argue that computational sociolinguistics is in an ideal position to contribute to the data-driven investigation of individual lexico-grammatical variation and encourage computational sociolinguists to grab this opportunity. For the routines of individual speakers ultimately both underlie and compromise systematic social variation and trigger and steer well-known types of language change including grammaticalization, pragmaticalization and change by invited inference.

Exploring the Association between Compliance with Measures to Prevent the Spread of COVID-19 and Big Five Traits with Bayesian Generalized Linear Model

10.31234/osf.io/sf93m ◽

2021 ◽

Author(s):

Hyemin Han

Keyword(s):

Regression Model ◽

Linear Model ◽

Big Five ◽

Generalized Linear Model ◽

Regression Models ◽

Large Scale ◽

Negative Association ◽

Data Driven ◽

Big Five Traits ◽

Bayesian Generalized Linear Model

Research has examined the association between people’s compliance with measures to prevent the spread of COVID-19 and personality traits. However, previous studies were conducted with relatively small datasets and employed frequentist analysis, which does not allow data-driven model exploration. To address these limitations, a large-scale international dataset, the COVIDiSTRESS Global Survey dataset, was explored with a Bayesian generalized linear model that enables identification of the best regression model. The best regression models predicting participants’ compliance from Big Five traits were explored. The findings demonstrated, first, that all Big Five traits except extroversion were positively associated with compliance with general measures and distancing. Second, neuroticism, extroversion, and agreeableness were positively associated with the perceived cost of complying with the measures, while conscientiousness showed a negative association. The findings and implications of the present study are discussed.
