Missing values and lack of information in water management datasets: an approach based on Bayesian Networks

Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

<p>Environmental data often present missing values or a lack of information that makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for an Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020, and present two issues:</p><ul><li>In the Guadarranque River, there are no data for the dam-level variable from May to August 2020, probably because of sensor damage.</li> <li>No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.</li> </ul><p>To avoid removing the dam-level variable from the entire model (or discarding those missing months), or even giving up on modelling one river system, this abstract provides modelling solutions based on Bayesian networks (BNs) that overcome these limitations.</p><p><em>Guadarranque River. Missing values.</em></p><p>The dataset contains 75,687 observations of 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt from the complete dataset (until September 2019) with the aim of predicting the dam-level variable as accurately as possible. A scenario was then run with data from October 2019 to March 2020, comparing the predictions for the target variable with the real data. Results show that both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.</p><p>In addition, a BN with an expert-elicited structure was learnt from the real data and from both datasets with values imputed by NB and TAN. Results show that the models learnt from imputed data (NB: 3.33; TAN: 3.07) improve on the error of the model learnt from the real data (4.26).</p><p><em>Guadiaro River. Lack of information.</em></p><p>The dataset contains 73,636 observations of 14 continuous variables.
Since the rainfall variables present a high percentage of zero values (over 94%), they were discretised by the Equal Frequency method with 4 intervals. The aim is to predict flooding risk in the coastal area, but no data are collected from this area. Thus, an unsupervised classification based on hybrid BNs was performed: the target variable classifies all observations into a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Results show a total of 3 groups:</p><ul><li>Group 0, “Normal situation”: rainfall values equal to 0 and a very low mean river level.</li> <li>Group 1, “Storm situation”: mean rainfall values are over 0.3 mm and all river-level variables double their mean with respect to Group 0.</li> <li>Group 2, “Extreme situation”: both rainfall and river-level mean values are the highest, far from both previous groups.</li> </ul><p>Although validation shows that this methodology is able to identify extreme events, further work is needed. To that end, data from the autumn-winter season (October 2020 to March 2021) will be used. With this new information it will be possible to check whether the latest extreme events (the flooding event during December and Storm Filomena during January) are identified.</p>
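The equal-frequency discretisation mentioned above is complicated by the very high share of zero rainfall values: with over 94% zeros, most quantile edges coincide at zero. A minimal Python sketch on synthetic data (the abstract does not say how the authors handled the ties; a dedicated zero bin is one common workaround) is:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical hourly rainfall: ~94% zeros, as reported for the Guadiaro data.
rain = pd.Series(np.where(rng.random(1000) < 0.94, 0.0,
                          rng.gamma(2.0, 1.5, 1000)))

# Naive equal-frequency binning fails here: with >75% of values tied at
# zero, all inner quantile edges collapse onto 0. One workaround is a
# dedicated "zero" bin plus equal-frequency bins over the positives only.
positive = rain[rain > 0]
labels = pd.Series("zero", index=rain.index)
labels[rain > 0] = pd.qcut(positive, q=3, labels=["low", "mid", "high"]).astype(str)
print(labels.value_counts())
```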

2021 ◽  
Vol 37 (2) ◽  
pp. 433-459
Author(s):  
Sander Scholtus ◽  
Jacco Daalmans

Abstract This article discusses methods for evaluating the variance of estimated frequency tables based on mass imputation. We consider a general set-up in which data may be available from both administrative sources and a sample survey. Mass imputation involves predicting the missing values of a target variable for the entire population. The motivating application for this article is the Dutch virtual population census, for which it has been proposed to use mass imputation to estimate tables involving educational attainment. We present a new analytical design-based variance estimator for a frequency table based on mass imputation. We also discuss a more general bootstrap method that can be used to estimate this variance. Both approaches are compared in a simulation study on artificial data and in an application to real data of the Dutch census of 2011.
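The bootstrap approach described above can be sketched on toy data: fit the imputation model on a resampled survey, mass-impute the target for the whole population, tabulate, and take the variance of each cell count over replicates. All data and model choices below are illustrative, not the Dutch census setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: binary register covariate x; target y observed
# only for a small "survey" subset, missing for everyone else.
N = 10_000
x = rng.integers(0, 2, N)
y_true = (rng.random(N) < np.where(x == 1, 0.7, 0.3)).astype(int)
sampled = rng.random(N) < 0.05          # ~5% survey sample
y_obs = np.where(sampled, y_true, -1)   # -1 marks a missing value

def impute_and_tabulate(sample_idx):
    # Fit p(y=1 | x) on the (resampled) survey, then mass-impute the
    # rest of the population by drawing from the fitted conditional model.
    p1 = np.array([y_true[sample_idx][x[sample_idx] == g].mean() for g in (0, 1)])
    y_imp = np.where(sampled, y_obs, (rng.random(N) < p1[x]).astype(int))
    return np.array([[np.sum((x == g) & (y_imp == v)) for v in (0, 1)]
                     for g in (0, 1)])

sample_idx = np.flatnonzero(sampled)
table = impute_and_tabulate(sample_idx)

# Bootstrap the survey sample to estimate the variance of each cell count.
B = 200
boots = np.stack([impute_and_tabulate(rng.choice(sample_idx, sample_idx.size))
                  for _ in range(B)])
cell_var = boots.var(axis=0)
```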


2019 ◽  
Vol 96 (3) ◽  
pp. 1041-1065 ◽  
Author(s):  
Wilmer Rey ◽  
E. Tonatiuh Mendoza ◽  
Paulo Salles ◽  
Keqi Zhang ◽  
Yi-Chen Teng ◽  
...  

2015 ◽  
Vol 26 (6) ◽  
pp. 2586-2602 ◽  
Author(s):  
Irantzu Barrio ◽  
Inmaculada Arostegui ◽  
María-Xosé Rodríguez-Álvarez ◽  
José-María Quintana

When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, owing to the loss of information and power, it is a common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method is a relevant issue when developing prediction models. Without recommending the categorisation of continuous predictors, our aim is to propose a valid way to do it whenever clinical researchers consider it necessary. This paper focuses on categorising a continuous predictor within a logistic regression model in such a way that the best discriminative ability is obtained, in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated in settings where the optimal cut points' location is known in theory or in practice. In addition, the proposed method is applied to a real dataset of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study, where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in both a univariable and a multivariable setting.
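The core idea, choosing the cut point that maximises the AUC of the dichotomised predictor in a univariable setting, can be sketched as follows. The variable names and the simulated relationship are illustrative; this is not the authors' estimation algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data standing in for PCO2: higher values raise the odds of
# a severe outcome, with a true change point around 45 (both illustrative).
n = 2000
pco2 = rng.normal(45, 10, n)
p = 1 / (1 + np.exp(-0.15 * (pco2 - 45)))
severe = (rng.random(n) < p).astype(int)

def auc_binary(z, y):
    # The AUC of a 0/1 marker equals (sensitivity + specificity) / 2.
    sens = z[y == 1].mean()
    spec = 1 - z[y == 0].mean()
    return (sens + spec) / 2

# Grid-search the cut point maximising the AUC of the dichotomised predictor.
grid = np.quantile(pco2, np.linspace(0.05, 0.95, 91))
aucs = [auc_binary((pco2 > c).astype(float), severe) for c in grid]
best = grid[int(np.argmax(aucs))]
```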


2021 ◽  
Author(s):  
Heiko Apel ◽  
Sergiy Vorogushyn ◽  
Mostafa Farrag ◽  
Nguyen Viet Dung ◽  
Melanie Karremann ◽  
...  

<p>Urban flash floods caused by heavy convective precipitation pose a growing threat to communes worldwide, due to the increasing intensity and frequency of convective precipitation in a warming atmosphere. Thus, flood risk management plans adapted to the current flood risk, but also capable of managing future risks, are of high importance. These plans necessarily need model-based pluvial flood risk simulations. In an urban environment, these simulations must have a high spatial and temporal resolution in order to support site-specific management solutions. Moreover, the effect of the sewer system needs to be included to achieve realistic inundation simulations, but also to assess the effectiveness of the sewer system and its fitness for future changes in the pluvial hazard. The setup of these models, however, typically requires a large amount of input data, a high degree of modelling expertise, and a long time to set up the model and finally run the simulations. Therefore, most communes cannot perform this task.</p><p>In order to provide model-based pluvial urban flood hazard, and finally risk, assessments for a large number of communes, the model system RIM<em>urban</em> was developed. The core of the system is a simplified raster-based 2D hydraulic model simulating urban surface inundation at high spatial resolution. The model is implemented on GPUs for massive parallelization. The specific urban hydrology is considered by a capacity-based simulation of the sewer system, infiltration on non-sealed surfaces, and flow routing around buildings. The model thus accounts for the specific urban hydrological features, but with simplified approaches. Owing to these simplifications, the model setup has comparatively low data requirements, which can be covered with open data in most cases.
The core data required are a high-resolution DEM, a layer showing the buildings, and a land use map.</p><p>The spatially distributed rainfall input can be derived from local precipitation records or from an analysis of weather radar records of heavy precipitation events. A catalogue of heavy rainstorms across Germany was derived from radar observations of the past 19 years. This catalogue serves as input for pluvial risk simulations for individual communes in Germany, as well as a catalogue of possible extreme events for the current climate. Future changes in these extreme events will be estimated based on regional climate simulations of a ΔT (1.5°C, 2°C) warmer world.</p><p>RIM<em>urban</em> simulates the urban inundation caused by these events, as well as the stress on the sewer system. Based on the inundation maps, the damage to residential buildings will be estimated and further developed into a pluvial urban flood risk assessment. Because of the comparatively simple model structure and low data demand, the model setup can easily be automatized and transferred to most small to medium-sized communes in Europe, and even beyond if the damage estimation is modified. RIM<em>urban</em> is thus seen as a generally applicable screening tool for urban pluvial flood risk and a starting point for adapted risk management plans.</p>
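The capacity-based treatment of sewer drainage and infiltration described above can be illustrated on a toy raster (all rates are invented, not RIMurban parameters): each cell loses water up to the sewer capacity on sealed surfaces or up to the infiltration rate on non-sealed ones, and the remainder feeds the 2D surface routing.

```python
import numpy as np

# Toy raster sketch of a capacity-based loss step (illustrative rates only).
rain = np.full((4, 4), 30.0)        # mm/h of heavy convective rainfall
sealed = np.zeros((4, 4), bool)
sealed[:, :2] = True                # left half: sealed urban surface

sewer_capacity = 12.0               # mm/h drained where the sewer operates
infiltration = 20.0                 # mm/h infiltrated on non-sealed surfaces

# Effective rainfall that remains on the surface and feeds 2D routing:
loss = np.where(sealed, sewer_capacity, infiltration)
effective = np.clip(rain - loss, 0.0, None)
```

Rainfall below the local capacity produces no surface water at all, which is exactly why such a model can also diagnose when the sewer system is overwhelmed.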


Biometrika ◽  
2016 ◽  
Vol 103 (1) ◽  
pp. 175-187 ◽  
Author(s):  
Jun Shao ◽  
Lei Wang

Abstract To estimate unknown population parameters based on data having nonignorable missing values with a semiparametric exponential tilting propensity, Kim & Yu (2011) assumed that the tilting parameter is known or can be estimated from external data, in order to avoid the identifiability issue. To remove this serious limitation on the methodology, we use an instrument, i.e., a covariate related to the study variable but unrelated to the missing data propensity, to construct some estimating equations. Because these estimating equations are semiparametric, we profile the nonparametric component using a kernel-type estimator and then estimate the tilting parameter based on the profiled estimating equations and the generalized method of moments. Once the tilting parameter is estimated, so is the propensity, and then other population parameters can be estimated using the inverse propensity weighting approach. Consistency and asymptotic normality of the proposed estimators are established. The finite-sample performance of the estimators is studied through simulation, and a real-data example is also presented.
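The final inverse-propensity-weighting step can be illustrated with a toy nonignorable-missingness example, where the response probability depends on the study variable itself. Here the true propensity is used in place of the paper's tilting-based estimate:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy nonignorable missingness: the chance of observing y grows with y.
n = 50_000
y = rng.normal(10, 2, n)
pi = 1 / (1 + np.exp(-0.3 * (y - 10)))     # response propensity depends on y
delta = (rng.random(n) < pi).astype(int)   # response indicator

# Inverse propensity weighting recovers the population mean; the naive
# mean over respondents is biased upwards, since high y respond more.
ipw_mean = np.sum(delta * y / pi) / n
naive_mean = y[delta == 1].mean()
```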


Author(s):  
Miroslav Hudec ◽  
Miljan Vučetić ◽  
Mirko Vujošević

Data mining methods based on fuzzy logic have been developed recently and have become an increasingly important research area. In this chapter, the authors examine possibilities for discovering potentially useful knowledge from a relational database by integrating fuzzy functional dependencies and linguistic summaries. Both methods use fuzzy logic tools for data analysis and for acquiring and representing expert knowledge. Fuzzy functional dependencies can detect whether a dependency between two examined attributes exists across the whole database. If a dependency exists only between parts of the examined attributes' domains, fuzzy functional dependencies cannot detect its character. Linguistic summaries are a convenient method for revealing this kind of dependency. Using fuzzy functional dependencies and linguistic summaries in a complementary way can mine valuable information from relational databases. Mining the intensities of dependencies between database attributes can support decision making, reduce the number of attributes in databases, and estimate missing values. The proposed approach is evaluated with case studies using real data from official statistics. Strengths and weaknesses of the described methods are discussed. At the end of the chapter, topics for further research activities are outlined.
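A linguistic summary of the kind referred to above can be sketched with Yager-style fuzzy quantifiers: the truth of a summary such as "most records have a high value" is the quantifier's membership evaluated at the mean membership of the records. The data and membership functions below are illustrative, not from the chapter:

```python
import numpy as np

# Illustrative attribute values, e.g. unemployment rates per district (%).
unemployment = np.array([3.1, 4.5, 8.2, 9.7, 10.4, 11.0, 12.3, 5.9])

def mu_high(x):
    # Fuzzy set "high": 0 below 6%, 1 above 10%, linear in between.
    return np.clip((x - 6.0) / 4.0, 0.0, 1.0)

def mu_most(p):
    # Fuzzy quantifier "most": 0 below 30% of records, 1 above 80%.
    return np.clip((p - 0.3) / 0.5, 0.0, 1.0)

# Truth degree of "most districts have high unemployment".
truth = mu_most(mu_high(unemployment).mean())
```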


2019 ◽  
Vol 29 (1) ◽  
pp. 178-188
Author(s):  
Getachew A Dagne

In clinical research and practice, there is often an interest in assessing the effect of time-varying predictors, such as the CD4/CD8 ratio, on immune recovery following antiretroviral therapy. Such predictors are measured with errors, and ignoring those measurement errors during data analysis may lead to biased results. Though parametric methods have been used to reduce biases, they usually depend on untestable assumptions. To relax those assumptions, this paper presents semiparametric mixed-effects models that deal with predictors having measurement errors and missing values. We develop a fully Bayesian approach for fitting these models and for discriminating between patients who are potentially progressors or nonprogressors to a severe disease condition (AIDS). The proposed methods are demonstrated using real data from an AIDS clinical study.


ISRN Agronomy ◽  
2013 ◽  
Vol 2013 ◽  
pp. 1-17 ◽  
Author(s):  
Sergio Arciniegas-Alarcón ◽  
Marisol García-Peña ◽  
Wojtek Janusz Krzanowski ◽  
Carlos Tadeu dos Santos Dias

This paper proposes five new imputation methods for unbalanced experiments with genotype-by-environment interaction (G×E). The methods use cross-validation by eigenvector, based on an iterative scheme with the singular value decomposition (SVD) of a matrix. To test the methods, we performed a simulation study using three complete matrices of real data, obtained from G×E interaction trials of peas, cotton, and beans, introducing lack of balance by randomly deleting in turn 10%, 20%, and 40% of the values in each matrix. The quality of the imputations was evaluated with the additive main effects and multiplicative interaction model (AMMI), using the root mean squared predictive difference (RMSPD) between the genotype and environmental parameters of the original data set and the set completed by imputation. The proposed methodology does not make any distributional or structural assumptions and does not have any restrictions regarding the pattern or mechanism of missing values.
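The iterative SVD scheme underlying the proposed methods can be sketched as follows: initialise the missing cells with column means, then alternate a rank-k SVD fit with refilling the missing cells. This is the basic EM-style scheme the paper builds on; its five variants add cross-validation by eigenvector, which is not shown here:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic G×E two-way table: 10 genotypes × 6 environments with an
# interaction pattern, then ~20% of cells deleted at random.
Y = rng.normal(50, 5, (10, 6)) + np.outer(rng.normal(0, 3, 10),
                                          rng.normal(0, 1, 6))
mask = rng.random(Y.shape) < 0.2
X = Y.copy()
X[mask] = np.nan

# Start from column means, then alternate low-rank SVD fit and refill.
col_means = np.nanmean(X, axis=0)
X[mask] = np.take(col_means, np.nonzero(mask)[1])
k = 2
for _ in range(100):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    fitted = (U[:, :k] * s[:k]) @ Vt[:k]
    X[mask] = fitted[mask]       # observed cells are never overwritten
```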

