Bootstrap joint prediction regions for sequences of missing values in spatio-temporal datasets

Author(s):  
Maria Lucia Parrella ◽  
Giuseppina Albano ◽  
Cira Perna ◽  
Michele La Rocca

Abstract Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection and spatio-temporal relationships. To take into account the uncertainty in the point forecast, prediction intervals may be of interest. In particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain missing paths of the time components in a spatio-temporal framework, with global probability $1-\alpha$. In many applications, requiring coverage of the whole missing sample path may appear too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that contain all but a small number k of the elements of the missing paths with probability $1-\alpha$. A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction and to compare it with some alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios.
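As a rough illustration of the benchmark procedure the abstract mentions, the following sketch builds a joint region from bootstrap replicates of a missing path by applying a Bonferroni-style correction to the pointwise quantile levels. This is an assumption-laden simplification, not the authors' bootstrap scheme; the function name and toy data are illustrative.

```python
import numpy as np

def bonferroni_joint_region(boot_paths, alpha=0.05):
    """Joint prediction region for a missing path of length m, built from
    bootstrap replicates with a Bonferroni-style coverage correction:
    each pointwise interval gets level 1 - alpha/m, so the whole path is
    covered with probability at least 1 - alpha (conservatively)."""
    B, m = boot_paths.shape            # B bootstrap replicates, m time points
    a = alpha / m                      # corrected pointwise level
    lower = np.quantile(boot_paths, a / 2, axis=0)
    upper = np.quantile(boot_paths, 1 - a / 2, axis=0)
    return lower, upper

# toy usage: 2000 bootstrap replicates of a missing path of 10 time points
rng = np.random.default_rng(0)
paths = rng.normal(size=(2000, 10)).cumsum(axis=1)
lo, hi = bonferroni_joint_region(paths, alpha=0.05)
```

The correction is conservative, which is one reason the paper's resampling-based regions can be more informative, especially for long missing sequences.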

2020 ◽  
Author(s):  
Alexandre Hippert-Ferrer ◽  
Yajing Yan ◽  
Philippe Bolon

Time series analysis constitutes a thriving subject in satellite image derived displacement measurement, especially since the launch of the Sentinel satellites, which provide free and systematic satellite image acquisitions with extended spatial coverage and reduced revisit time. Large volumes of satellite images are available for monitoring numerous targets at the Earth’s surface, which allows for significant improvements in displacement measurement precision by means of advanced multi-temporal methods. However, satellite image derived displacement time series can suffer from missing data, mainly due to technical limitations of the ground displacement computation methods (e.g. offset tracking) and to surface property changes from one acquisition to another. Missing data can hinder the full exploitation of the displacement time series, which can potentially weaken both knowledge and interpretation of the physical phenomenon under observation. Therefore, an efficient missing data imputation approach is of particular importance for data completeness. In this work, an iterative method, namely extended Expectation Maximization - Empirical Orthogonal Functions (EM-EOF), is proposed to retrieve missing values in satellite image derived displacement time series. The method uses both spatial and temporal correlations in the displacement time series for reconstruction. For this purpose, the spatio-temporal covariance of the time series is iteratively estimated and decomposed into different EOF modes by solving the eigenvalue problem in an EM-like scheme. To determine the optimal number of EOF modes, two robust metrics are defined: the cross-validation error and a confidence index obtained from eigenvalue uncertainty. The former metric is also used as a convergence criterion for the iterative update of the missing values. Synthetic simulations are first performed to demonstrate the missing data imputation ability of the extended EM-EOF method in cases of complex displacement, gap and noise behaviors. Then, the method is applied to time series of offset tracking displacement measurements from Sentinel-2 images acquired between January 2017 and September 2019 over Fox Glacier in the Southern Alps of New Zealand. Promising results confirm the efficiency of the extended EM-EOF method for missing data imputation of satellite image derived displacement time series.
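The core EOF reconstruction loop can be sketched as follows. This is a minimal EM-like gap filler assuming a fixed number of modes on a space-by-time matrix; it omits the extended EM-EOF's cross-validation-based mode selection and confidence index, so it illustrates the mechanism rather than the authors' method.

```python
import numpy as np

def eof_impute(X, n_modes=3, n_iter=50, tol=1e-6):
    """Iterative EOF reconstruction of missing values (EM-like scheme):
    initialise the gaps, then alternate a truncated-SVD reconstruction
    and an update of the missing entries until convergence."""
    X = np.array(X, dtype=float)
    mask = np.isnan(X)                      # True where data are missing
    X[mask] = np.nanmean(X)                 # crude initial fill
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_rec = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]
        delta = np.max(np.abs(X[mask] - X_rec[mask]))
        X[mask] = X_rec[mask]               # only the gaps are updated
        if delta < tol:
            break
    return X
```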


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to readily available software, using these modern missing data methods poses no major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. This article is Part 1: it first introduces Rubin’s classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation; second, it presents a selection of visualization tools, available in different R packages, for describing and exploring missing data structures.
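As a hypothetical illustration of Rubin's mechanisms (not drawn from the article itself, which focuses on the taxonomy and R-based visualization), the following Python sketch simulates the three canonical mechanisms on toy data; variable names and missingness rates are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.lognormal(10, 0.5, n),
})

# MCAR: missingness is independent of all data
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: probability of missing income depends on the *observed* age
mar = df.copy()
p = 0.05 + 0.4 * (mar["age"] > 60)
mar.loc[rng.random(n) < p, "income"] = np.nan

# MNAR: probability depends on the (unobserved) income value itself
mnar = df.copy()
p = 0.05 + 0.4 * (mnar["income"] > mnar["income"].median())
mnar.loc[rng.random(n) < p, "income"] = np.nan
```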


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Abstract Background Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. Strategies in the literature rely on either removing samples with missing values, as in complete case analysis (CCA), or imputing the missing information, as in predictive mean matching (PMM) implemented in MICE. Limitations of these strategies include information loss and uncertainty about how close the imputed values are to the true missing values. Further, in scenarios with piecemeal medical data, these strategies must wait for data collection to be completed before a complete dataset is available for statistical modeling. Method and results This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to build the statistical models. It segments the original dataset into small complete datasets, using hierarchical clustering, and then runs Bayesian regression on each of the small complete datasets. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated on both simulated data and real studies, and the results are better than or on par with approaches such as CCA and PMM. Conclusion The DMU approach provides an alternative to the existing approaches of information elimination and imputation when processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, it can also be applied to longitudinal, categorical and time-to-event biological data.
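The updating idea at the heart of DMU can be illustrated with a simplified conjugate Bayesian regression in which the posterior from one complete chunk becomes the prior for the next. This sketch assumes a known noise variance and skips the hierarchical clustering step, so it illustrates sequential posterior updating rather than the authors' implementation.

```python
import numpy as np

def bayes_update(mu0, Sigma0, X, y, noise_var=1.0):
    """One conjugate Gaussian update of regression coefficients:
    the posterior from chunk t becomes the prior for chunk t+1."""
    prec0 = np.linalg.inv(Sigma0)
    prec_n = prec0 + X.T @ X / noise_var
    Sigma_n = np.linalg.inv(prec_n)
    mu_n = Sigma_n @ (prec0 @ mu0 + X.T @ y / noise_var)
    return mu_n, Sigma_n

rng = np.random.default_rng(2)
beta_true = np.array([1.0, -2.0, 0.5])
mu, Sigma = np.zeros(3), 10.0 * np.eye(3)    # vague prior

# a stream of small *complete* chunks, as in the DMU segmentation idea
for _ in range(5):
    X = rng.normal(size=(40, 3))
    y = X @ beta_true + rng.normal(scale=1.0, size=40)
    mu, Sigma = bayes_update(mu, Sigma, X, y)

print(mu)   # the posterior mean approaches beta_true as chunks accrue
```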


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data often pose a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. Missing data under the MCAR, MAR, and NMAR mechanisms are considered, at five missingness levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that imputation under the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods considered and can thus be considered appropriate for analyzing air quality data.
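missForest itself is an R package; a commonly used Python approximation, shown here as a sketch, wraps a random forest regressor in scikit-learn's IterativeImputer. The toy data and missingness rate are assumptions, standing in for the log-transformed pollutant series plus climate covariates described above.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))                # placeholder feature matrix
X[rng.random(X.shape) < 0.2] = np.nan        # ~20% missing (roughly MCAR here)

# iterative round-robin imputation with a random forest per column,
# the same scheme missForest uses
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```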


Agriculture ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 727
Author(s):  
Yingpeng Fu ◽  
Hongjian Liao ◽  
Longlong Lv

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean imputation and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) for the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased after imputation, and there were no statistically significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in bulk density (BD) using RF, SVR, ANN, mean, and MI, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better for imputing the missing data in UNSODA.
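The quantitative evaluation step used above, masking observed entries, imputing, and scoring RMSE and MAE against the held-out truth, can be sketched as follows; the imputer choice and masking fraction are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

def imputation_error(X_full, imputer, frac=0.1, seed=0):
    """Mask a fraction of the observed entries, impute them, and report
    RMSE and MAE against the held-out true values."""
    rng = np.random.default_rng(seed)
    X = X_full.copy()
    mask = rng.random(X.shape) < frac
    X[mask] = np.nan
    X_hat = imputer.fit_transform(X)
    err = X_hat[mask] - X_full[mask]
    return np.sqrt(np.mean(err ** 2)), np.mean(np.abs(err))

# toy usage with mean imputation as the candidate method
X_full = np.random.default_rng(4).normal(size=(300, 5))
rmse, mae = imputation_error(X_full, SimpleImputer(strategy="mean"))
```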


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern, sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers originating from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, conventional techniques address only the missing value problem and provide no relief from outliers; outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that addresses both missing values and outliers. We evaluated the performance of the proposed method against conventional and recently developed imputation techniques using both artificially generated data and experimentally measured data, in the absence and presence of different rates of outliers. Performance on both artificial and real metabolomics data indicates the superiority of the proposed kernel weight-based technique over the existing alternatives. For user convenience, an R package implementing the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
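The abstract does not spell out the algorithm, so the following is only a sketch of the general kernel-weighting idea, distant (outlying) donor samples receive small weight, and not the tWLSA method from the package linked above.

```python
import numpy as np

def kernel_weighted_impute(X, bandwidth=1.0):
    """Illustrative kernel-weighted imputation: each missing cell is
    filled with a Gaussian-kernel-weighted mean over donor rows, so
    rows far from the target (e.g. outliers) contribute little."""
    X = np.array(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        donors = np.where(~np.isnan(X[:, j]))[0]   # rows observed at column j
        weights = []
        for d in donors:
            common = ~np.isnan(X[i]) & ~np.isnan(X[d])   # shared observed columns
            dist = (np.linalg.norm(X[i, common] - X[d, common])
                    if common.any() else np.inf)
            weights.append(np.exp(-0.5 * (dist / bandwidth) ** 2))
        weights = np.asarray(weights)
        if weights.sum() > 0:
            out[i, j] = np.average(X[donors, j], weights=weights)
        else:
            out[i, j] = np.nanmean(X[:, j])        # fallback: column mean
    return out
```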


1957 ◽  
Vol 103 (433) ◽  
pp. 758-772 ◽  
Author(s):  
Victor Meyer ◽  
H. Gwynne Jones

Various investigations into the effects of brain injury on psychological test performance (Weisenburg and McBride, 1935; Patterson and Zangwill, 1944; Anderson, 1951; McFie and Piercy, 1952; Bauer and Becka, 1954; Milner, 1954) suggest the overall conclusion that patients with left hemisphere lesions are relatively poor at verbal tasks, while those with right-sided lesions do worst at practical tasks, particularly the manipulation of spatial or spatio-temporal relationships. Heilbrun's (1956) study confirmed that verbal deficits result from left-sided lesions, but his left and right hemisphere groups produced almost identical scores on spatial tests. In so far as these workers paid attention to the specific sites of the lesions, their findings indicate that the pattern of test performance is a function of the hemisphere in which the lesion occurs rather than of its specific locus.


Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Abstract Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that it can be achieved through the proposed data-driven approach.
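A minimal version of the feature-wise selection idea, masking a fraction of each feature's observed values, scoring candidate imputers, and keeping the per-feature winner, might look like the sketch below; the candidate set, masking fraction, and RMSE criterion are assumptions rather than the paper's exact five methods.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

def best_imputer_per_feature(X, candidates, frac=0.1, seed=0):
    """For each feature with gaps, mask some *observed* entries, impute
    with every candidate, and keep the method with the lowest RMSE on
    the held-out values (a feature-wise selection sketch)."""
    rng = np.random.default_rng(seed)
    choice = {}
    for j in range(X.shape[1]):
        obs = np.where(~np.isnan(X[:, j]))[0]
        if obs.size == 0:
            continue                         # nothing observed to hold out
        held = rng.choice(obs, size=max(1, int(frac * obs.size)), replace=False)
        X_masked = X.copy()
        X_masked[held, j] = np.nan
        errs = {}
        for name, imp in candidates.items():
            X_hat = imp.fit_transform(X_masked)
            errs[name] = np.sqrt(np.mean((X_hat[held, j] - X[held, j]) ** 2))
        choice[j] = min(errs, key=errs.get)  # per-feature winner
    return choice

candidates = {"mean": SimpleImputer(strategy="mean"),
              "median": SimpleImputer(strategy="median"),
              "knn": KNNImputer(n_neighbors=5)}
```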


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Mar Rodríguez-Girondo ◽  
Niels van den Berg ◽  
Michel H. Hof ◽  
Marian Beekman ◽  
Eline Slagboom

Abstract Background Although human longevity tends to cluster within families, genetic studies on longevity have had limited success in identifying longevity loci. One of the main causes of this limited success is the selection of participants. Studies generally include sporadically long-lived individuals, i.e. individuals with the longevity phenotype but without a genetic predisposition for longevity. The inclusion of these individuals causes phenotype heterogeneity, which results in power reduction and bias. A way to avoid sporadically long-lived individuals and reduce sample heterogeneity is to include family history of longevity as a selection criterion, using a longevity family score. A main challenge when developing family scores is the large difference in family sizes, due either to real differences in sibship sizes or to missing data. Methods We discuss the statistical properties of two existing longevity family scores, the Family Longevity Selection Score (FLoSS) and the Longevity Relatives Count (LRC) score, and evaluate their performance in dealing with differential family size. We propose a new longevity family score, the mLRC score, an extension of the LRC based on random effects modeling, which is robust to family size and missing values. The performance of the new mLRC as a selection tool was evaluated in an intensive simulation study and illustrated in a large real dataset, the Historical Sample of the Netherlands (HSN). Results Empirical scores such as the FLoSS and LRC cannot properly deal with differential family size and missing data. Our simulation study showed that the mLRC is not affected by family size and provides more accurate selections of long-lived families. The analysis of 1105 sibships from the Historical Sample of the Netherlands showed that the selection of long-lived individuals based on the mLRC score predicts excess survival in the validation set better than the selection based on the LRC score. Conclusions Model-based score systems such as the mLRC score help to reduce heterogeneity in the selection of long-lived families. The power of future studies into the genetics of longevity can likely be improved, and their bias reduced, by selecting long-lived cases using the mLRC.
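To illustrate why a model-based score can be more robust to family size than an empirical proportion, here is a hypothetical sketch contrasting a raw LRC-style fraction with a simple empirical-Bayes (beta-binomial) shrinkage. The actual mLRC is a random-effects model described in the paper; the prior settings below are assumptions chosen only to show the shrinkage effect.

```python
import numpy as np

def lrc(top_survivors, n_relatives):
    """Empirical LRC-style score: the fraction of relatives belonging
    to the top decile of cohort-specific survival."""
    return top_survivors / n_relatives

def shrunken_lrc(top_survivors, n_relatives, prior_mean=0.1, prior_strength=10):
    """Simplified empirical-Bayes (beta-binomial) shrinkage of the LRC:
    small families are pulled toward the population rate, mimicking the
    robustness to family size that a random-effects mLRC provides.
    prior_mean and prior_strength are illustrative assumptions."""
    a = prior_mean * prior_strength
    b = (1 - prior_mean) * prior_strength
    return (top_survivors + a) / (n_relatives + a + b)

# a family of 2 with 1 top-decile survivor vs a family of 20 with 10:
# the raw score treats them identically, the shrunken score does not
print(lrc(1, 2), shrunken_lrc(1, 2))        # 0.5 vs ~0.17
print(lrc(10, 20), shrunken_lrc(10, 20))    # 0.5 vs ~0.37
```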

