Dynamic model updating (DMU) approach for statistical learning model building with missing data

2021
Vol 22 (1)
Author(s):
Rahi Jain
Wei Xu

Abstract
Background: Developing statistical and machine learning models on studies with missing information is a ubiquitous challenge in real-world biological research. Strategies in the literature rely on either removing samples with missing values, as in complete case analysis (CCA), or imputing the missing information, as in predictive mean matching (PMM) implemented in MICE. These strategies suffer from information loss and from imputed values that may not be close to the true missing values. Further, in scenarios where medical data arrive piecemeal, these strategies must wait for data collection to complete before a full dataset is available for statistical modeling.
Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to build the models. It segments the original dataset into small complete datasets, using hierarchical clustering, and then fits a Bayesian regression on each of the small complete datasets; predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU, evaluated on both simulated data and real studies, is better than or on par with other approaches such as CCA and PMM.
Conclusion: The DMU approach provides an alternative to the existing strategies of information elimination and imputation for processing datasets with missing values. While the study applies the approach to continuous cross-sectional data, it can be extended to longitudinal, categorical, and time-to-event biological data.
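To make the updating step concrete, the sketch below shows one way the core idea could look in code: a conjugate Bayesian linear regression whose posterior is carried forward as the prior for the next complete segment. The segmentation step (the paper uses hierarchical clustering) is abstracted into a user-supplied list of complete blocks, and all names, the prior, and the fixed noise variance are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the dynamic-updating idea, assuming complete blocks
# are already available; not the authors' implementation.
import numpy as np

def bayes_update(mu, prec, X, y, noise_var=1.0):
    """One conjugate Gaussian update of the coefficient posterior."""
    prec_new = prec + X.T @ X / noise_var
    mu_new = np.linalg.solve(prec_new, prec @ mu + X.T @ y / noise_var)
    return mu_new, prec_new

def dmu_fit(blocks, n_features, prior_scale=1e2):
    """Sequentially fold each complete block (X_i, y_i) into the posterior."""
    mu = np.zeros(n_features)
    prec = np.eye(n_features) / prior_scale   # weak prior precision
    for X, y in blocks:                       # each block has no missing cells
        mu, prec = bayes_update(mu, prec, X, y)
    return mu, prec

# Usage: two complete segments of a dataset with 3 predictors.
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
blocks = []
for _ in range(2):
    X = rng.normal(size=(50, 3))
    blocks.append((X, X @ beta + rng.normal(scale=0.5, size=50)))
print(dmu_fit(blocks, n_features=3)[0])      # posterior mean approx. beta
```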

2020
Vol 07 (02)
pp. 161-177
Author(s):
Oyekale Abel Alade
Ali Selamat
Roselina Sallehuddin

One major characteristic of data quality is completeness. Missing data are a significant problem in medical datasets: they lead to incorrect classification of patients and endanger patient health management. Many factors can cause values to be missing in medical databases. In this paper, we propose examining the causes of missing data in a medical dataset to ensure that the right imputation method is used in solving the problem. The missingness mechanism was studied to identify the missing-data pattern of the dataset and to determine a suitable imputation technique for generating complete datasets. The pattern shows that the missingness of the dataset used in this study is not monotone. Also, because single imputation techniques underestimate variance and ignore relationships among the variables, we used a multiple imputation technique that runs five iterations for the imputation of each missing value. All missing values in the dataset were regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier, and the results show improved accuracy on the imputed datasets. The work can be extended to compare the accuracy of the imputed datasets with the original dataset under different classifiers, such as support vector machines (SVM), radial basis function (RBF) networks, and ELMs.
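As a rough illustration of the multiple-imputation step, the sketch below uses scikit-learn's IterativeImputer as a stand-in for whatever MI tool the authors used (the abstract does not name it); five random draws mimic the five imputations, and the toy matrix is ours.

```python
# A hedged multiple-imputation sketch with scikit-learn; the tool and the
# data are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, np.nan],
              [8.0, 8.0, 9.0]])

# sample_posterior=True makes each run a random draw, so repeated runs
# with different seeds yield multiple plausible completed datasets.
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(5)
]
pooled = np.mean(imputed_sets, axis=0)  # simple pooling of the five draws
print(pooled)
```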


Author(s):
NITESH KUMAR
ONDŘEJ KUŽELKA
LUC DE RAEDT

Abstract Relational autocompletion is the problem of automatically filling in missing values in multi-relational data. We tackle this problem within the probabilistic logic programming framework of Distributional Clauses (DCs), which supports both discrete and continuous probability distributions. Within this framework, we introduce DiceML, an approach to learning both the structure and the parameters of DC programs from relational data (possibly with missing values). To realize this, DiceML integrates statistical modeling and DCs with rule learning. The distinguishing features of DiceML are that it (1) tackles autocompletion in relational data, (2) learns DCs extended with statistical models, (3) deals with both discrete and continuous distributions, (4) can exploit background knowledge, and (5) uses an expectation-maximization (EM) based algorithm to cope with missing data. The empirical results show the promise of the approach, even in the presence of missing data.
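DiceML's probabilistic-logic machinery is beyond a short snippet, but the EM idea it relies on for missing data can be sketched in its simplest incarnation: alternately imputing missing cells by their conditional Gaussian means and re-estimating the parameters. Everything below (the Gaussian model, the names, the simplified M-step) is our assumption, not part of DiceML.

```python
# EM-style imputation for a multivariate Gaussian; a sketch of the EM idea
# only, not DiceML's algorithm.
import numpy as np

def em_gaussian_impute(X, n_iter=50):
    """Fill NaNs by iterating conditional-mean imputation and re-estimation."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])   # crude initialization
    for _ in range(n_iter):
        mu = X.mean(axis=0)                  # M-step (simplified: the exact
        cov = np.cov(X, rowvar=False)        # M-step would also add the
        cov += 1e-6 * np.eye(X.shape[1])     # conditional covariances)
        for i in range(X.shape[0]):          # E-step: conditional means
            m, o = miss[i], ~miss[i]
            if m.any() and o.any():
                coef = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
                X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return X

# Usage on toy correlated data with 15% of cells removed at random.
rng = np.random.default_rng(0)
Y = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
Y[rng.random(Y.shape) < 0.15] = np.nan
print(np.isnan(em_gaussian_impute(Y)).sum())          # 0 after imputation
```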


2021
Author(s):
Justin Andrews
Sheldon Gorell

Abstract Missing values and incomplete observations can exist in just about every type of recorded data. With analytical modeling, and machine learning in particular, the quantity and quality of available data are paramount to acquiring reliable results. Within the oil industry alone, priorities over which data matter vary from company to company, so the available knowledge of a single field varies from place to place. Because machine learning requires very complete sets of data, whole portions of data may have to be discarded in order to create an appropriate dataset. Value imputation has emerged as a valuable solution for cleaning up datasets, and as technology has advanced, new generative machine learning methods have been used to produce images and data that are all but indistinguishable from reality. Using an adaptation of the standard Generative Adversarial Network (GAN) approach known as a Generative Adversarial Imputation Network (GAIN), this paper evaluates this method alongside other imputation methods for filling in missing values. Starting from a gathered, fully observed set of data, smaller datasets with randomly masked missing values were generated to validate the effectiveness of the various imputation methods, allowing comparisons to be made against the original dataset. The study found that, across various percentages of missing data within the sets, the filled-in data could be used with surprising accuracy for further analytics. This paper compares GAIN and several commonly used imputation methods against more standard practices, such as data cropping or filling in with average values. GAIN and the other imputation methods described are quantified by their ability to fill in data. The study also discusses how the GAIN model can quickly provide the data necessary for analytical studies and the prediction of results for future projects.
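The validation protocol described above is easy to reproduce in miniature: start from a fully observed matrix, randomly mask a fraction of cells, impute, and score the fill-ins against the held-back truth. In the hedged sketch below, ordinary baselines (column means and scikit-learn's iterative imputer) stand in for GAIN, which would require a full GAN training loop; the data and names are ours.

```python
# Mask-and-score evaluation of imputers against held-back ground truth;
# the imputers are stand-ins for GAIN, and the data is simulated.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(1)
X_true = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated cols

def mask_and_score(imputer, miss_frac):
    X = X_true.copy()
    holes = rng.random(X.shape) < miss_frac     # random MCAR mask
    X[holes] = np.nan
    X_hat = imputer.fit_transform(X)
    return np.sqrt(np.mean((X_hat[holes] - X_true[holes]) ** 2))  # RMSE

for frac in (0.1, 0.3):
    print(frac,
          mask_and_score(SimpleImputer(strategy="mean"), frac),
          mask_and_score(IterativeImputer(random_state=0), frac))
```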


Marketing ZFP
2019
Vol 41 (4)
pp. 21-32
Author(s):
Dirk Temme
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to available software, applying these modern missing-data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that lead to missing values in an empirical study. This article, Part 1 of two, first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, it presents a selection of visualization tools, available in different R packages, for describing and exploring missing-data structures.
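The article surveys R packages for this purpose; as a language-agnostic illustration, the same basic missingness-matrix view can be sketched in a few lines of Python. The toy data frame and every name below are ours, not the article's.

```python
# A minimal missingness-matrix plot, assuming pandas/matplotlib and toy data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(40, 4)), columns=list("ABCD"))
df[df < -1.0] = np.nan                     # inject some missing values

plt.imshow(df.isna().to_numpy(), aspect="auto", interpolation="none",
           cmap="gray_r")
plt.xticks(range(df.shape[1]), df.columns)
plt.xlabel("variable"); plt.ylabel("observation")
plt.title("Missingness matrix (dark = missing)")
plt.show()
```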


2021
Vol 15 (8)
pp. 898-911
Author(s):
Yongqing Zhang
Jianrong Yan
Siyu Chen
Meiqin Gong
Dongrui Gao
...

Rapid advances in biological research over recent years have significantly enriched biological and medical data resources. Deep learning-based techniques have been successfully utilized to process data in this field, and they have exhibited state-of-the-art performance even on high-dimensional, unstructured, and black-box biological data. The aim of the current study is to provide an overview of deep learning-based techniques used in biology and medicine and their state-of-the-art applications. In particular, we introduce the fundamentals of deep learning and then review the success of applying such methods to bioinformatics, biomedical imaging, biomedicine, and drug discovery. We also discuss the challenges and limitations of this field and outline possible directions for further research.


Author(s):
Ahmad R. Alsaber
Jiazhu Pan
Adeeba Al-Hurban

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set, using a multiple imputation (MI) approach. Missingness under the MCAR, MAR, and NMAR mechanisms is applied to the data set, and five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was carried out for all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods considered and can thus be considered appropriate for analyzing air quality data.
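missForest itself is an R package; as a hedged analogue, scikit-learn's IterativeImputer with a random-forest estimator reproduces the same iterative random-forest imputation scheme. The pollutant names below come from the abstract, but the data is simulated and all settings are illustrative.

```python
# An iterative random-forest imputation analogue of missForest; simulated
# pollutant data, not the study's Kuwait measurements.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
base = rng.lognormal(size=(300, 1))
air = pd.DataFrame(
    np.hstack([base * f + rng.lognormal(sigma=0.2, size=(300, 1))
               for f in (1.0, 0.8, 2.5, 0.6, 1.4)]),
    columns=["NO2", "CO", "PM10", "SO2", "O3"],
)
air = np.log(air)                              # log-transform, as in the study
air = air.mask(rng.random(air.shape) < 0.2)    # roughly the reported rates

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(air), columns=air.columns)
print(completed.isna().sum())                  # all zero after imputation
```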


Author(s):
Maria Lucia Parrella
Giuseppina Albano
Cira Perna
Michele La Rocca

Abstract Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection, and spatio-temporal relationships. To take into account the uncertainty in the point forecast, prediction intervals may be of interest. In particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain missing paths of the time component in a spatio-temporal framework, with global probability $1-\alpha$. In many applications, requiring coverage of the whole missing sample path might appear too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that contain all but at most a small number k of the elements of the missing paths with probability $1-\alpha$. A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction and to compare it with alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios.
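A rough numpy sketch of the joint-region construction: bootstrap many future paths of length h, then widen the per-step bands until the whole path falls inside with probability about $1-\alpha$, using the bootstrap distribution of the maximum standardized deviation. This is a generic residual bootstrap for an AR(1) under our own assumptions, not the authors' spatio-temporal scheme.

```python
# Residual-bootstrap joint prediction region for an AR(1); a generic
# sketch of the joint-coverage idea only.
import numpy as np

rng = np.random.default_rng(4)
alpha, h, B = 0.10, 5, 2000

# simulate and fit an AR(1) by least squares
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal()
phi = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
resid = y[1:] - phi * y[:-1]

# bootstrap h-step-ahead paths from the last observed point
paths = np.empty((B, h))
for b in range(B):
    last = y[-1]
    for s in range(h):
        last = phi * last + rng.choice(resid)
        paths[b, s] = last

center, scale = paths.mean(axis=0), paths.std(axis=0)
# joint region: quantile of the max standardized deviation over the path
dev_max = np.abs((paths - center) / scale).max(axis=1)
c_joint = np.quantile(dev_max, 1 - alpha)
lower, upper = center - c_joint * scale, center + c_joint * scale
print(np.column_stack([lower, upper]))
```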


Agriculture
2021
Vol 11 (8)
pp. 727
Author(s):
Yingpeng Fu
Hongjian Liao
Longlong Lv

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean imputation and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) of the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing-value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in bulk density (BD) using RF, SVR, ANN, mean, and MI, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better suited for imputing the missing data in UNSODA.
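The quantitative comparison the study performs can be imitated on toy data: mask part of a complete table, impute it with each of the five method families named above, and compare RMSE and MAE on the masked cells. The sketch below uses scikit-learn stand-ins under our own illustrative settings; UNSODA itself is not loaded.

```python
# Five-way imputer comparison on a masked toy table; scikit-learn stand-ins
# for the study's RF, SVR, ANN, mean, and MI methods.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(5)
X_true = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
mask = rng.random(X_true.shape) < 0.2
X_miss = np.where(mask, np.nan, X_true)

imputers = {
    "RF":   IterativeImputer(estimator=RandomForestRegressor(
                n_estimators=50, random_state=0)),
    "SVR":  IterativeImputer(estimator=SVR()),
    "ANN":  IterativeImputer(estimator=MLPRegressor(
                max_iter=2000, random_state=0)),
    "mean": SimpleImputer(strategy="mean"),
    "MI":   IterativeImputer(sample_posterior=True, random_state=0),
}
for name, imp in imputers.items():
    err = imp.fit_transform(X_miss)[mask] - X_true[mask]
    print(f"{name:4s} RMSE={np.sqrt(np.mean(err**2)):.3f} "
          f"MAE={np.mean(np.abs(err)):.3f}")
```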


2021
Vol 11 (1)
Author(s):
Nishith Kumar
Md. Aminul Hoque
Masahiro Sugimoto

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which originate for several reasons, including technical and biological sources. Although several missing-data imputation techniques are described in the literature, conventional techniques solve only the missing-value problem; they do not relieve the problem of outliers, and outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that resolves the problems of both missing values and outliers. We evaluated the performance of the proposed method against conventional and recently developed imputation techniques, using both artificially generated data and experimentally measured data, in the absence and presence of different rates of outliers. Performance on both artificial and real metabolomics data indicates the superiority of the proposed kernel weight-based technique over the existing alternatives. For user convenience, an R package implementing the technique was developed and is available at https://github.com/NishithPaul/tWLSA.
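The exact kernel weight function lives in the authors' tWLSA package; the sketch below only conveys the flavor under our own assumptions: impute a missing cell as a kernel-weighted average over donor rows, so that rows far from the target row, including outlying rows, receive exponentially small weight. Bandwidth, fallback rule, and all names are illustrative.

```python
# A loose, illustrative kernel-weighted imputation; not the authors' tWLSA
# method, just the downweight-distant-donors idea.
import numpy as np

def kernel_impute(X, bandwidth=1.0):
    """Impute each missing cell as a Gaussian-kernel-weighted donor average."""
    out = X.copy()
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)  # crude start for distances
    for i, j in zip(*np.where(miss)):
        donors = np.flatnonzero(~miss[:, j])           # rows observing column j
        d = np.linalg.norm(filled[donors] - filled[i], axis=1)
        w = np.exp(-(d / bandwidth) ** 2)              # outlying rows get ~0 weight
        if w.sum() < 1e-12:                            # all donors too far:
            out[i, j] = X[donors[np.argmin(d)], j]     # fall back to nearest
        else:
            out[i, j] = w @ X[donors, j] / w.sum()
    return out

# Usage: toy data with one outlying row and 10% of cells removed.
rng = np.random.default_rng(6)
Z = rng.normal(size=(100, 3))
Z[5] += 20.0                                           # an outlying row
Z[rng.random(Z.shape) < 0.1] = np.nan
print(np.isnan(kernel_impute(Z)).sum())                # 0 after imputation
```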

