Learning Distributional Programs for Relational Autocompletion

Author(s):  
NITESH KUMAR ◽  
ONDŘEJ KUŽELKA ◽  
LUC DE RAEDT

Abstract Relational autocompletion is the problem of automatically filling out some missing values in multi-relational data. We tackle this problem within the probabilistic logic programming framework of Distributional Clauses (DCs), which supports both discrete and continuous probability distributions. Within this framework, we introduce DiceML – an approach to learn both the structure and the parameters of DC programs from relational data (with possibly missing data). To realize this, DiceML integrates statistical modeling and DCs with rule learning. The distinguishing features of DiceML are that it (1) tackles autocompletion in relational data, (2) learns DCs extended with statistical models, (3) deals with both discrete and continuous distributions, (4) can exploit background knowledge, and (5) uses an expectation–maximization-based (EM) algorithm to cope with missing data. The empirical results show the promise of the approach, even when there is missing data.

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Abstract Background Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. The strategies in the literature rely either on removing the samples with missing values, as in complete case analysis (CCA), or on imputing the missing information, as with predictive mean matching (PMM) in MICE. Limitations of these strategies include information loss and imputed values that may not be close to the true missing values. Further, in scenarios with piecemeal medical data, these strategies have to wait for the data collection process to finish before a complete dataset is available for statistical modeling. Method and results This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to build the statistical models. It segments the original dataset into small complete datasets using hierarchical clustering and then fits a Bayesian regression on each of them. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated on both simulated data and real studies and shows results that are better than or on par with other approaches such as CCA and PMM. Conclusion The DMU approach provides an alternative to the existing information-elimination and imputation approaches for processing datasets with missing values. While the study applies the approach to continuous cross-sectional data, it can also be applied to longitudinal, categorical, and time-to-event biological data.
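A minimal sketch of the core DMU idea, under assumptions of my own: rows are grouped by their missingness pattern into small complete sub-datasets, a Bayesian regression (scikit-learn's BayesianRidge) is fit on each, and coefficient estimates are pooled. The helper names (make_subsets, pool_estimates) and the size-weighted pooling rule are illustrative, not the authors' exact updating scheme.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge

def make_subsets(df, target):
    """Split rows by missingness pattern; each subset is complete in its own columns."""
    patterns = df.drop(columns=[target]).notna().apply(tuple, axis=1)
    for _, rows in df.groupby(patterns):
        cols = [c for c in df.columns if rows[c].notna().all()]
        if target in cols and len(cols) > 1:
            yield rows[cols]

def pool_estimates(df, target):
    """Fit a Bayesian regression per subset and pool coefficients by crude precision weights."""
    pooled = {}
    for subset in make_subsets(df, target):
        X = subset.drop(columns=[target])        # predictors assumed numeric
        y = subset[target]
        model = BayesianRidge().fit(X, y)
        for name, coef in zip(X.columns, model.coef_):
            w = len(subset)                      # subset size as a precision proxy
            num, den = pooled.get(name, (0.0, 0.0))
            pooled[name] = (num + w * coef, den + w)
    return {k: num / den for k, (num, den) in pooled.items()}
```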


2010 ◽  
Vol 10 (4-6) ◽  
pp. 433-447 ◽  
Author(s):  
JON SNEYERS ◽  
WANNES MEERT ◽  
JOOST VENNEKENS ◽  
YOSHITAKA KAMEYA ◽  
TAISUKE SATO

Abstract PRISM is an extension of Prolog with probabilistic predicates and built-in support for expectation-maximization learning. Constraint Handling Rules (CHR) is a high-level programming language based on multi-headed multiset rewrite rules. In this paper, we introduce a new probabilistic logic formalism, called CHRiSM, based on a combination of CHR and PRISM. It can be used for high-level rapid prototyping of complex statistical models by means of “chance rules”. The underlying PRISM system can then be used for several probabilistic inference tasks, including probability computation and parameter learning. We define the CHRiSM language in terms of syntax and operational semantics, and illustrate it with examples. We define the notion of ambiguous programs and define a distribution semantics for unambiguous programs. Next, we describe an implementation of CHRiSM, based on CHR(PRISM). We discuss the relation between CHRiSM and other probabilistic logic programming languages, in particular PCHR. Finally, we identify potential application domains.


2018 ◽  
Author(s):  
Xuhua Xia

Abstract Missing data are frequently encountered in molecular phylogenetics and need to be imputed. For a distance matrix with missing distances, the least-squares approach is often used for imputing the missing values. Here I develop a method, similar to the expectation-maximization algorithm, to impute multiple missing distances in a distance matrix. I show that, for inferring the best tree and missing distances, the minimum evolution criterion is not as desirable as the least-squares criterion. I also discuss cases where the missing values cannot be uniquely determined, e.g., when a missing distance involves two sister taxa. The new method has the advantage over the existing one in that it does not assume a molecular clock. I have implemented the function in DAMBE software, which is freely available at http://dambe.bio.uottawa.ca.
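A toy sketch of the general fill / fit-tree / re-estimate loop behind such EM-like distance imputation. Here scipy's average-linkage clustering stands in for the tree-fitting step, which implicitly assumes a clock; the paper's actual method fits trees by least squares without that assumption, so this is only a structural analogy, not the author's algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def impute_distances(D, missing, n_iter=20):
    """D: square distance matrix with np.nan at `missing` (list of (i, j) pairs)."""
    D = D.copy()
    observed_mean = np.nanmean(D[np.triu_indices_from(D, k=1)])
    for i, j in missing:                        # crude initialization
        D[i, j] = D[j, i] = observed_mean
    for _ in range(n_iter):
        condensed = squareform(D, checks=False)
        tree = linkage(condensed, method="average")   # stand-in for least-squares tree fit
        patristic = squareform(cophenet(tree))        # tree-implied pairwise distances
        for i, j in missing:                          # re-estimate the missing entries
            D[i, j] = D[j, i] = patristic[i, j]
    return D
```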


Author(s):  
Yuzhe Liu ◽  
Vanathi Gopalakrishnan

Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
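For two of the four methods mentioned above (mean imputation and k-nearest neighbors), a minimal scikit-learn sketch might look as follows; the file name and column handling are placeholders, and the Bayesian Rule Learning step is not shown.

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.read_csv("cardiomyopathy_features.csv")   # hypothetical dataset path

mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df),
    columns=df.columns,
)
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(df),
    columns=df.columns,
)
```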


Author(s):  
Mehmet S. Aktaş ◽  
Sinan Kaplan ◽  
Hasan Abacı ◽  
Oya Kalipsiz ◽  
Utku Ketenci ◽  
...  

Missing data are a common problem affecting data clustering quality. Most real-life datasets have missing data, which in turn affects clustering tasks. This chapter investigates appropriate data treatment methods for varying missing data scarcity distributions, including the gamma, Gaussian, and beta distributions. The analyzed data imputation methods include mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods to deal with missing data, data mining tasks such as clustering are used for evaluation. Through the experimental studies, this chapter identifies the correlation between missing data imputation methods and missing data distributions for clustering tasks. The results of the experiments indicate that the expectation maximization and k-nearest neighbor methods provide the best results across the varying missing data scarcity distributions.
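A rough sketch of an evaluation protocol in this spirit: degrade a complete dataset with random missingness, impute it with different methods, cluster the result, and compare the cluster assignments with those obtained from the complete data. The use of k-means and the adjusted Rand index here is an assumption; the chapter's exact setup may differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = load_iris().data
rng = np.random.default_rng(0)
X_missing = X.copy()
mask = rng.random(X.shape) < 0.2          # 20% of values missing completely at random
X_missing[mask] = np.nan

reference = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_missing)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_imp)
    print(name, adjusted_rand_score(reference, labels))
```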


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to readily available software, using these modern missing data methods no longer poses a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. This article is Part 1: it first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Second, it presents a selection of visualization tools available in different R packages for describing and exploring missing data structures.
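A small simulation contrasting two of Rubin's mechanisms can make the distinction concrete: under MCAR the probability that a value is missing is constant, while under MAR it depends on another observed variable. The variable names and the logistic form of the MAR mechanism below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
income = rng.normal(50, 10, n)        # variable that may go missing
age = rng.normal(40, 12, n)           # fully observed covariate

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.30

# MAR: older respondents are more likely to skip the income question
p_mar = 1 / (1 + np.exp(-(age - 40) / 5))
mar_mask = rng.random(n) < p_mar

print("MCAR: mean age of rows with missing income", age[mcar_mask].mean())
print("MAR:  mean age of rows with missing income", age[mar_mask].mean())
```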


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. Missing data under the MCAR, MAR, and NMAR mechanisms are applied to the data set at five missingness levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods considered and can thus be regarded as appropriate for analyzing air quality data.
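missForest itself is an R package; a rough Python analogue of the same iterative random-forest idea is scikit-learn's IterativeImputer with a random-forest estimator, sketched below. The file name and column list are placeholders.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("kuwait_air_quality_daily.csv")   # hypothetical aggregated file
cols = ["NO2", "CO", "PM10", "SO2", "O3", "temp", "rh", "wind_dir", "wind_speed"]

# Iteratively regress each variable with missing values on the others,
# using a random forest, until the imputations stabilize.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
df[cols] = imputer.fit_transform(df[cols])
```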


Author(s):  
Maria Lucia Parrella ◽  
Giuseppina Albano ◽  
Cira Perna ◽  
Michele La Rocca

Abstract Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection and spatio-temporal relationships. To take into account the uncertainty in the point forecast, some prediction intervals may be of interest. In particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain missing paths of a time component in a spatio-temporal framework, with global probability $$1-\alpha$$. In many applications, considering the coverage of the whole missing sample-path might appear too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that only contain all elements of missing paths up to a small number k of them with probability $$1-\alpha$$. A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction and to compare it with some alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios.
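A simplified, generic sketch of a joint prediction region built from bootstrap replicates of a missing sub-path: replicates are standardized pointwise, and the (1 − α) quantile of the maximum absolute standardized deviation over the path calibrates a band with approximate joint coverage. This max-statistic calibration is a standard construction, not the authors' specific spatio-temporal resampling scheme.

```python
import numpy as np

def joint_prediction_region(boot_paths, alpha=0.05):
    """boot_paths: (B, T) bootstrap replicates of the missing length-T path."""
    center = boot_paths.mean(axis=0)
    scale = boot_paths.std(axis=0, ddof=1) + 1e-12
    # maximum standardized deviation of each replicate over the whole path
    max_dev = np.abs((boot_paths - center) / scale).max(axis=1)
    c = np.quantile(max_dev, 1 - alpha)
    return center - c * scale, center + c * scale   # lower and upper envelopes

# usage: B = 1000 bootstrap paths of length T = 12
paths = np.random.default_rng(0).normal(size=(1000, 12)).cumsum(axis=1)
lower, upper = joint_prediction_region(paths, alpha=0.05)
```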


Agriculture ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 727
Author(s):  
Yingpeng Fu ◽  
Hongjian Liao ◽  
Longlong Lv

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean imputation and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) of the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased significantly after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in bulk density (BD) using RF, SVR, ANN, mean, and MI, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better for imputing the missing data in UNSODA.
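A minimal sketch of a missing-value simulation of the kind described above: known values are masked at random, imputed back, and compared with the originals via RMSE. Iterative random-forest imputation and mean imputation stand in for two of the five methods; the UNSODA file and column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("unsoda_properties.csv")[["pH", "SHC", "OMC", "PO", "PD"]].dropna()
rng = np.random.default_rng(0)
mask = rng.random(df.shape) < 0.10            # hide 10% of the known values
masked = df.mask(mask)

for name, imp in [
    ("RF", IterativeImputer(estimator=RandomForestRegressor(n_estimators=50,
                                                            random_state=0))),
    ("mean", SimpleImputer(strategy="mean")),
]:
    filled = pd.DataFrame(imp.fit_transform(masked), columns=df.columns)
    rmse = np.sqrt(((filled.values - df.values)[mask] ** 2).mean())
    print(name, round(rmse, 3))
```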


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers originating from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, the existing conventional techniques only address the missing value problem; they do not mitigate the effect of outliers. Outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that addresses both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed missing data imputation techniques using both artificially generated data and experimentally measured data, in both the absence and presence of different rates of outliers. Performance on both artificial data and real metabolomics data indicates the superiority of our proposed kernel weight-based missing data imputation technique over the existing alternatives. For user convenience, an R package implementing the proposed kernel weight-based missing value imputation technique was developed and is available at https://github.com/NishithPaul/tWLSA.
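As a generic illustration of kernel-weighted imputation (not the tWLSA algorithm from the linked package), a missing cell can be filled with a weighted average over the most similar complete rows, with Gaussian kernel weights shrinking the influence of distant, potentially outlying neighbours:

```python
import numpy as np

def kernel_weighted_impute(X, bandwidth=1.0, k=10):
    """X: 2-D array with np.nan for missing cells; returns an imputed copy.
    Illustrative sketch only; not the tWLSA method."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]          # rows with no missing cells
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.linalg.norm(complete[:, obs] - row[obs], axis=1)
        idx = np.argsort(d)[:k]                      # k most similar complete rows
        w = np.exp(-(d[idx] / bandwidth) ** 2)       # Gaussian kernel weights
        X[i, miss] = (w[:, None] * complete[idx][:, miss]).sum(axis=0) / w.sum()
    return X
```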

