Missing Values and Optimal Selection of an Imputation Method and Classification Algorithm to Improve the Accuracy of Ubiquitous Computing Applications

2015, Vol 2015, pp. 1-14
Author(s): Jaemun Sim, Jonathan Sangyun Lee, Ohbyung Kwon

In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. However, in the real world, user-related data from information systems are often missing due to users’ concerns about privacy or lack of obligation to provide complete data. This data incompleteness can impair the accuracy of data analysis using classification algorithms, which can degrade the value of the data. Many studies have attempted to overcome these data incompleteness issues and to improve the quality of data analysis using classification algorithms. The performance of classification algorithms may be affected by the characteristics and patterns of the missing data, such as the ratio of missing data to complete data. We perform a concrete causal analysis of differences in performance of classification algorithms based on various factors. The characteristics of missing values, datasets, and imputation methods are examined. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.
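The kind of factorial comparison described above can be prototyped compactly. The sketch below (not the authors' code) crosses a few imputation methods with a few classifiers at a fixed missing-data ratio using scikit-learn; the stand-in dataset, the particular imputers and classifiers, and the 20% missingness ratio are all illustrative assumptions.

```python
# Illustrative sketch: compare imputation methods x classifiers under a given missing-data ratio.
# Assumes a generic numeric dataset; not the authors' original experimental code.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inject missing values completely at random at a chosen ratio (here 20%).
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.20] = np.nan

imputers = {"mean": SimpleImputer(strategy="mean"),
            "median": SimpleImputer(strategy="median"),
            "knn": KNNImputer(n_neighbors=5)}
classifiers = {"logreg": LogisticRegression(max_iter=5000),
               "rf": RandomForestClassifier(random_state=0)}

for imp_name, imp in imputers.items():
    for clf_name, clf in classifiers.items():
        pipe = make_pipeline(imp, StandardScaler(), clf)
        score = cross_val_score(pipe, X_miss, y, cv=5).mean()
        print(f"{imp_name:>6} + {clf_name:<6}: accuracy = {score:.3f}")
```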

Author(s): Thelma Dede Baddoo, Zhijia Li, Samuel Nii Odai, Kenneth Rodolphe Chabi Boni, Isaac Kwesi Nooni, ...

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute missing data in real-world datasets in order to ascertain the accuracy of imputation algorithms for such data are lacking. This study investigated the necessary complexity of missing data reconstruction schemes to obtain relevant results for a real-world single-station streamflow record and so facilitate its further use. The investigation applied different missing data approaches, spanning from univariate algorithms to multiple imputation methods suited to multivariate data that take time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but those that provide the best results are usually time- and computation-intensive. Multiple imputation algorithms that consider the surrounding observed values and/or can capture the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational cost when time is taken as an explicit variable. Furthermore, the LEM is especially useful when the missing data lie in specific portions of the dataset or where very large gaps of missingness occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on careful imputation and extensive study of the particular dataset to be imputed.
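As a rough illustration of the univariate-versus-multivariate distinction drawn above, the sketch below interpolates a single synthetic streamflow series across one large gap and compares it with an iterative multivariate imputer that takes time as an explicit covariate, scoring both only over the gap (a localized error in the spirit of the LEM). The synthetic series, the gap position, and the RMSE-over-gap metric are assumptions, not the study's data or its exact TEM/LEM definitions.

```python
# Illustrative sketch: univariate interpolation vs. a multivariate imputer with time as an
# explicit variable, scored only over the artificially introduced gap (a "localized" error).
# Synthetic data; not the study's streamflow series or its exact TEM/LEM definitions.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
t = np.arange(1000)
flow = 50 + 20 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, t.size)  # synthetic streamflow
series = pd.Series(flow, index=t)

truth = series.copy()
series.iloc[400:450] = np.nan  # one large contiguous gap

# Univariate: simple linear interpolation of the single series.
univariate = series.interpolate(method="linear")

# Multivariate: iterative imputation with time taken as an explicit covariate.
frame = pd.DataFrame({"time": t, "flow": series.values})
multivariate = IterativeImputer(random_state=0).fit_transform(frame)[:, 1]

gap = series.isna().values
for name, est in [("univariate", univariate.values), ("multivariate", multivariate)]:
    rmse_gap = np.sqrt(np.mean((est[gap] - truth.values[gap]) ** 2))
    print(f"{name:>12}: localized RMSE over the gap = {rmse_gap:.2f}")
```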


2022, Vol 16 (4), pp. 1-24
Author(s): Kui Yu, Yajing Yang, Wei Ding

Causal feature selection aims at learning the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable implies the local causal structure among the class variable and its MB, and all other features are probabilistically independent of the class variable conditioned on its MB; this enables causal feature selection to identify potential causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remain an open research problem in causal feature selection due to their technical complexity. In this article, we discuss a novel multiple imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework so that the two key components can reinforce each other: MB Learning enables Data Imputation in a potentially causal feature space for accurate data imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. We further design an enhanced kNN estimator for imputing missing values and use it to instantiate MimMB. In our comprehensive experimental evaluation on synthetic and real-world datasets, the new approach effectively learns the MB of a given variable in a Bayesian network and outperforms rival algorithms.
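The mutual reinforcement between imputation and Markov blanket learning can be caricatured with off-the-shelf components: alternate a kNN imputation step with a feature-selection step so that later imputation rounds operate only on the currently selected features. This is a loose sketch of the idea under stated assumptions (synthetic data, mutual-information selection standing in for MB learning), not an implementation of MimMB.

```python
# Loose sketch of coupling imputation with feature selection: alternate kNN imputation with a
# selection step so that each later imputation round works only on the surviving features.
# Mutual-information selection stands in for proper MB learning; this is not MimMB itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import KNNImputer

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan        # 10% of entries missing

selected = np.arange(X.shape[1])              # start with every feature
for k in (10, 7, 5):                          # progressively narrow the feature set
    X_imp = KNNImputer(n_neighbors=5).fit_transform(X[:, selected])
    keep = SelectKBest(mutual_info_classif, k=k).fit(X_imp, y).get_support()
    selected = selected[keep]

print("surviving feature indices:", selected)
```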


Author(s): Juheng Zhang, Xiaoping Liu, Xiao-Bai Li

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values for strategically missing data based on support vector regression (SVR) models, and it provides incentives for the data providers to disclose their true information. We show that, with the proposed method, imputation errors for the missing values are minimized under some reasonable conditions. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.
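A hedged sketch of SVR-based imputation in this spirit is shown below: fit an SVR on records where the sensitive column is disclosed and predict it for records where it is withheld. The dataset, the choice of concealed column, and the kernel settings are illustrative; the paper's incentive-compatibility mechanism is not reproduced.

```python
# Illustrative sketch: impute a strategically withheld feature with support vector regression,
# predicting the missing column from the fully observed ones. The paper's incentive analysis
# and its exact imputation rule are not reproduced here.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, _ = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)

# Pretend providers conceal column 2 ("bmi") in 30% of the records.
target_col = 2
mask = rng.random(X.shape[0]) < 0.30
observed_cols = [c for c in range(X.shape[1]) if c != target_col]

# Fit SVR on records where the column is disclosed, then impute the concealed ones.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.01))
svr.fit(X[~mask][:, observed_cols], X[~mask][:, target_col])
imputed = svr.predict(X[mask][:, observed_cols])

rmse = np.sqrt(np.mean((imputed - X[mask][:, target_col]) ** 2))
print(f"SVR imputation RMSE on the concealed column: {rmse:.4f}")
```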


2016, Vol 25 (3), pp. 431-440
Author(s): Archana Purwar, Sandeep Kumar Singh

The quality of data is an important issue in data mining. The validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MVs, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel et al. groups objects according to their density in spatial databases: the high-density regions are known as clusters, and the low-density regions contain the noise objects in the data set. Experiments have been performed on the Iris data set from the life science domain and Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using the root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that our method is more noise-resistant than KMI on the data sets under study.
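The gist of density-based imputation can be sketched with scikit-learn's DBSCAN on Iris: cluster the complete records, assign each incomplete record to the nearest non-noise cluster using its observed features, and fill the gap with that cluster's mean. The eps/min_samples values and the nearest-centroid assignment rule are illustrative assumptions, not the paper's DBSCANI procedure.

```python
# Rough sketch of density-based imputation on Iris: cluster the complete records with DBSCAN,
# assign each incomplete record to the nearest non-noise cluster via its observed features,
# and fill the gap with that cluster's mean. Parameters are illustrative, not DBSCANI's.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X, _ = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_miss = X.copy()
rows = rng.choice(X.shape[0], size=15, replace=False)
X_miss[rows, 3] = np.nan                      # knock out petal width in a few rows

complete = X_miss[~np.isnan(X_miss).any(axis=1)]
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(complete)

obs_cols = [0, 1, 2]                          # features still observed in the incomplete rows
for r in rows:
    best, best_dist = None, np.inf
    for lab in set(labels) - {-1}:            # skip DBSCAN noise points (label -1)
        centre = complete[labels == lab].mean(axis=0)
        d = np.linalg.norm(X_miss[r, obs_cols] - centre[obs_cols])
        if d < best_dist:
            best, best_dist = lab, d
    X_miss[r, 3] = complete[labels == best][:, 3].mean()

rmse = np.sqrt(np.mean((X_miss[rows, 3] - X[rows, 3]) ** 2))
print(f"density-based imputation RMSE: {rmse:.3f}")
```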


2014, Vol 39 (2), pp. 107-127
Author(s): Artur Matyja, Krzysztof Siminski

Missing values are not uncommon in real data sets. The algorithms and methods used for analysing complete data sets cannot always be applied to data with missing values. In order to use the existing methods for complete data, data sets with missing values are preprocessed. The other solution to this problem is the creation of new algorithms dedicated to data sets with missing values. The objective of our research is to compare preprocessing techniques with such specialised algorithms and to find their most advantageous usage.
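The two options named above, preprocessing a data set so that standard methods apply versus using an algorithm that tolerates missing values directly, can be contrasted in a few lines; the dataset, the 15% missingness and the particular models below are illustrative choices, not the paper's experimental setup.

```python
# Illustrative contrast of the two strategies: preprocessing (imputation) before a standard
# learner vs. a learner that accepts missing values natively.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.15] = np.nan

preprocessed = make_pipeline(SimpleImputer(strategy="median"),
                             DecisionTreeClassifier(random_state=0))
native = HistGradientBoostingClassifier(random_state=0)   # handles NaN inputs directly

for name, model in [("impute + tree", preprocessed), ("native NaN handling", native)]:
    print(f"{name:>20}: accuracy = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```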


Author(s): Hatice Uenal, David Hampel

Registries are indispensable in medical studies and provide the basis for reliable study results for research questions. Depending on the purpose of use, high data quality is a prerequisite. However, as registry quality increases, costs also increase accordingly. Considering these time and cost factors, this work attempts to estimate the cost advantages of applying statistical tools to existing registry data, including quality evaluation. The quality analysis showed unquestionable savings of millions in study costs from reducing the time horizon, with average savings of €523,126 for every year saved. By additionally replacing the more than 25% missing data in some variables, data quality was immensely improved. To conclude, our findings clearly show the importance of data quality and statistical input in avoiding biased conclusions due to incomplete data.


2020, Vol 26 (7), pp. 827-853
Author(s): Simon Vrhovec, Damjan Fujs, Luka Jelovčan, Anže Mihelič

A growing number of scientific papers reporting on case studies and action research are published each year. Consequently, evaluating the quality of the piling-up research reports is becoming increasingly challenging. Several approaches for evaluating the quality of scientific outputs exist; however, they appear to be fairly time-consuming and/or adapted to other research designs. In this paper, we propose a reasonably lightweight structure-based approach for evaluating case study and action research reports (SAE-CSAR) based on eight key parts of a real-world research report: research question, case description, data collection, data analysis, ethical considerations, results, discussion and limitations. To evaluate the feasibility of the proposed approach, we conducted a systematic literature survey of papers reporting on real-world cybersecurity research. A total of N = 102 research papers were evaluated. Results suggest that SAE-CSAR is useful and relatively efficient, and may offer thought-provoking insight into the studied field. Although there is a positive trend toward the inclusion of data collection, data analysis and research questions in papers, there is still room for improvement, suggesting that the field of real-world cybersecurity research has not yet matured. The presence of a discussion appears to have the strongest effect on a paper's citation count; however, there seems to be no uniform agreement on what a discussion should include. This paper explores this and other issues related to paper structure and provides guidance on how to improve the quality of research reports.
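For concreteness, the eight structural parts listed above could be encoded as a simple checklist object with an equal-weight completeness score, as in the hypothetical sketch below; the field names come from the abstract, while the class and scoring rule are assumptions rather than the SAE-CSAR instrument itself.

```python
# Minimal sketch of the eight-part checklist as a scoring structure. The eight parts come from
# the abstract; the equal-weight scoring rule is an assumption, not the SAE-CSAR scheme itself.
from dataclasses import dataclass, fields

@dataclass
class ReportChecklist:
    research_question: bool = False
    case_description: bool = False
    data_collection: bool = False
    data_analysis: bool = False
    ethical_considerations: bool = False
    results: bool = False
    discussion: bool = False
    limitations: bool = False

    def score(self) -> float:
        """Fraction of the eight structural parts present in the report."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

report = ReportChecklist(research_question=True, data_collection=True,
                         data_analysis=True, results=True, discussion=True)
print(f"structural completeness: {report.score():.2f}")  # 5 of 8 parts present
```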


Sensors, 2019, Vol 19 (19), pp. 4172
Author(s): Karel Dejmal, Petr Kolar, Josef Novotny, Alena Roubalova

An increasing number of individuals and institutions own or operate meteorological stations, but the resulting data are not yet commonly used in the Czech Republic. One of the main difficulties is the heterogeneity of measuring systems, which calls into question the quality of the resulting data. Only after thorough quality control of the recorded data is it possible to proceed with, for example, a survey of the variability of a chosen meteorological parameter in an urban or suburban region. The most commonly researched element in this environment is air temperature. In the first phase, this paper focuses on the quality of data provided by amateur and institutional stations; the subsequent analyses then work with the corrected time series. Given the nature of the analyzed data and their potential future use, it is appropriate to assess the suitability of temporal and possibly spatial interpolation of missing values. The evaluation of the seasonal variability of air temperature across the city of Brno and its surroundings in 2015–2017 demonstrates that enriching the network of standard (professional) stations with new stations may significantly refine or even revise the current state of knowledge, for example regarding the urban heat island phenomenon. A cluster analysis was applied in order to assess the impact of siting conditions (station environment, exposure, etc.) and to typologically classify the set of meteorological stations.
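The two analytical steps mentioned here, temporal interpolation of gaps in station temperature series and a typological clustering of the stations, might look roughly like the sketch below on synthetic data; the ten stations, the 5% missingness, the interpolation limit and the summary features fed to k-means are all assumptions, not the Brno network or the paper's method.

```python
# Rough sketch of the two steps on synthetic data: fill short gaps in hourly temperature series
# by temporal interpolation, then cluster stations on simple summary statistics.
# Not the Brno station network or the paper's actual procedure.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
hours = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
n = hours.size

stations = {}
for i in range(10):
    temp = (10 + 0.3 * i
            + 8 * np.sin(2 * np.pi * np.arange(n) / n)        # annual cycle
            + 4 * np.sin(2 * np.pi * np.arange(n) / 24)       # diurnal cycle
            + rng.normal(0, 1.5, n))
    temp[rng.random(n) < 0.05] = np.nan                       # ~5% missing readings
    stations[f"station_{i}"] = temp

frame = pd.DataFrame(stations, index=hours)
filled = frame.interpolate(method="time", limit=6)            # temporal interpolation, short gaps only

# Cluster stations on mean, standard deviation and mean diurnal range of the filled series.
hourly_mean = filled.groupby(filled.index.hour).mean()
summary = pd.DataFrame({"mean": filled.mean(),
                        "std": filled.std(),
                        "diurnal_range": hourly_mean.max() - hourly_mean.min()})
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(summary))
print(dict(zip(summary.index, labels)))
```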

