PERFORMANCE EVALUATION OF IMPUTATION METHODS FOR INCOMPLETE DATASETS

Author(s):  
SUMANTH YENDURI ◽  
S. S. IYENGAR

In this study, we compare the performance of four imputation strategies, ranging from the commonly used Listwise Deletion to model-based approaches such as Maximum Likelihood, in enhancing the completeness of incomplete software project data sets. We evaluate the impact of each method by applying it to six real-time software project data sets, which are classified into categories based on their inherent properties. The reliability of the data sets constructed with these techniques is further tested by building prediction models using stepwise regression. The experimental results are reported and the findings discussed.
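As a rough illustration of the simplest end of this range, the sketch below contrasts Listwise Deletion with column-mean imputation. Mean imputation is used here only as an easy-to-read stand-in and is not named in the abstract; the project records and function names are hypothetical, and the code assumes every column has at least one observed value.

```python
from statistics import mean


def listwise_deletion(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_imputation(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column contains at least one observed value.
    """
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical project records: (size in KLOC, team size, effort in person-months)
projects = [
    [10.0, 4, 20.0],
    [25.0, None, 60.0],
    [None, 6, 35.0],
    [40.0, 8, 90.0],
]

complete_cases = listwise_deletion(projects)  # drops the two incomplete rows
imputed = mean_imputation(projects)           # keeps all four rows
```

Listwise Deletion halves this toy data set, while imputation keeps every record at the cost of inserting estimated values, which is the trade-off a comparison of this kind examines.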

1994 ◽  
Vol 33 (04) ◽  
pp. 390-396 ◽  
Author(s):  
J. G. Stewart ◽  
W. G. Cole

Abstract: Metaphor graphics are data displays designed to look like corresponding variables in the real world, but in a non-literal sense of “look like”. Evaluation of the impact of these graphics on human problem solving has twice been carried out, but with conflicting results. The present experiment attempted to clarify the discrepancies between these findings by using a complex task in which expert subjects interpreted respiratory data. The metaphor graphic display led to interpretations twice as fast as a tabular (flowsheet) format, suggesting that the conflict between earlier studies is due either to differences in training or to differences in goodness of metaphor. Findings to date indicate that metaphor graphics work with complex as well as simple data sets, pattern detection as well as single number reporting tasks, and with expert as well as novice subjects.


2010 ◽  
Vol 09 (04) ◽  
pp. 547-573 ◽  
Author(s):  
JOSÉ BORGES ◽  
MARK LEVENE

The problem of predicting the next request during a user's navigation session has been extensively studied. In this context, higher-order Markov models have been widely used to model navigation sessions and to predict the next navigation step, while prediction accuracy has mainly been evaluated with the hit and miss score. We claim that this score, although useful, is not sufficient for evaluating next-link prediction models when the aim is to find a sufficient order of the model, to choose the size of a recommendation set, and to assess the impact of unexpected events on prediction accuracy. Herein, we make use of a variable-length Markov model to compare the usefulness of three alternatives to the hit and miss score: the Mean Absolute Error, the Ignorance Score, and the Brier score. We present an extensive evaluation of the methods on real data sets and a comprehensive comparison of the scoring methods.
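To make the contrast between these scores concrete, the sketch below computes the hit and miss score, the Brier score, and the Ignorance Score for a single next-link forecast. It uses the standard textbook definitions, which may differ in detail from the formulations in the paper; the link names and probabilities are hypothetical, and the Ignorance Score assumes the observed link was given a non-zero probability.

```python
import math


def hit_and_miss(probs, actual):
    """Return 1 if the most probable link is the one actually followed, else 0."""
    predicted = max(probs, key=probs.get)
    return 1 if predicted == actual else 0


def brier_score(probs, actual):
    """Sum of squared differences between the forecast and the 0/1 outcome vector."""
    return sum(
        (p - (1.0 if link == actual else 0.0)) ** 2 for link, p in probs.items()
    )


def ignorance_score(probs, actual):
    """Negative base-2 log of the probability assigned to the observed link.

    Assumes probs[actual] > 0; a zero probability would give an infinite score.
    """
    return -math.log2(probs[actual])


# Hypothetical forecast over the next link, and the link the user actually chose.
forecast = {"/home": 0.5, "/news": 0.25, "/about": 0.25}
followed = "/news"

hit = hit_and_miss(forecast, followed)
brier = brier_score(forecast, followed)
ignorance = ignorance_score(forecast, followed)
```

The hit and miss score only records that the top-ranked link was wrong, whereas the other two scores also reflect how much probability the model placed on the link that was followed.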


2018 ◽  
Vol 232 ◽  
pp. 03017
Author(s):  
Jie Zhang ◽  
Gang Wang ◽  
Haobo Jiang ◽  
Fangzheng Zhao ◽  
Guilin Tian

Software defect prediction has been an important part of software engineering research since the 1970s. The technique analyzes the metrics and defect information of historical software modules in order to predict defects in new software modules. Currently, most software defect prediction models are built and validated within a single software project: the training data sets used to construct the model and the test data sets used to validate it come from the same project. In practice, however, traditional prediction methods perform poorly for new projects or for projects with little historical data. When historical data is insufficient, the defect prediction model cannot be fully trained, and high prediction accuracy is difficult to achieve. In cross-project prediction, the main problem is the difference in data distribution between projects. To address these problems, this paper presents a software defect prediction model that combines migration learning (transfer learning) with a traditional software defect prediction model. The model uses existing project data sets to predict software defects across projects. The main work of this article includes: 1) Data preprocessing. This step includes feature correlation analysis, noise reduction, and so on, which effectively limits the effect of over-fitting and noisy data on the prediction results. 2) Migration learning. This step analyzes two different but related project data sets and reduces the impact of data distribution differences. 3) Artificial neural networks. To address class imbalance in the data set, an artificial neural network with dynamic selection of training samples is used to reduce the effect of the imbalance between positive and negative samples on the prediction results.
The Relink and AEEEM data sets are used to evaluate the model in terms of F-measure, the ROC curve, and AUC. Experiments show that the model has high predictive performance.
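For readers unfamiliar with the F-measure, the following minimal sketch computes precision, recall, and the balanced F-measure (F1) from confusion-matrix counts. The counts are hypothetical and are not taken from the Relink or AEEEM experiments; the code assumes at least one predicted and one actual defective module, so that no denominator is zero.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for the defective class.

    Assumes true_pos + false_pos > 0 and true_pos + false_neg > 0.
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 6 defective modules found, 2 false alarms, 2 defects missed.
precision, recall, f1 = f_measure(true_pos=6, false_pos=2, false_neg=2)
```

Because the F-measure ignores true negatives, it is less flattering than plain accuracy on class-imbalanced defect data, where most modules are defect-free.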


Author(s):  
Zhichao Zhao ◽  
Jinguo You ◽  
Guoyu Gan ◽  
Xiaowu Li ◽  
Jiaman Ding

Abstract: Airfare price prediction is one of the core facilities of the decision support system in civil aviation, and it draws on attributes such as departure time, number of days of advance purchase, and airline. The traditional airfare price prediction system is limited by the nonlinear interrelationship of multiple factors and fails to deal with the impact of different time steps, resulting in low prediction accuracy. To address these challenges, this paper proposes a novel civil airline fare prediction system with a Multi-Attribute Dual-stage Attention (MADA) mechanism integrating different types of data extracted from the same dimension. In this method, the Seq2Seq model is used to add attention mechanisms to both the encoder and the decoder. The encoder attention mechanism extracts multi-attribute data from time series, which are optimized and filtered by the temporal attention mechanism in the decoder to capture the complex time dependence of the ticket price sequence. Extensive experiments with actual civil aviation data sets were performed, and the results suggest that MADA outperforms airfare prediction models based on the Auto-Regressive Integrated Moving Average (ARIMA), random forest, or deep learning models on the MSE, RMSE, and MAE indicators. Across a large number of experiments, the predictions of the proposed MADA model on different routes are at least 2.3% better than those of the other compared models.
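The three error indicators named above can be computed as in the sketch below, which uses their standard definitions. The fare values are hypothetical and only illustrate the calculation; they are not results from the paper.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the fares."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted fares for four departures.
actual_fares = [100.0, 150.0, 200.0, 250.0]
predicted_fares = [110.0, 140.0, 230.0, 250.0]

fare_mse = mse(actual_fares, predicted_fares)
fare_rmse = rmse(actual_fares, predicted_fares)
fare_mae = mae(actual_fares, predicted_fares)
```

MSE and RMSE penalize the single large miss (30) more heavily than MAE does, which is why papers often report all three.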


2021 ◽  
Vol 4 ◽  
Author(s):  
Sebastian Jäger ◽  
Arndt Allhorn ◽  
Felix Bießmann

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only test data or both train and test data are affected by missing data. Each imputation method is evaluated regarding the imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers to guide their data preprocessing method selection for automated data quality improvement.


2021 ◽  
Vol 28 (1) ◽  
Author(s):  
Daniel Honsel ◽  
Verena Herbold ◽  
Stephan Waack ◽  
Jens Grabowski

Abstract: To guide software development, estimating the impact of decision making on the development process can be helpful in planning. For this estimation, prediction models learned from project data are often used. In this paper, an approach for using agent-based simulation to predict software evolution trends is presented. The distinctive feature of the proposed approach is the automated parameter estimation for the instantiation of project-specific simulation models. We want to assess how well a baseline model using the average (commit) behavior of the agents (i.e., the developers) performs compared to models where different amounts of project-specific data are fed into the simulation model. The approach involves the interplay between the mining framework and the simulation framework. Parameters to be estimated include, e.g., file change probabilities of developers and the team constellation reflecting different developer roles. The structural evolution of software projects is observed using change coupling graphs based on common file changes. To validate the simulation results, we compare empirical with simulated results. Our results show that an average simulation model can mimic general project growth trends, such as the number of commits and files, well and thus can help project managers in, e.g., controlling the onboarding of developers. In addition, the simulated co-change evolution could be improved significantly using project-specific data.


2016 ◽  
pp. 55-94
Author(s):  
Pier Luigi Marchini ◽  
Carlotta D'Este

The reporting of comprehensive income is becoming increasingly important. After the introduction of Other Comprehensive Income (OCI) reporting, as required by the 2007 revision of IAS 1, the IASB is currently seeking input from investors on the usefulness of unrealized gains and losses and on the role of comprehensive income. This is of particular relevance in code law countries, as local pre-IFRS accounting models influence financial statement preparers and users. This study investigates the role played by the reporting of unrealized gains and losses in users' decision process by examining the impact of OCI on the RoE ratio of Italian listed companies, by surveying a sample of financial analysts, and by content-analysing their formal reports. The results show that the reporting of comprehensive income does not affect the financial statement users' decision process, although it has a statistically significant effect on Italian listed entities' performance.

