SOFTWARE EFFORT ESTIMATION BY ANALOGY USING ATTRIBUTE SELECTION BASED ON ROUGH SET ANALYSIS

Author(s):  
JINGZHOU LI ◽  
GUENTHER RUHE

Estimation by analogy (EBA) predicts the effort for a new project by learning from the performance of past projects. This is done by aggregating effort information from similar projects in a given historical data set that contains projects, or objects in general, and attributes describing those objects. While this approach has generally been successful, existing research has shown that a carefully selected and weighted subset of the attributes may improve the performance of the estimation methods. To improve the estimation accuracy of our previously proposed EBA method AQUA, which supports data sets with non-quantitative and missing values, this paper proposes an attribute weighting method based on rough set analysis. AQUA is thus extended to AQUA+ by incorporating the proposed attribute weighting and selection method. AQUA+ obtained better prediction accuracy than AQUA on five data sets. The proposed method for attribute weighting and selection is effective in that (1) it supports data sets with non-quantitative and missing values; (2) it supports both attribute selection and weighting, which are not supported simultaneously by other attribute selection methods; and (3) it helps AQUA+ produce better performance.
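
As a concrete illustration of the analogy principle (a minimal sketch only; the similarity measure, the treatment of non-quantitative values, and the rough-set weighting in AQUA/AQUA+ are more elaborate), the following fragment estimates effort as a similarity-weighted mean of the k most similar past projects, skipping attributes with missing values:

```python
import numpy as np

def analogy_estimate(new_proj, history, efforts, weights, k=3):
    """Similarity-weighted analogy estimate. Attributes are assumed
    normalized to [0, 1]; missing values are encoded as np.nan and
    simply skipped when computing similarity."""
    sims = []
    for past in history:
        num = den = 0.0
        for a, (x, y) in enumerate(zip(new_proj, past)):
            if np.isnan(x) or np.isnan(y):
                continue  # ignore attributes with missing values
            num += weights[a] * (1.0 - abs(x - y))
            den += weights[a]
        sims.append(num / den if den else 0.0)
    top = np.argsort(sims)[-k:]  # indices of the k most similar projects
    s = np.asarray(sims)[top]
    return float(s @ np.asarray(efforts)[top] / s.sum())
```

Attribute weights derived from rough set analysis would be plugged in as the weights vector; setting a weight to zero then amounts to attribute selection.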

Geophysics ◽  
2018 ◽  
Vol 83 (4) ◽  
pp. M41-M48 ◽  
Author(s):  
Hongwei Liu ◽  
Mustafa Naser Al-Ali

The ideal approach for continuous reservoir monitoring allows generation of fast and accurate images to cope with the massive data sets acquired for such a task. Conventionally, rigorous depth-oriented velocity-estimation methods are performed to produce sufficiently accurate velocity models. Unlike the traditional approach, target-oriented imaging technology based on the common-focus-point (CFP) theory offers an alternative for continuous reservoir monitoring. The solution is based on a robust, data-driven, iterative operator-updating strategy that does not require a detailed velocity model. For the first time, the same focusing operator is applied to successive 3D seismic data sets to generate efficient and accurate 4D target-oriented seismic stacked images from time-lapse field seismic data sets acquired in a CO2 injection project in Saudi Arabia. Using the focusing operator, target-oriented prestack angle-domain common-image gathers (ADCIGs) can be derived to perform amplitude-versus-angle analysis. To preserve the amplitude information in the ADCIGs, an amplitude-balancing factor is applied by embedding a synthetic data set using the real acquisition geometry to remove the geometry imprint artifact. Applying CFP-based target-oriented imaging to the time-lapse data sets revealed changes at the reservoir level in the poststack and prestack time-lapse signals, consistent with the CO2 injection history and rock physics.
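
The role of the amplitude-balancing factor can be sketched as follows (our reading of the procedure, with invented array names and a unit-reflectivity assumption for the synthetic gather; the paper's actual workflow may differ): a synthetic data set modeled through the real acquisition geometry carries only the geometry imprint, so dividing it out of the real gathers balances the amplitudes.

```python
import numpy as np

def amplitude_balance(real_adcig, synthetic_adcig, eps=1e-8):
    """Remove the acquisition-geometry imprint from an angle gather.
    synthetic_adcig: gather modeled with unit reflectivity through the
    real geometry, so its amplitude variation is pure geometry imprint."""
    factor = 1.0 / (np.abs(synthetic_adcig) + eps)  # per-sample balancing factor
    return real_adcig * factor
```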


Author(s):  
Emilia Mendes

Although numerous studies on Web effort estimation have been carried out to date, there is no consensus on what constitutes the best effort estimation technique to be used by Web companies. It seems that not only the effort estimation technique itself can influence the accuracy of predictions, but also the characteristics of the data set used (e.g., skewness, collinearity; Shepperd & Kadoda, 2001). Therefore, it is often necessary to compare different effort estimation techniques, looking for those that provide the best estimation accuracy for the data set being employed. With this in mind, the use of graphical aids such as boxplots is not always enough to assess the existence of significant differences between effort prediction models. The same applies to measures of prediction accuracy such as the mean magnitude of relative error (MMRE), the median magnitude of relative error (MdMRE), and prediction at level l (Pred(l), commonly Pred(25)). Other techniques, belonging to the group of statistical significance tests, need to be employed to check whether the residuals obtained for each of the compared effort estimation techniques come from the same population. This chapter details how to use such techniques and how their results should be interpreted.
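
A minimal sketch of the accuracy measures and of one such significance test (the Wilcoxon signed-rank test on absolute residuals, a common non-parametric choice; the chapter may cover others):

```python
import numpy as np
from scipy import stats

def accuracy_stats(actual, predicted):
    """MMRE, MdMRE and Pred(25) for one estimation technique."""
    mre = np.abs(actual - predicted) / actual  # magnitude of relative error
    return {"MMRE": mre.mean(),
            "MdMRE": np.median(mre),
            "Pred(25)": np.mean(mre <= 0.25)}

def compare_residuals(actual, pred_a, pred_b):
    """Paired Wilcoxon signed-rank test on absolute residuals: a low
    p-value suggests the two techniques' residuals do not come from
    the same population (non-parametric, so no normality assumption)."""
    res_a = np.abs(actual - pred_a)
    res_b = np.abs(actual - pred_b)
    return stats.wilcoxon(res_a, res_b)
```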


Symmetry ◽  
2020 ◽  
Vol 12 (4) ◽  
pp. 669 ◽  
Author(s):  
Eunseo Oh ◽  
Hyunsoo Lee

The developments in the fields of the industrial Internet of Things (IIoT) and big data technologies have made it possible to collect large amounts of meaningful industrial process and quality-based data. The gathered data are analyzed using contemporary statistical methods and machine learning techniques, and the extracted knowledge can be used for predictive maintenance or prognostic health management. However, it is difficult to gather complete data due to several issues in IIoT, such as devices breaking down, running out of battery, or undergoing scheduled maintenance. Data with missing values are often ignored, as they may contain insufficient information from which to draw conclusions. To overcome these issues, we propose a novel, effective missing data handling mechanism based on the concepts of symmetry principles. While other existing methods attempt only to estimate the missing parts, the proposed method generates a whole data set using Gaussian process regression and a generative adversarial network. To prove the effectiveness of the proposed framework, we examine a real-world industrial case involving an air pressure system (APS), where we use the proposed method to make quality predictions and compare the results with existing state-of-the-art estimation methods.
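
The Gaussian-process half of such a pipeline can be sketched as below (a simplified illustration only: the kernel choice, the column-wise loop, and the omission of the GAN stage are our assumptions, not the authors' design):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gpr_impute(X):
    """Fill missing entries (np.nan) of each column by regressing it on
    the remaining columns with Gaussian process regression."""
    X = X.copy()
    for col in range(X.shape[1]):
        miss = np.isnan(X[:, col])
        if not miss.any():
            continue
        other = np.delete(X, col, axis=1)
        rows_ok = ~np.isnan(other).any(axis=1)  # rows with complete predictors
        train, pred = rows_ok & ~miss, rows_ok & miss
        if train.any() and pred.any():
            gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
            gpr.fit(other[train], X[train, col])
            X[pred, col] = gpr.predict(other[pred])
    return X
```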


2013 ◽  
Vol 3 (4) ◽  
pp. 61-83 ◽  
Author(s):  
Eleftherios Tiakas ◽  
Apostolos N. Papadopoulos ◽  
Yannis Manolopoulos

In recent years there has been increasing interest in query processing techniques that take into consideration the dominance relationship between items in order to select the most promising ones based on user preferences. Skyline and top-k dominating queries are examples of such techniques. A skyline query computes the items that are not dominated, whereas a top-k dominating query returns the k items with the highest domination score. To enable query optimization, it is important to estimate the expected number of skyline items as well as the maximum domination value of an item. In this article, the authors provide an estimation of the maximum domination value under the distinct-values and attribute-independence assumptions. They present three different methodologies for estimating and calculating the maximum domination value and test their performance and accuracy. Among the proposed estimation methods, their Estimation with Roots method outperforms all others and returns the most accurate results. They also introduce the eliminating dimension, i.e., the dimension beyond which all domination values become zero, and provide an efficient estimation of that dimension. Moreover, the authors provide an accurate estimation of the skyline cardinality of a data set.
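
For reference, the quantities being estimated can be computed exactly, at quadratic cost, which is precisely why estimation matters on large data sets; a minimal sketch, assuming smaller attribute values are preferred:

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (smaller values preferred here)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Items not dominated by any other item."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def max_domination_value(points):
    """Exact maximum domination score via a brute-force O(n^2) scan."""
    return max(sum(dominates(p, q) for q in points) for p in points)

pts = [(1, 4), (2, 2), (3, 1), (4, 4)]
print(skyline(pts))               # [(1, 4), (2, 2), (3, 1)]
print(max_domination_value(pts))  # 1: each skyline point dominates only (4, 4)
```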


Endocrinology ◽  
2019 ◽  
Vol 160 (10) ◽  
pp. 2395-2400 ◽  
Author(s):  
David J Handelsman ◽  
Lam P Ly

Hormone assay results below the assay detection limit (DL) can introduce bias into quantitative analysis. Although complex maximum likelihood estimation methods exist, they are not widely used, whereas simple substitution methods are often used ad hoc to replace the undetectable (UD) results with numeric values to facilitate data analysis with the full data set. However, the bias of substitution methods for steroid measurements has not been reported. Using a large data set (n = 2896) of serum testosterone (T), DHT, and estradiol (E2) concentrations from healthy men, we created modified data sets with increasing proportions of UD samples (≤40%) to which we applied five different substitution methods (deleting UD samples as missing, or substituting UD samples with DL, DL/√2, DL/2, or 0) to calculate univariate descriptive statistics (mean, SD) or bivariate correlations. For all three steroids and for univariate as well as bivariate statistics, bias increased progressively with increasing proportion of UD samples. Bias was worst when UD samples were deleted or substituted with 0 and least when UD samples were substituted with DL/√2, whereas the other methods (DL or DL/2) displayed intermediate bias. Similar findings were replicated in randomly drawn small subsets of 25, 50, and 100 samples. Hence, we propose that in steroid hormone data with ≤40% UD samples, substituting UD with DL/√2 is a simple, versatile, and reasonably accurate method to minimize left-censoring bias, allowing for data analysis with the full data set.
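
The five substitution methods are simple to apply in practice; a minimal sketch (UD results coded as NaN; function and method names are ours, and the data values are hypothetical):

```python
import numpy as np

def substitute_ud(values, dl, method="dl_sqrt2"):
    """Handle undetectable (UD) results, coded as np.nan: either delete
    them or substitute DL, DL/sqrt(2), DL/2, or 0, then analyze as usual."""
    v = np.asarray(values, dtype=float)
    ud = np.isnan(v)
    if method == "delete":
        return v[~ud]
    sub = {"dl": dl, "dl_sqrt2": dl / np.sqrt(2), "dl_2": dl / 2.0, "zero": 0.0}[method]
    out = v.copy()
    out[ud] = sub
    return out

t = [12.4, np.nan, 9.8, np.nan, 15.1]  # hypothetical T values, DL = 0.5
filled = substitute_ud(t, dl=0.5)      # DL/sqrt(2): the least-biased choice
print(filled.mean(), filled.std(ddof=1))
```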


2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

Assessing the quality of data is an important task in data mining. The validity of mining algorithms is reduced if the data are not of good quality. The quality of data can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MVs, but little attention has been paid to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel et al. groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions contain the noise objects in the data set. Extensive experiments were performed on the Iris data set from the life-science domain and on Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that the method is more noise resistant than KMI on the data sets under study.
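
A minimal sketch of the density-based imputation idea (our simplified reading; the published DBSCANI procedure may differ in detail): cluster the complete rows with DBSCAN, keep noise points out of the donor pool, and fill each incomplete row from its nearest cluster mean.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_impute(X, eps=0.5, min_samples=5):
    """Impute np.nan entries using DBSCAN clusters of the complete rows."""
    complete = ~np.isnan(X).any(axis=1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[complete])
    centroids = {c: X[complete][labels == c].mean(axis=0)
                 for c in set(labels) if c != -1}  # exclude noise points (-1)
    X_imp = X.copy()
    if not centroids:
        return X_imp
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])  # observed attributes of this row
        nearest = min(centroids,
                      key=lambda c: np.linalg.norm(X[i, obs] - centroids[c][obs]))
        X_imp[i, ~obs] = centroids[nearest][~obs]
    return X_imp
```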


Author(s):  
Fatih Yücalar ◽  
Deniz Kilinc ◽  
Emin Borandag ◽  
Akin Ozcift

Estimating the development effort of a software project in the early stages of the software life cycle is a significant task. Accurate estimates help project managers to overcome problems regarding budget and time overruns. This paper proposes a new effort estimation method based on multiple linear regression analysis, which brings a different perspective to software effort estimation methods and increases the accuracy of software effort estimation processes. The proposed method is compared with the standard Use Case Point (UCP) method, a well-known method in this area, and with the simple linear regression based effort estimation method developed by Nassif et al. To evaluate and compare the proposed method, data from 10 software projects developed by four well-established software companies in Turkey were collected and data sets were created. When the effort estimates obtained from the data sets were compared with the actual effort spent to complete the projects, the proposed method showed higher effort estimation accuracy than the other methods.
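
The core of a multiple-linear-regression estimator is compact; a hypothetical sketch (the feature names and toy numbers are invented for illustration, not taken from the paper's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [UCP size, technical complexity factor, environmental factor]
X = np.array([[310, 0.9, 1.05],
              [420, 1.1, 0.95],
              [150, 0.8, 1.10],
              [560, 1.2, 1.00]])
y = np.array([2900, 4300, 1400, 5800])  # actual effort in person-hours

model = LinearRegression().fit(X, y)            # effort = b0 + b1*x1 + b2*x2 + b3*x3
print(model.predict([[380, 1.0, 1.0]]))         # estimated effort for a new project
```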


2012 ◽  
Vol 163 (4) ◽  
pp. 119-129
Author(s):  
Fabian Kostadinov ◽  
Renato Lemm ◽  
Oliver Thees

A software tool for the estimation of wood harvesting productivity using the kNN method

For operational planning and management of wood harvests it is important to have access to reliable information on time consumption and costs. To estimate these efficiently and reliably, appropriate methods and calculation tools are needed. The present article investigates whether the k-nearest-neighbours (kNN) method is appropriate for this purpose. The kNN algorithm is first explained and then applied to two data sets, “combined cable crane and processor” and “skidder”, both containing wood harvesting figures, to determine the estimation accuracy of the method. It is shown that the kNN method's estimation accuracy lies within the same order of magnitude as that of a multiple linear regression. Advantages of the kNN method are that it is easy to understand and to visualize, and that estimation models do not become outdated, since new data sets can constantly be taken into account. The kNN Workbook, developed by the Swiss Federal Institute for Forest, Snow and Landscape Research (WSL), is a software tool with which any data set can be analysed in practice using the kNN method. This tool is also presented in the article.
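
A minimal sketch of the kNN estimation idea (the features and figures are invented for illustration; the kNN Workbook's distance weighting and attribute handling may differ):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Each row: [mean tree volume, extraction distance, slope] (hypothetical)
X = np.array([[35, 250, 0.4],
              [22, 180, 0.2],
              [48, 400, 0.6],
              [30, 300, 0.3]])
y = np.array([5.1, 3.4, 7.8, 4.9])  # time consumption, e.g. min/m^3

knn = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)
print(knn.predict([[33, 270, 0.35]]))  # estimate for a new operation
# New observations are simply appended to X and y, so the "model" never
# goes out of date -- the advantage highlighted in the article.
```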


2011 ◽  
Vol 2 (4) ◽  
pp. 12-23 ◽  
Author(s):  
Rekha Kandwal ◽  
Prerna Mahajan ◽  
Ritu Vijay

This paper revisits the problem of active learning and decision making when labeling incurs cost and unlabeled data are available in abundance. In many real-world applications, large amounts of data are available, but the cost of correctly labeling them prohibits their use. In such cases, active learning can be employed. In this paper the authors propose rough-set-based clustering using an active learning approach. They extend the basic notion of Hamming distance to propose a dissimilarity measure that helps find the approximations of clusters in a given data set. The underlying theoretical background for this decision is rough set theory. The authors investigated their algorithm on benchmark data sets from the UCI machine learning repository, with promising results.
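
The Hamming-style dissimilarity at the heart of the approach can be sketched as follows (a plain fraction-of-disagreements version; the authors' extension may weight attributes differently):

```python
def hamming_dissimilarity(x, y):
    """Fraction of categorical attributes on which two objects disagree."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Objects within a chosen dissimilarity threshold of a cluster
# representative become candidates for its rough (lower/upper)
# approximations; ambiguous objects are the ones worth querying
# for labels in the active learning loop.
print(hamming_dissimilarity(["red", "round", "small"],
                            ["red", "square", "small"]))  # 0.333...
```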

