On Estimating the Maximum Domination Value and the Skyline Cardinality of Multi-Dimensional Data Sets

2013 ◽  
Vol 3 (4) ◽  
pp. 61-83 ◽  
Author(s):  
Eleftherios Tiakas ◽  
Apostolos N. Papadopoulos ◽  
Yannis Manolopoulos

In recent years there has been increasing interest in query processing techniques that take into consideration the dominance relationship between items to select the most promising ones, based on user preferences. Skyline and top-k dominating queries are examples of such techniques. A skyline query computes the items that are not dominated, whereas a top-k dominating query returns the k items with the highest domination score. To enable query optimization, it is important to estimate the expected number of skyline items as well as the maximum domination value of an item. In this article, the authors provide an estimation of the maximum domination value under the distinct values and attribute independence assumptions. They present three different methodologies for estimating and calculating the maximum domination value and test their performance and accuracy. Among the proposed estimation methods, their Estimation with Roots method outperforms the others and returns the most accurate results. They also introduce the eliminating dimension, i.e., the dimension beyond which all domination values become zero, and provide an efficient estimation of that dimension. Moreover, the authors provide an accurate estimation of the skyline cardinality of a data set.
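As a hedged illustration of the two query types discussed above (not of the authors' estimation methods), the following Python sketch computes the skyline and the per-item domination scores of a small synthetic data set; the smaller-is-better dominance convention and all names are assumptions made for the example.

```python
import numpy as np

def dominates(a, b):
    """a dominates b if a is <= b in every dimension and < in at least one
    (smaller-is-better convention, assumed here for illustration)."""
    return np.all(a <= b) and np.any(a < b)

def skyline(points):
    """Return the items that are not dominated by any other item."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

def domination_scores(points):
    """Domination score of an item = number of items it dominates; a top-k
    dominating query returns the k items with the highest score."""
    return [sum(dominates(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

rng = np.random.default_rng(0)
data = rng.random((50, 3))            # 50 items, 3 independent attributes
print("skyline size:", len(skyline(data)))
print("maximum domination value:", max(domination_scores(data)))
```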

Geophysics ◽  
2018 ◽  
Vol 83 (4) ◽  
pp. M41-M48 ◽  
Author(s):  
Hongwei Liu ◽  
Mustafa Naser Al-Ali

The ideal approach for continuous reservoir monitoring allows the generation of fast and accurate images to cope with the massive data sets acquired for such a task. Conventionally, rigorous depth-oriented velocity-estimation methods are performed to produce sufficiently accurate velocity models. Unlike the traditional approach, target-oriented imaging technology based on common-focus-point (CFP) theory can be an alternative for continuous reservoir monitoring. The solution is based on a robust, data-driven iterative operator-updating strategy that does not require a detailed velocity model. The same focusing operator is applied to successive 3D seismic data sets for the first time to generate efficient and accurate 4D target-oriented seismic stacked images from time-lapse field seismic data sets acquired in a [Formula: see text] injection project in Saudi Arabia. Using the focusing operator, target-oriented prestack angle-domain common-image gathers (ADCIGs) can be derived to perform amplitude-versus-angle analysis. To preserve the amplitude information in the ADCIGs, an amplitude-balancing factor is applied by embedding a synthetic data set using the real acquisition geometry to remove the geometry imprint artifact. Applying the CFP-based target-oriented imaging to time-lapse data sets revealed changes at the reservoir level in the poststack and prestack time-lapse signals, which are consistent with the [Formula: see text] injection history and rock physics.
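As a loose, hypothetical illustration of the amplitude-balancing step described above (not the authors' implementation), the sketch below derives per-angle correction factors from a synthetic angle gather modeled with the acquisition geometry and applies them to a field gather; all array names, shapes, and numbers are assumed.

```python
import numpy as np

# Hypothetical angle-domain common-image gathers: rows = depth samples, cols = angles.
# In the workflow described above, the synthetic gather is modeled with the real
# acquisition geometry, so its amplitude variation with angle reflects the geometry
# imprint rather than the subsurface.
rng = np.random.default_rng(1)
angles = np.linspace(0.0, 1.0, 31)
synthetic_adcig = 1.0 + 0.3 * angles + 0.01 * rng.standard_normal((200, 31))
field_adcig = rng.standard_normal((200, 31)) * (1.0 + 0.3 * angles)

# Per-angle balancing factor: normalize the synthetic RMS amplitude to a flat response.
rms_synthetic = np.sqrt(np.mean(synthetic_adcig ** 2, axis=0))
balance = rms_synthetic.mean() / rms_synthetic

# Apply the factors to the field gather to suppress the geometry imprint
# before amplitude-versus-angle analysis.
balanced_adcig = field_adcig * balance
```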


Endocrinology ◽  
2019 ◽  
Vol 160 (10) ◽  
pp. 2395-2400 ◽  
Author(s):  
David J Handelsman ◽  
Lam P Ly

Hormone assay results below the assay detection limit (DL) can introduce bias into quantitative analysis. Although complex maximum likelihood estimation methods exist, they are not widely used, whereas simple substitution methods are often used ad hoc to replace the undetectable (UD) results with numeric values to facilitate data analysis with the full data set. However, the bias of substitution methods for steroid measurements has not been reported. Using a large data set (n = 2896) of serum testosterone (T), DHT, and estradiol (E2) concentrations from healthy men, we created modified data sets with increasing proportions of UD samples (≤40%) to which we applied five different substitution methods (deleting UD samples as missing, or substituting UD samples with DL, DL/√2, DL/2, or 0) to calculate univariate descriptive statistics (mean, SD) or bivariate correlations. For all three steroids and for univariate as well as bivariate statistics, bias increased progressively with increasing proportion of UD samples. Bias was worst when UD samples were deleted or substituted with 0 and least when UD samples were substituted with DL/√2, whereas the other methods (DL or DL/2) displayed intermediate bias. Similar findings were replicated in randomly drawn small subsets of 25, 50, and 100 samples. Hence, we propose that in steroid hormone data with ≤40% UD samples, substituting UD with DL/√2 is a simple, versatile, and reasonably accurate method to minimize left-censoring bias, allowing data analysis with the full data set.
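A minimal sketch of the substitution strategies compared above, assuming a one-dimensional array of concentrations and a known detection limit; the helper function name and example values are hypothetical.

```python
import numpy as np

def substitute_undetectable(values, dl, method="dl_sqrt2"):
    """Replace results below the detection limit (DL) before computing statistics.

    method: 'delete'   - drop undetectable samples as missing
            'dl'       - substitute with DL
            'dl_sqrt2' - substitute with DL/sqrt(2) (least biased in the study above)
            'dl_2'     - substitute with DL/2
            'zero'     - substitute with 0
    """
    values = np.asarray(values, dtype=float)
    below = values < dl
    if method == "delete":
        return values[~below]
    fill = {"dl": dl, "dl_sqrt2": dl / np.sqrt(2), "dl_2": dl / 2, "zero": 0.0}[method]
    return np.where(below, fill, values)

testosterone = np.array([0.05, 0.4, 2.1, 3.7, 0.02, 1.5])   # illustrative values
dl = 0.1                                                    # hypothetical detection limit
clean = substitute_undetectable(testosterone, dl)
print(clean.mean(), clean.std(ddof=1))
```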


Entropy ◽  
2020 ◽  
Vol 22 (2) ◽  
pp. 186 ◽  
Author(s):  
Ki-Soon Yu ◽  
Sung-Hyun Kim ◽  
Dae-Woon Lim ◽  
Young-Sik Kim

In this paper, we propose an intrusion detection system based on the estimation of the Rényi entropy with multiple orders. The Rényi entropy is a generalized notion of entropy that includes the Shannon entropy and the min-entropy as special cases. In 2018, Kim proposed an efficient estimation method for the Rényi entropy with an arbitrary real order α. In this work, we utilize this method to construct a multiple-order, Rényi-entropy-based intrusion detection system (IDS) for vehicular systems with various network connections. The proposed method estimates the Rényi entropies simultaneously with three distinct orders, two, three, and four, based on the controller area network (CAN) IDs of consecutively generated frames. The collected frames are split into blocks with a fixed number of frames, and the entropies are evaluated on these blocks. For more accurate detection of each type of attack, we also propose a retrospective sliding-window method that decides whether an attack has occurred based on the estimated entropies. For a fair comparison, we used the CAN-ID attack data set generated by a research team from Korea University. Our results show that the proposed method achieves false negative and false positive error rates of less than 1% simultaneously.
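The following sketch illustrates the general idea of block-wise, multiple-order Rényi entropy monitoring of CAN IDs; it uses a simple plug-in (empirical) estimator rather than the estimator of Kim (2018), and the block size, orders, and toy traffic are assumptions.

```python
import numpy as np
from collections import Counter

def renyi_entropy(block, alpha):
    """Plug-in estimate of the Rényi entropy of order alpha for one block of CAN IDs
    (a simplified empirical estimator, not the exact method of Kim (2018))."""
    counts = np.array(list(Counter(block).values()), dtype=float)
    p = counts / counts.sum()
    if alpha == 1:
        return -np.sum(p * np.log2(p))          # Shannon entropy as the limit case
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def block_entropies(can_ids, block_size=200, orders=(2, 3, 4)):
    """Split consecutively captured CAN IDs into fixed-size blocks and
    evaluate the entropies of several orders on each block."""
    starts = range(0, len(can_ids) - block_size + 1, block_size)
    blocks = [can_ids[i:i + block_size] for i in starts]
    return np.array([[renyi_entropy(b, a) for a in orders] for b in blocks])

# Toy traffic: normal IDs, then a flood of a single injected ID lowers the entropy.
rng = np.random.default_rng(2)
normal = rng.integers(0, 60, size=2000)
attack = np.full(400, 7)
entropies = block_entropies(np.concatenate([normal, attack]))
print(entropies)     # orders 2, 3, 4 per block; a sharp drop flags a suspicious block
```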


Author(s):  
Xingjie Fang ◽  
Liping Wang ◽  
Don Beeson ◽  
Gene Wiggs

Radial Basis Function (RBF) metamodels have recently attracted increased interest due to their significant advantages over other types of non-parametric metamodels. However, because of the interpolating nature of the RBF mathematics, the accuracy of the model may deteriorate dramatically if the training data set contains duplicate information, noise, or outliers. Constructing the metamodel may also be time consuming whenever the training data sets are large or a high-dimensional model is required. In this paper, we propose a robust and efficient RBF metamodeling approach based on data pre-processing techniques that alleviate the accuracy and efficiency issues commonly encountered when RBF models are used in typical real engineering situations. These techniques include (1) the removal of duplicate training data information, (2) the generation of smaller, uniformly distributed subsets of training data from large data sets, and (3) the quantification and identification of outliers by principal component analysis (PCA) and Hotelling statistics. Simulation results are used to validate the generalization accuracy and efficiency of the proposed approach.
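A hedged sketch of two of the pre-processing steps named above, duplicate removal and PCA/Hotelling outlier screening (the uniform subsampling step is omitted for brevity); the chi-square cutoff for the Hotelling statistic is a simplifying assumption, not necessarily the authors' choice.

```python
import numpy as np
from scipy.stats import chi2

def remove_duplicates(X, y):
    """Step 1: drop training points with duplicate inputs."""
    _, idx = np.unique(np.round(X, 8), axis=0, return_index=True)
    keep = np.sort(idx)
    return X[keep], y[keep]

def hotelling_outliers(X, quantile=0.975):
    """Step 3: flag outliers via PCA scores and the Hotelling T^2 statistic
    (chi-square cutoff used here as a simplifying assumption)."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                        # principal component scores
    var = s ** 2 / (len(X) - 1)               # variance explained per component
    t2 = np.sum(scores ** 2 / var, axis=1)    # Hotelling T^2 per sample
    return t2 > chi2.ppf(quantile, df=X.shape[1])

rng = np.random.default_rng(3)
X = rng.random((300, 4))
y = X.sum(axis=1)
X, y = remove_duplicates(X, y)
mask = hotelling_outliers(X)
print(mask.sum(), "points flagged as outliers")
```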


2018 ◽  
Vol 52 (1) ◽  
pp. 43-59
Author(s):  
AMULYA KUMAR MAHTO ◽  
YOGESH MANI TRIPATH ◽  
SANKU DEY

The Burr type X distribution is one of the members of the Burr family, originally derived by Burr (1942), and can be used quite effectively in modelling strength data as well as general lifetime data. In this article, we consider efficient estimation of the probability density function (PDF) and cumulative distribution function (CDF) of the Burr X distribution. Eight different estimation methods are considered, namely maximum likelihood estimation, uniformly minimum variance unbiased estimation, least squares estimation, weighted least squares estimation, percentile estimation, maximum product estimation, Cramér-von Mises estimation, and Anderson-Darling estimation. Analytic expressions for bias and mean squared error are derived. Monte Carlo simulations are performed to compare the performances of the proposed estimation methods for both small and large samples. Finally, a real data set is analyzed for illustrative purposes.
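For reference, assuming the one-parameter (unit-scale) form of the Burr type X distribution (the article may work with a scaled version), the CDF and PDF, and the closed-form maximum likelihood estimator obtained by setting the derivative of the log-likelihood with respect to θ to zero, are:

```latex
F(x;\theta) = \left(1 - e^{-x^{2}}\right)^{\theta}, \qquad
f(x;\theta) = 2\theta\, x\, e^{-x^{2}} \left(1 - e^{-x^{2}}\right)^{\theta-1},
\qquad x > 0,\ \theta > 0,
\qquad
\hat{\theta}_{\mathrm{ML}} = -\,\frac{n}{\sum_{i=1}^{n} \log\!\left(1 - e^{-x_i^{2}}\right)} .
```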


2020 ◽  
Author(s):  
Michał Ciach ◽  
Błażej Miasojedow ◽  
Grzegorz Skoraczyński ◽  
Szymon Majewski ◽  
Michał Startek ◽  
...  

A common theme in many applications of computational mass spectrometry is fitting a linear combination of reference spectra to an experimental one in order to estimate the quantities of different ions, potentially with overlapping isotopic envelopes. In this work, we study this procedure in an abstract setting, in order to develop new approaches applicable to a diverse range of experiments. We introduce an application of a new spectral dissimilarity measure, known in other fields as the Wasserstein or Earth Mover’s distance, in order to overcome the sensitivity of ordinary linear regression to measurement inaccuracies. Using a data set of 200 mass spectra, we demonstrate that our approach is capable of accurately estimating ion proportions without the extensive pre-processing required for state-of-the-art methods. The conclusions are further substantiated using data sets simulated in a way that mimics most of the measurement inaccuracies occurring in real experiments. We have implemented our methods in a Python 3 package, freely available at https://github.com/mciach/masserstein.
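As an illustrative, hypothetical sketch of the general approach (not the masserstein package itself), the code below fits mixture proportions of two toy reference spectra by minimizing the Wasserstein distance between the observed spectrum and the mixture, using SciPy's generic routines.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.optimize import minimize

# Toy reference spectra on a common m/z grid (two Gaussian-shaped envelopes).
mz = np.linspace(100, 110, 201)
ref1 = np.exp(-0.5 * ((mz - 103) / 0.3) ** 2)
ref2 = np.exp(-0.5 * ((mz - 105) / 0.3) ** 2)
observed = 0.7 * ref1 + 0.3 * ref2 + 0.01 * np.random.default_rng(4).random(mz.size)

def loss(p):
    """Wasserstein distance between the observed spectrum and the mixture with
    proportions p, both treated as distributions over the m/z axis."""
    mix = p[0] * ref1 + p[1] * ref2
    return wasserstein_distance(mz, mz, u_weights=observed, v_weights=mix)

# Proportions constrained to the simplex: non-negative and summing to one.
res = minimize(loss, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)],
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1})
print("estimated proportions:", res.x)
```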


2018 ◽  
Vol 7 (6) ◽  
pp. 33
Author(s):  
Morteza Marzjarani

Selecting a proper model for a data set is a challenging task. In this article, an attempt was made to address this problem and to find a suitable model for a given data set. A general linear model (GLM) was introduced along with three different methods for estimating the parameters of the model. The three estimation methods considered in this paper were ordinary least squares (OLS), generalized least squares (GLS), and feasible generalized least squares (FGLS). In the case of GLS, two different weights were selected for reducing the severity of heteroscedasticity, and the proper weight(s) were deployed. The third weight was selected through the application of FGLS. Analyses showed that only two of the three weights, including the FGLS weight, were effective in reducing the severity of heteroscedasticity. In addition, each data set was divided into training, validation, and testing sets, producing a more reliable set of estimates for the parameters in the model. Partitioning data is a relatively new approach in statistics borrowed from the field of machine learning. Stepwise and forward selection methods, along with a number of statistics including the testing average square error (ASE), adjusted R-squared, AIC, AICC, and validation ASE, together with proper hierarchies, were deployed to select a more appropriate model(s) for a given data set. Furthermore, the response variable in both data files was transformed using the Box-Cox method to meet the assumption of normality. Analysis showed that the logarithmic transformation solved this issue in a satisfactory manner. Since the issues of heteroscedasticity, model selection, and partitioning of data have not been addressed in fisheries, for introduction and demonstration purposes only, the 2015 and 2016 shrimp data in the Gulf of Mexico (GOM) were selected and the above methods were applied to these data sets. In conclusion, some variations of the GLM were identified as possible leading candidates for the above data sets.
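A minimal sketch contrasting OLS with a two-step feasible GLS fit on heteroscedastic toy data; the log-squared-residual variance model is a common textbook weighting scheme and not necessarily the weighting used in the article.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
# Heteroscedastic errors: the error variance grows with x.
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

# Ordinary least squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Feasible GLS, two-step: regress log squared residuals on x to estimate the
# variance function, then reweight the observations by the inverse variances.
gamma, *_ = np.linalg.lstsq(X, np.log(resid ** 2), rcond=None)
weights = 1.0 / np.exp(X @ gamma)            # estimated inverse variances
w = np.sqrt(weights)
beta_fgls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print("OLS :", beta_ols)
print("FGLS:", beta_fgls)
```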


Author(s):  
JINGZHOU LI ◽  
GUENTHER RUHE

Estimation by analogy (EBA) predicts effort for a new project by learning from the performance of former projects. This is done by aggregating effort information of similar projects from a given historical data set that contains projects, or objects in general, and attributes describing the objects. While this has been successful in general, existing research results have shown that a carefully selected subset, as well as weighting, of the attributes may improve the performance of the estimation methods. In order to improve the estimation accuracy of our former proposed EBA method AQUA, which supports data sets that have non-quantitative and missing values, an attribute weighting method using rough set analysis is proposed in this paper. AQUA is thus extended to AQUA+ by incorporating the proposed attribute weighting and selection method. Better prediction accuracy was obtained by AQUA+ compared to AQUA for five data sets. The proposed method for attribute weighting and selection is effective in that (1) it supports data sets that have non-quantitative and missing values; (2) it supports attribute selection as well as weighting, which are not supported simultaneously by other attribute selection methods; and (3) it helps AQUA+ to produce better performance.
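A hedged sketch of the core analogy-based estimation step, predicting effort from the k most similar historical projects under a weighted distance; the attribute weights here are given constants, whereas AQUA+ derives them by rough set analysis, and all data are hypothetical.

```python
import numpy as np

def estimate_by_analogy(new_project, historical, efforts, weights, k=3):
    """Predict effort for a new project as the mean effort of its k most similar
    historical projects, using per-attribute weights in the distance
    (the weights would come from rough set analysis in AQUA+; here they are given)."""
    diffs = historical - new_project
    dists = np.sqrt(np.sum(weights * diffs ** 2, axis=1))   # weighted Euclidean distance
    nearest = np.argsort(dists)[:k]
    return efforts[nearest].mean()

# Toy historical data set: 3 normalized attributes per project, effort in person-months.
historical = np.array([[0.2, 0.4, 0.1], [0.8, 0.5, 0.9], [0.3, 0.3, 0.2], [0.7, 0.6, 0.8]])
efforts = np.array([12.0, 40.0, 15.0, 38.0])
weights = np.array([0.6, 0.1, 0.3])     # hypothetical attribute weights
print(estimate_by_analogy(np.array([0.25, 0.35, 0.15]), historical, efforts, weights))
```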


1995 ◽  
Vol 43 (2) ◽  
pp. 101-110
Author(s):  
Gustavo Goni ◽  
Guillermo Podesta ◽  
Otis Brown ◽  
James Brown

Orbit error is one of the largest sources of uncertainty in studies of ocean dynamics using satellite altimeters. The sensitivity of GEOSAT mesoscale ocean variability estimates to altimeter orbit precision in the SW Atlantic is analyzed using three GEOSAT data sets derived from different orbit estimation methods: (a) the original GDR data set, which has the lowest orbit precision, (b) the GEM-T2 set, constructed from a much more precise orbital model, and (c) the Sirkes-Wunsch data set, derived from additional spectral analysis of the GEM-T2 data set. Differences among the data sets are investigated for two tracks in dynamically dissimilar regimes of the Southwestern Atlantic Ocean, by comparing: (a) distinctive features of the average power density spectra of the sea height residuals and (b) space-time diagrams of sea height residuals. The variability estimates produced by the three data sets are extremely similar in both regimes after removal of the time-dependent component of the orbit error using a quadratic fit. Our results indicate that altimeter orbit precision with appropriate processing plays only a minor role in studies of mesoscale ocean variability.
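A minimal sketch, with assumed toy numbers, of removing the time-dependent orbit-error component from sea height residuals by a quadratic fit along track, as described in the processing above.

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 1.0, 400)                     # along-track coordinate (arbitrary units)
mesoscale = 0.15 * np.sin(2 * np.pi * 12 * t)      # toy mesoscale sea-height signal (m)
orbit_error = 0.8 + 0.5 * t - 0.9 * t ** 2         # slowly varying orbit error (m)
residuals = mesoscale + orbit_error + 0.02 * rng.standard_normal(t.size)

# Remove the long-wavelength orbit error with a quadratic fit,
# leaving the mesoscale variability largely intact.
coeffs = np.polyfit(t, residuals, deg=2)
corrected = residuals - np.polyval(coeffs, t)
print("std before:", residuals.std(), "after:", corrected.std())
```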


2019 ◽  
Vol 2019 (1) ◽  
pp. 360-368
Author(s):  
Mekides Assefa Abebe ◽  
Jon Yngve Hardeberg

Different whiteboard image degradations severely reduce the legibility of pen-stroke content as well as the overall quality of the images. Consequently, different researchers have addressed the problem through various image enhancement techniques. Most state-of-the-art approaches apply common image processing techniques such as background-foreground segmentation, text extraction, contrast and color enhancement, and white balancing. However, such conventional enhancement methods are incapable of recovering severely degraded pen-stroke content and produce artifacts in the presence of complex pen-stroke illustrations. In order to surmount such problems, the authors have proposed a deep-learning-based solution. They have contributed a new whiteboard image data set and adopted two deep convolutional neural network architectures for whiteboard image quality enhancement. Their evaluations of the trained models demonstrated superior performance over the conventional methods.

