Discovery of Knowledge by using Data Warehousing as well as ETL Processing

Testing is essential in data warehouse systems used for decision making because the accuracy, validity, and correctness of the data depend on it. Considering the characteristics and complexity of a data warehouse, this paper examines the scope of automated testing in assuring high-quality data warehouse solutions. First, a data set generator was developed to create synthetic but near-real data; anomalies in the synthesized data were then classified with the help of a hand-coded Extraction, Transformation and Loading (ETL) routine. To assure the quality of data destined for the warehouse, and to illustrate how important ETL is, a set of key test cases was identified. Automated testing procedures were then embedded in the hand-coded ETL routine to ensure data quality. Statistical analysis revealed a substantial improvement in data quality once the automated testing procedures were in place, reinforcing that automated testing gives promising results for data warehouse quality. A novel architecture was also proposed for effective and easy maintenance of distributed data. Although the desired results of this research were achieved and the findings are promising, they still need to be validated in a real-life environment, since the work was carried out in a simulated environment that may not always reflect real-world behaviour. Hence, the full potential of the proposed architecture cannot be assessed until it is deployed to manage real, globally distributed data.
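
As a hedged illustration of the kind of automated test cases such a routine might embed, the Python sketch below classifies a few common anomalies (missing business keys, duplicates, out-of-range facts, unparseable dates) before a batch is loaded; the table layout and column names are hypothetical and not taken from the paper.

```python
# Minimal sketch of automated data-quality checks embedded in a hand-coded
# ETL routine (illustrative only; column names are assumptions).
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    """Classify common anomalies before loading a batch into the warehouse."""
    report = {
        "null_keys":      int(df["customer_id"].isna().sum()),        # missing business keys
        "duplicate_keys": int(df["customer_id"].duplicated().sum()),  # uniqueness violations
        "bad_amounts":    int((df["amount"] < 0).sum()),              # out-of-range facts
        "bad_dates":      int(pd.to_datetime(df["order_date"],
                                             errors="coerce").isna().sum()),
    }
    report["clean"] = all(v == 0 for v in report.values())
    return report

# Example: quarantine a batch whose checks fail instead of loading it.
batch = pd.DataFrame({"customer_id": [1, 1, None],
                      "amount": [10.0, -5.0, 20.0],
                      "order_date": ["2024-01-01", "not a date", "2024-01-02"]})
print(run_quality_checks(batch))
```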

Author(s):  
Maurizio Pighin ◽  
Lucio Ieronutti

Data warehouses are increasingly used by commercial organizations to extract, from a huge amount of transactional data, concise information useful for supporting decision processes. However, the task of designing a data warehouse and evaluating its effectiveness is not trivial, especially in the case of large databases and in the presence of redundant information. The meaning and the quality of the selected attributes heavily influence the data warehouse's effectiveness and the quality of the derived decisions. Our research is focused on interactive methodologies and techniques targeted at supporting data warehouse design and evaluation by taking into account the quality of the initial data. In this chapter we propose an approach for supporting data warehouse development and refinement, providing practical examples and demonstrating the effectiveness of our solution. Our approach is based on two phases: the first interactively guides attribute selection by providing quantitative information that measures different statistical and syntactical aspects of the data, while the second, based on a set of 3D visualizations, gives the opportunity to refine design choices at run time according to data examination and analysis. To experiment with the proposed solutions on real data, we have developed a tool, called ELDA (EvaLuation DAta warehouse quality), that has been used for supporting data warehouse design and evaluation.
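
To make the idea of quantitative attribute indicators concrete, here is a minimal Python sketch that computes a few generic per-attribute measures (null ratio, distinct ratio, entropy). These are illustrative stand-ins, not the actual statistical and syntactical metrics implemented in ELDA.

```python
# Illustrative per-attribute indicators that could guide attribute selection
# for a warehouse schema (not the actual ELDA metrics).
import math
import pandas as pd

def attribute_indicators(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        counts = s.value_counts(dropna=True)
        probs = counts / counts.sum()
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        rows.append({
            "attribute": col,
            "null_ratio": s.isna().mean(),           # completeness
            "distinct_ratio": s.nunique() / len(s),  # redundancy / key-likeness
            "entropy_bits": entropy,                 # information content
        })
    return pd.DataFrame(rows)
```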


Author(s):  
Georgia Garani ◽  
Nunziato Cassavia ◽  
Ilias K. Savvas

Data warehouse (DW) systems provide the best solution for intelligent data analysis and decision-making. Changes applied gradually to data in real life have to be projected to the DW. Slowly changing dimension (SCD) refers to the potential volatility of DW dimension members, and the treatment of SCDs has a significant impact on the quality of data analysis. A new SCD type, Type N, is proposed in this research paper, which encapsulates volatile data into historical clusters. Type N preserves the complete history of changes; additional tables, columns, and rows are not required; extra join operations are omitted; and surrogate keys are avoided. Type N is implemented and compared to other SCD types. Good candidates for practicing SCDs are spatiotemporal objects, i.e., objects whose shape or geometry evolves slowly over time. The case study used and implemented in this paper concerns shape-shifting constructions, i.e., buildings that respond to changing weather conditions or to the way people use them. The results demonstrate the correctness and effectiveness of the proposed SCD Type N.
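
The Python sketch below contrasts a classic Type 2 dimension, which appends rows and surrogate keys on every change, with a nested "historical cluster" representation in the spirit of the proposed Type N. It is an interpretation for illustration only, not the paper's implementation, and the building attributes are made up.

```python
# Contrast sketch: classic Type 2 (new row + surrogate key per change) versus a
# nested historical-cluster representation (my interpretation of the Type N idea).
from datetime import date

# Type 2: every change appends a row and a new surrogate key.
dim_type2 = [
    {"sk": 1, "building_id": "B7", "shape": "closed", "valid_from": date(2019, 1, 1), "valid_to": date(2020, 6, 1)},
    {"sk": 2, "building_id": "B7", "shape": "open",   "valid_from": date(2020, 6, 1), "valid_to": None},
]

# Historical-cluster style: one member record; the volatile attribute keeps its own history.
dim_clustered = {
    "building_id": "B7",
    "shape_history": [
        {"value": "closed", "valid_from": date(2019, 1, 1)},
        {"value": "open",   "valid_from": date(2020, 6, 1)},
    ],
}

def shape_at(member: dict, when: date) -> str:
    """Return the attribute value valid at a given date from the history cluster."""
    valid = [h for h in member["shape_history"] if h["valid_from"] <= when]
    return max(valid, key=lambda h: h["valid_from"])["value"]

print(shape_at(dim_clustered, date(2020, 1, 1)))  # -> "closed"
```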


2020 ◽  
Vol 8 (1) ◽  
pp. T43-T53
Author(s):  
Isadora A. S. de Macedo ◽  
Jose Jadsom S. de Figueiredo ◽  
Matias C. de Sousa

Reservoir characterization requires accurate elastic logs. It is necessary to guarantee that the logging tool is stable during the drilling process to avoid compromising the measurements of the physical properties of the formation in the vicinity of the well. Irregularities along the borehole may occur, especially if the drilling device is passing through unconsolidated formations. This affects the signals recorded by the logging tool, and the measurements may be influenced more by the drilling mud than by the formation. The caliper log indicates the change in the diameter of the borehole with depth and can be used as an indicator of the quality of other logs whose data have been degraded by enlargement or shrinkage of the borehole wall. Damaged well-log data, particularly density and velocity profiles, affect the quality and accuracy of the well-to-seismic tie. To investigate the effects of borehole enlargement on the well-to-seismic tie, an analysis of density log correction was performed. This approach uses Doll's geometric factor to correct the density log for wellbore enlargement using the caliper readings. Because the wavelet is an important factor in the well tie, we tested our methodology with statistical and deterministic wavelet estimations. In both cases, the results using the real data set from the Viking Graben field, North Sea, indicated up to a 7% improvement in the correlation between the real and synthetic seismic traces for the well-to-seismic tie when the density correction was applied.
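
A schematic Python sketch of the workflow follows: the density log is corrected with a caliper-driven geometric factor and the well tie is scored by the correlation between real and synthetic traces. The geometric-factor expression here is a simplified stand-in rather than the exact Doll formulation used by the authors, and the parameter values are assumptions.

```python
# Schematic sketch of a caliper-driven density correction and a well-tie
# correlation check. The geometric-factor form below is a simplified stand-in,
# not the exact Doll expression; mud density and alpha are assumed values.
import numpy as np

def correct_density(rho_log, caliper, bit_size, rho_mud=1.1, alpha=0.1):
    """Remove the mud contribution where the hole is enlarged."""
    washout = np.clip(caliper - bit_size, 0.0, None)  # amount of enlargement
    g = 1.0 - np.exp(-alpha * washout)                # pseudo geometric factor in [0, 1)
    # Measured density is modeled as a mix of mud and formation response:
    # rho_log = (1 - g) * rho_formation + g * rho_mud, solved for rho_formation.
    return (rho_log - g * rho_mud) / (1.0 - g)

def tie_correlation(real_trace, synthetic_trace):
    """Normalized correlation at zero lag between real and synthetic traces."""
    return float(np.corrcoef(real_trace, synthetic_trace)[0, 1])
```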


2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of arriving at a good coverage of different domains with synthetic data. In this work, we propose a generic process for the generation of datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We propose and describe a generic process for the benchmarking of unsupervised outlier detection, as sketched so far. We then describe three instantiations of this generic process that generate outliers with specific characteristics, like local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of the data reconstructed in this way. Next to showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
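
As a toy instantiation of this idea, the Python sketch below reconstructs "regular" instances from a fitted Gaussian and plants outliers with a deliberately different characteristic (inflated variance). The authors' process is more general; this is only meant to illustrate the principle, and the sample sizes are arbitrary.

```python
# Toy benchmark generator in the spirit of the proposed process: regular
# instances are reconstructed from real data, outliers are planted with a
# clearly different characteristic (here, inflated covariance).
import numpy as np

def make_benchmark(real_X: np.ndarray, n_regular=1000, n_outliers=20, inflate=9.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = real_X.mean(axis=0)
    cov = np.cov(real_X, rowvar=False)
    regular = rng.multivariate_normal(mu, cov, size=n_regular)
    outliers = rng.multivariate_normal(mu, inflate * cov, size=n_outliers)
    X = np.vstack([regular, outliers])
    y = np.r_[np.zeros(n_regular, dtype=int), np.ones(n_outliers, dtype=int)]
    return X, y  # y == 1 marks the planted outliers

# Example with random data standing in for a real benchmark dataset.
X_real = np.random.default_rng(1).normal(size=(500, 3))
X_bench, y_bench = make_benchmark(X_real)
```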


Geophysics ◽  
2013 ◽  
Vol 78 (2) ◽  
pp. G15-G24 ◽  
Author(s):  
Pejman Shamsipour ◽  
Denis Marcotte ◽  
Michel Chouteau ◽  
Martine Rivest ◽  
Abderrezak Bouchedda

The flexibility of geostatistical inversions in geophysics is limited by the use of stationary covariances, which, implicitly and mostly for mathematical convenience, assume statistical homogeneity of the studied field. For fields showing sharp contrasts due, for example, to faults or folds, an approach based on the use of nonstationary covariances for cokriging inversion was developed. The approach was tested on two synthetic cases and one real data set. Inversion results based on the nonstationary covariance were compared to results from the stationary covariance for the two synthetic models; the nonstationary covariance better recovered the known synthetic models. With the real data set, the nonstationary assumption resulted in a better match with the known surface geology.
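
One common way to express a nonstationary covariance is the Paciorek-Schervish form with a position-dependent length scale, shown in 1D in the Python sketch below. It illustrates how the stationarity assumption can be relaxed; it is not necessarily the covariance model used by the authors, and the length-scale profile is made up.

```python
# Nonstationary covariance with a locally varying length scale
# (Paciorek-Schervish form in 1D), as an illustration of relaxing stationarity.
import numpy as np

def nonstationary_cov(x, length_scale, sigma2=1.0):
    """Covariance matrix for positions x with position-dependent length scale l(x)."""
    x = np.asarray(x, dtype=float)
    l = np.asarray(length_scale, dtype=float)
    li2 = l[:, None] ** 2
    lj2 = l[None, :] ** 2
    dx2 = (x[:, None] - x[None, :]) ** 2
    prefactor = np.sqrt(2.0 * l[:, None] * l[None, :] / (li2 + lj2))
    return sigma2 * prefactor * np.exp(-dx2 / (li2 + lj2))

# Shorter correlation lengths near a hypothetical fault at x = 5 allow a sharper contrast there.
x = np.linspace(0, 10, 50)
l = np.where(np.abs(x - 5) < 1, 0.3, 2.0)
C = nonstationary_cov(x, l)
```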


2012 ◽  
Vol 82 (9) ◽  
pp. 1615-1629 ◽  
Author(s):  
Bhupendra Singh ◽  
Puneet Kumar Gupta

2010 ◽  
Vol 16 (1) ◽  
pp. 43-53 ◽  
Author(s):  
Craig Yoshioka ◽  
Bridget Carragher ◽  
Clinton S. Potter

Abstract. Here we evaluate a new grid substrate developed by ProtoChips Inc. (Raleigh, NC) for cryo-transmission electron microscopy. The new grids are fabricated from doped silicon carbide using processes adapted from the semiconductor industry. A major motivating purpose in the development of these grids was to increase the low-temperature conductivity of the substrate, a characteristic that is thought to affect the appearance of beam-induced movement (BIM) in transmission electron microscope (TEM) images of biological specimens. BIM degrades the quality of data and is especially severe when frozen biological specimens are tilted in the microscope. Our results show that this new substrate does indeed have a significant impact on reducing the appearance and severity of beam-induced movement in TEM images of tilted cryo-preserved samples. Furthermore, while we have not been able to ascertain the exact causes underlying the BIM phenomenon, we have evidence that the rigidity and flatness of these grids may play a major role in its reduction. This improvement in the reliability of imaging at tilt has a significant impact on using data collection methods such as random conical tilt or orthogonal tilt reconstruction with cryo-preserved samples. Reduction in BIM also has the potential for improving the resolution of three-dimensional cryo-reconstructions in general.


1994 ◽  
Vol 1 (2/3) ◽  
pp. 182-190 ◽  
Author(s):  
M. Eneva

Abstract. Using finite data sets and study volumes of limited size may result in significant spurious effects when estimating the scaling properties of various physical processes. These effects are examined with an example featuring the spatial distribution of induced seismic activity in Creighton Mine (northern Ontario, Canada). The events studied in the present work occurred during a three-month period, March-May 1992, within a volume of approximate size 400 x 400 x 180 m³. Two sets of microearthquake locations are studied: Data Set 1 (14,338 events) and Data Set 2 (1654 events). Data Set 1 includes the more accurately located events and amounts to about 30 per cent of all recorded data. Data Set 2 represents the portion of the first data set formed by the most accurately located and strongest microearthquakes. The spatial distribution of events in the two data sets is examined for scaling behaviour using the method of generalized correlation integrals featuring various moments q. From these, generalized correlation dimensions are estimated using the slope method. Similar estimates are made for randomly generated point sets using the same numbers of events and the same study volumes as for the real data. Uniform and monofractal random distributions are used for these simulations. In addition, samples from the real data are randomly extracted and their dimension spectra are examined as well. The spectra for the uniform and monofractal random generations show spurious multifractality due only to the use of finite numbers of data points and the limited size of the study volume. Comparing these with the spectra of dimensions for Data Set 1 and Data Set 2 allows us to estimate the bias likely to be present in the estimates for the real data. The strong multifractality suggested by the spectrum for Data Set 2 appears to be largely spurious; the spatial distribution, while different from uniform, could originate from a monofractal process. The spatial distribution of microearthquakes in Data Set 1 is either monofractal as well, or only weakly multifractal. In all similar studies, comparisons of results from real data and simulated point sets may help distinguish between genuine and artificial multifractality, without necessarily resorting to large numbers of data points.
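
For readers unfamiliar with the method, the Python sketch below computes generalized correlation sums C_q(r) for a point set and estimates D_q from the slope of log C_q versus log r. The normalization details and radius range are assumptions and may differ from those used in the paper; the uniform-cube example also illustrates the finite-size bias the abstract discusses.

```python
# Sketch of generalized correlation sums C_q(r) and a slope-based estimate of D_q.
import numpy as np

def generalized_dimension(points, q, radii):
    """Estimate D_q for a point set from correlation sums of order q."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    cq = []
    for r in radii:
        inner = (d < r).sum(axis=1) - 1           # neighbors within r, excluding self
        p = inner / (n - 1)
        if q == 1:                                # information-dimension limit
            cq.append(np.exp(np.mean(np.log(np.clip(p, 1e-12, None)))))
        else:
            cq.append(np.mean(p ** (q - 1)) ** (1.0 / (q - 1)))
    slope, _ = np.polyfit(np.log(radii), np.log(cq), 1)
    return slope

rng = np.random.default_rng(0)
cube = rng.uniform(size=(2000, 3))                # uniform points in a unit cube
# D_2 should be roughly 3; finite data and volume bias it low, which is the paper's point.
print(generalized_dimension(cube, q=2, radii=np.geomspace(0.05, 0.25, 8)))
```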


2020 ◽  
Vol 12 (1) ◽  
pp. 54-61
Author(s):  
Abdullah M. Almarashi ◽  
Khushnoor Khan

The current study focused on modeling time series using the Bayesian Structural Time Series (BSTS) technique on a univariate data set. Real-life secondary data from stock prices for Flying Cement covering a period of one year was used for the analysis. Statistical results were based on simulation procedures using the Kalman filter and Markov Chain Monte Carlo (MCMC). Though the current study involved stock price data, the same approach can be applied to complex engineering processes involving lead times. Results from the current study were compared with the classical Autoregressive Integrated Moving Average (ARIMA) technique. The BSTS package run with the R software was used to work out the Bayesian posterior sampling distributions. Four BSTS models were applied to a real data set to demonstrate the working of the BSTS technique. The predictive accuracy of the competing models was assessed using forecast plots and the Mean Absolute Percent Error (MAPE). An easy-to-follow approach was adopted so that both academicians and practitioners can easily replicate the mechanism. Findings from the study revealed that, for short-term forecasting, both ARIMA and BSTS are equally good, but for long-term forecasting, BSTS with a local level is the most plausible option.
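
The comparison can be reproduced in outline with the Python sketch below, which fits a local-level structural model and an ARIMA model and scores both by MAPE on a hold-out window. Note that statsmodels fits these by maximum likelihood via the Kalman filter, whereas the R bsts package used in the study samples the posterior by MCMC, and the price series here is synthetic rather than the actual stock data.

```python
# Outline of the comparison: local-level structural model vs ARIMA, scored by MAPE.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.structural import UnobservedComponents

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Synthetic "price" series standing in for the real stock data.
rng = np.random.default_rng(42)
prices = 100 + np.cumsum(rng.normal(0, 1, 250))
train, test = prices[:230], prices[230:]

ll_fit = UnobservedComponents(train, level="local level").fit(disp=False)
arima_fit = ARIMA(train, order=(1, 1, 1)).fit()

print("local level MAPE:", mape(test, ll_fit.forecast(len(test))))
print("ARIMA(1,1,1) MAPE:", mape(test, arima_fit.forecast(len(test))))
```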


Geophysics ◽  
2014 ◽  
Vol 79 (1) ◽  
pp. M1-M10 ◽  
Author(s):  
Leonardo Azevedo ◽  
Ruben Nunes ◽  
Pedro Correia ◽  
Amílcar Soares ◽  
Luis Guerreiro ◽  
...  

Due to the nature of seismic inversion problems, there are multiple possible solutions that can equally fit the observed seismic data while diverging from the real subsurface model. Consequently, it is important to assess how inverse-impedance models are converging toward the real subsurface model. For this purpose, we evaluated a new methodology that combines the multidimensional scaling (MDS) technique with an iterative geostatistical elastic seismic inversion algorithm. The geostatistical inversion algorithm inverted partial angle stacks directly for acoustic and elastic impedance (AI and EI) models. It was based on a genetic algorithm in which the model perturbation at each iteration was performed using stochastic sequential simulation. To assess the reliability and convergence of the inverted models at each step, the simulated models can be projected into a metric space computed by MDS. This projection allowed distinguishing similar from variable models and assessing the convergence of the inverted models toward the real impedance models. The geostatistical inversion results of a synthetic data set, in which the real AI and EI models are known, were plotted in this metric space along with the known impedance models. We applied the same principle to a real data set using a cross-validation technique. These examples revealed that MDS is a valuable tool for evaluating the convergence of the inverse methodology and the impedance model variability between iterations of the inversion process. In particular, for the geostatistical inversion algorithm we evaluated, it retrieves reliable impedance models while still producing a set of simulated models with considerable variability.
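
To illustrate the projection step, the Python sketch below embeds a set of inverted models together with the reference model in a 2D metric space using MDS, so that convergence toward the reference can be inspected. The distance measure and toy data are assumptions and do not reproduce the paper's examples.

```python
# Sketch of projecting inverted impedance models and the reference model into a
# 2D metric space with MDS to inspect convergence (illustrative only).
import numpy as np
from sklearn.manifold import MDS

def project_models(models, reference):
    """models: (n_models, n_cells) array; reference: (n_cells,) true impedance model."""
    stack = np.vstack([models, reference[None, :]])
    coords = MDS(n_components=2, dissimilarity="euclidean", random_state=0).fit_transform(stack)
    return coords[:-1], coords[-1]   # model coordinates, reference coordinates

# Toy example: later iterations drift toward the reference model.
rng = np.random.default_rng(0)
reference = rng.normal(size=200)
iterations = [reference + rng.normal(scale=s, size=(10, 200)) for s in (2.0, 1.0, 0.3)]
model_xy, ref_xy = project_models(np.vstack(iterations), reference)
```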

