COMPARE

2021 ◽  
Vol 14 (11) ◽  
pp. 2419-2431
Author(s):  
Tarique Siddiqui ◽  
Surajit Chaudhuri ◽  
Vivek Narasayya

Data analysis often involves comparing subsets of data across many dimensions to find unusual trends and patterns. While such comparisons can be expressed in SQL, the resulting queries tend to be complex to write and suffer from poor performance over large and high-dimensional datasets. In this paper, we propose a new logical operator COMPARE for relational databases that concisely captures the enumeration and comparison of subsets of data and greatly simplifies the expression of a large class of comparative queries. We extend the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a commercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that COMPARE results in a significant speedup over existing approaches, including physical plans generated by today's database systems, user-defined functions (UDFs), and middleware solutions that compare subsets outside the database.
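To illustrate why such comparative queries are cumbersome in plain SQL, the hedged sketch below (run through Python's sqlite3 module) expresses one simple comparison, which product's weekly revenue deviates most from a reference product's, using the self-join of per-group aggregates that today's SQL requires; the table and column names are illustrative, and the sketch does not use the paper's COMPARE syntax.

```python
# A hedged sketch (not the paper's COMPARE syntax): one comparative question
# written in plain SQL to show why such queries get verbose. The schema
# sales(product, week, revenue) is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, week INTEGER, revenue REAL);
    INSERT INTO sales VALUES
        ('A', 1, 100), ('A', 2, 120), ('B', 1, 90), ('B', 2, 200);
""")

# "Which product's weekly revenue deviates most from product A's?"
# Plain SQL needs a self-join of per-group aggregates; with many dimensions
# and many candidate subsets, this pattern is repeated once per comparison.
query = """
SELECT o.product, AVG(ABS(o.rev - r.rev)) AS avg_abs_diff
FROM   (SELECT product, week, SUM(revenue) AS rev
        FROM sales GROUP BY product, week) AS o
JOIN   (SELECT week, SUM(revenue) AS rev
        FROM sales WHERE product = 'A' GROUP BY week) AS r
       ON o.week = r.week
WHERE  o.product <> 'A'
GROUP  BY o.product
ORDER  BY avg_abs_diff DESC;
"""
for row in conn.execute(query):
    print(row)
```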

2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Bong-Jun Yi ◽  
Do-Gil Lee ◽  
Hae-Chang Rim

Current machine learning (ML) based automated essay scoring (AES) systems employ a large and varied set of features that have proven useful for improving scoring performance. However, the high-dimensional feature space is poorly represented when so many features are extracted from limited training data, which leads to poor performance and increased training time. In this paper, we analyze the effects of feature optimization techniques, including normalization, discretization, and feature selection, for different ML algorithms, taking into account both the size of the feature space and the performance of the AES. We show that appropriate feature optimization techniques can reduce the feature dimensionality, contributing to more efficient training and improved AES performance.
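A hedged sketch of the kind of feature-optimization pipeline the abstract describes, using scikit-learn's normalization, discretization, and feature-selection transformers; the synthetic feature matrix, score labels, and choice of estimator are illustrative and not the authors' exact setup.

```python
# A hedged sketch: normalization, discretization, and feature selection ahead
# of a scoring model. Data, dimensions, and estimator are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # 200 essays, 500 extracted features
y = rng.integers(0, 4, size=200)     # holistic scores 0..3

pipeline = Pipeline([
    ("normalize", StandardScaler()),                               # normalization
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal")),  # discretization
    ("select", SelectKBest(f_classif, k=50)),                      # feature selection
    ("score", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("features kept after selection:", pipeline.named_steps["select"].k)
```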


2018 ◽  
Vol 8 ◽  
pp. 263-269
Author(s):  
Grzegorz Dziewit ◽  
Jakub Korczyński ◽  
Maria Skublewska-Paszkowska

Comparing efficiency is not trivial because of the disparities between different database systems. This paper presents a methodology for comparing relational database systems with respect to the mean execution time of individual DML queries containing subqueries and table joins. The methodology can also be adapted to study efficiency within the database system itself (queries executed directly in the database engine), and it yields a statement of which database system performs better than another for the functionality required by an external application. In the article, the mean execution time of individual DML queries was analyzed. Two research hypotheses were put forward: "the Microsoft SQL Server database system needs less time to execute INSERT and UPDATE queries than the Oracle database" and "the Oracle database system needs less time to execute DML queries with binary data than SQL Server".
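The measurement methodology described above can be sketched as a small timing harness that repeats a DML statement and reports its mean execution time. The hedged Python sketch below works with any DB-API connection; sqlite3 stands in so it runs without a SQL Server or Oracle instance, and the table and statement are illustrative.

```python
# A hedged sketch of the measurement methodology: repeat a DML statement and
# report its mean wall-clock execution time. sqlite3 is a stand-in; connection
# details for SQL Server or Oracle are deliberately omitted.
import sqlite3
import statistics
import time

def mean_dml_time(conn, statement, params_list, repeats=5):
    """Mean execution time (seconds) of a DML statement over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.executemany(statement, params_list)
        conn.commit()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, payload TEXT)")
rows = [(i, "x" * 100) for i in range(1000)]
print("mean INSERT time:", mean_dml_time(conn, "INSERT INTO t VALUES (?, ?)", rows))
```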


Author(s):  
Karthikeyan Ramasamy ◽  
Prasad M. Deshpande

About three decades ago, when Codd (1970) invented the relational database model, it took the database world by storm. The enterprises that adopted it early gained a large competitive edge. The past two decades have witnessed tremendous growth of relational database systems, and today the relational model is by far the dominant data model and the foundation for leading DBMS products, including IBM DB2, Informix, Oracle, Sybase, and Microsoft SQL Server. Relational databases have become a multibillion-dollar industry.


2021 ◽  
Vol 68 (4) ◽  
pp. 1-25
Author(s):  
Thodoris Lykouris ◽  
Sergei Vassilvitskii

Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work, we develop a framework for augmenting online algorithms with a machine learned predictor to achieve competitive ratios that provably improve upon unconditional worst-case lower bounds when the predictor has low error. Our approach treats the predictor as a complete black box and is not dependent on its inner workings or the exact distribution of its errors. We apply this framework to the traditional caching problem: creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take into account the predictions and prove that this combined approach achieves a competitive ratio that both (i) decreases as the predictor's error decreases and (ii) is always capped by O(log k), which can be achieved without any assistance from the predictor. We complement our results with an empirical evaluation of our algorithm on real-world datasets and show that it performs well empirically even when using simple off-the-shelf predictions.
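A hedged, simplified sketch of prediction-assisted eviction in the spirit of the marking approach described above is shown below; it is not the authors' exact Predictive Marker algorithm, and the predictor interface (a callable returning a predicted next-request time per page) and all names are illustrative.

```python
# A hedged, simplified sketch of prediction-guided eviction: mark pages as they
# are accessed within a phase, and on eviction prefer the unmarked page whose
# predicted next request is furthest away. Not the authors' exact algorithm.
class PredictiveCache:
    def __init__(self, k, predict_next_use):
        self.k = k                                 # cache size
        self.cache = set()
        self.marked = set()
        self.predict_next_use = predict_next_use   # (page, now) -> predicted time

    def access(self, page, now):
        if page in self.cache:
            self.marked.add(page)
            return True                            # hit
        if len(self.cache) >= self.k:
            if not (self.cache - self.marked):
                self.marked.clear()                # all marked: start a new phase
            unmarked = self.cache - self.marked
            # Evict the unmarked page predicted to be needed furthest in the future.
            victim = max(unmarked, key=lambda p: self.predict_next_use(p, now))
            self.cache.remove(victim)
        self.cache.add(page)
        self.marked.add(page)
        return False                               # miss

# Usage with a toy predictor that always guesses "10 steps from now".
cache = PredictiveCache(k=2, predict_next_use=lambda page, now: now + 10)
trace = ["a", "b", "c", "a", "b"]
hits = sum(cache.access(p, t) for t, p in enumerate(trace))
print("hits:", hits)
```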


Author(s):  
Steffen Kläbe ◽  
Kai-Uwe Sattler ◽  
Stephan Baumann

Abstract Cloud data warehouse systems lower the barrier to access data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets due to a small set of values violating the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for automatic discovery of PatchIndex candidate columns and prove the performance benefit of using PatchIndexes in our evaluation.
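The core idea of an exception-tolerant ("approximate") constraint can be sketched as follows: record the row positions that violate a candidate constraint so the rest of the column can be treated as if the constraint held exactly. The class and method names below are illustrative and not the paper's PatchIndex API.

```python
# A hedged sketch of exception-tolerant constraints: keep an exception list of
# violating rows so queries can exploit the constraint on the remaining rows.
# Names and thresholds are illustrative, not the paper's PatchIndex API.
class ApproximateConstraint:
    def __init__(self, column, predicate, max_exception_ratio=0.01):
        # Row positions that violate the candidate constraint.
        self.exceptions = {i for i, v in enumerate(column) if not predicate(v)}
        # The constraint is accepted if only a small fraction of rows violate it.
        self.holds = len(self.exceptions) <= max_exception_ratio * max(len(column), 1)

    def split(self, row_ids):
        """Partition row ids into (satisfying, violating) for query processing."""
        satisfying = [r for r in row_ids if r not in self.exceptions]
        violating = [r for r in row_ids if r in self.exceptions]
        return satisfying, violating

# Example: a "not null" style constraint that 2 out of 1000 rows violate.
column = list(range(998)) + [None, None]
idx = ApproximateConstraint(column, predicate=lambda v: v is not None)
print(idx.holds, len(idx.exceptions))   # True 2
```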


Author(s):  
Jun Sun ◽  
Lingchen Kong ◽  
Mei Li

With the development of modern science and technology, it is easy to obtain many high-dimensional datasets that are related but distinct. Classical single-model analysis is unlikely to capture potential links between the different datasets. Recently, a collaborative regression model based on the least squares (LS) method was proposed for this problem. In this paper, we propose a robust collaborative regression based on least absolute deviation (LAD). We give statistical interpretations of LS-collaborative regression and LAD-collaborative regression. We then design an efficient symmetric Gauss–Seidel-based alternating direction method of multipliers algorithm to solve the two models, which enjoys global convergence and a Q-linear rate of convergence. Finally, we report numerical experiments that illustrate the efficiency of the proposed methods.
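A hedged sketch of the two objectives in generic form is given below, where X_1 and X_2 are the two related datasets, y is the shared response, and beta_1, beta_2 are the coefficient vectors; the weights w_1, w_2, w_3 and any sparsity penalties are illustrative placeholders rather than the authors' exact formulation.

```latex
% Generic sketch only: weights and penalty terms are illustrative, and the
% authors' exact formulation may differ.
\begin{aligned}
\text{LS-collaborative:}\quad
&\min_{\beta_1,\beta_2}\;
 \tfrac{w_1}{2}\,\|y - X_1\beta_1\|_2^2
 + \tfrac{w_2}{2}\,\|y - X_2\beta_2\|_2^2
 + \tfrac{w_3}{2}\,\|X_1\beta_1 - X_2\beta_2\|_2^2,\\[4pt]
\text{LAD-collaborative:}\quad
&\min_{\beta_1,\beta_2}\;
 w_1\,\|y - X_1\beta_1\|_1
 + w_2\,\|y - X_2\beta_2\|_1
 + w_3\,\|X_1\beta_1 - X_2\beta_2\|_1 .
\end{aligned}
```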


2019 ◽  
Vol 63 (8-9-10) ◽  
pp. 343-357
Author(s):  
Adam Kuspa ◽  
Gad Shaulsky

William Farnsworth Loomis studied the social amoeba Dictyostelium discoideum for more than fifty years as a professor of biology at the University of California, San Diego, USA. This biographical reflection describes Dr. Loomis' major scientific contributions to the field within a career arc that spanned the early days of molecular biology up to the present day, when the acquisition of high-dimensional datasets drives research. Dr. Loomis explored the genetic control of social amoeba development, delineated mechanisms of cell differentiation, and significantly advanced genetic and genomic technology for the field. The details of Dr. Loomis' multifaceted career are drawn from his published work, from an autobiographical essay that he wrote near the end of his career, and from extensive conversations between him and the two authors, many of which took place on the deck of his beachfront home in Del Mar, California.


2019 ◽  
Vol 19 (2) ◽  
pp. 117-132
Author(s):  
Fernando Almeida ◽  
Pedro Silva ◽  
Fernando Araújo

Abstract Databases provide an efficient way to store, retrieve and analyze data. The Oracle relational database is one of the most popular database management systems and is widely used across a variety of industries and businesses. It is therefore important to ensure that database access and data manipulation are optimized to reduce database response time. This paper analyzes the performance of the main optimization techniques (Forall, Returning, and Bulk Collect) that can be adopted for Oracle relational databases. The results show that adopting the Forall and Bulk Collect approaches brings significant benefits in terms of execution time. Furthermore, the growth rate of the average execution time is lower for Bulk Collect than for Forall. However, adopting the Returning approach does not bring statistically significant benefits.
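Since Forall and Bulk Collect are PL/SQL constructs, the hedged sketch below instead shows the analogous comparison at the Python driver level: python-oracledb's executemany() performs the same bulk binding that FORALL provides, contrasted with row-by-row execute() calls. The connection parameters and table are placeholders, and a reachable Oracle instance is assumed.

```python
# A hedged sketch of the timing comparison at the driver level, not the paper's
# PL/SQL code: executemany() (bulk bind, the driver analogue of FORALL) versus
# row-by-row execute(). Credentials, DSN, and the people(id, name) table are
# placeholders; an Oracle instance must be reachable for this to run.
import time
import oracledb

conn = oracledb.connect(user="demo", password="demo", dsn="localhost/XEPDB1")
cur = conn.cursor()
rows = [(i, f"name-{i}") for i in range(10_000)]

start = time.perf_counter()
for row in rows:                                   # row-by-row: one call per row
    cur.execute("INSERT INTO people (id, name) VALUES (:1, :2)", row)
conn.commit()
print("row-by-row:", time.perf_counter() - start)

start = time.perf_counter()
cur.executemany("INSERT INTO people (id, name) VALUES (:1, :2)", rows)  # bulk bind
conn.commit()
print("bulk (executemany):", time.perf_counter() - start)
```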

