COMPARE

2021 ◽  
Vol 14 (11) ◽  
pp. 2419-2431
Author(s):  
Tarique Siddiqui ◽  
Surajit Chaudhuri ◽  
Vivek Narasayya

Data analysis often involves comparing subsets of data across many dimensions to find unusual trends and patterns. While such comparisons can be expressed in SQL, the resulting queries tend to be complex to write and suffer from poor performance over large and high-dimensional datasets. In this paper, we propose a new logical operator COMPARE for relational databases that concisely captures the enumeration and comparison of subsets of data and greatly simplifies the expression of a large class of comparative queries. We extend the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a commercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that COMPARE results in a significant speedup over existing approaches, including physical plans generated by today's database systems, user-defined functions (UDFs), and middleware solutions that compare subsets outside the database.
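To illustrate why such comparative queries are cumbersome in plain SQL, the hedged sketch below (run through Python's sqlite3 module) expresses one simple comparison, which product's weekly revenue deviates most from a reference product's, using the self-join of per-group aggregates that today's SQL requires; the table and column names are illustrative, and the sketch does not use the paper's COMPARE syntax.

```python
# A hedged sketch (not the paper's COMPARE syntax): one comparative question
# written in plain SQL to show why such queries get verbose. The schema
# sales(product, week, revenue) is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, week INTEGER, revenue REAL);
    INSERT INTO sales VALUES
        ('A', 1, 100), ('A', 2, 120), ('B', 1, 90), ('B', 2, 200);
""")

# "Which product's weekly revenue deviates most from product A's?"
# Plain SQL needs a self-join of per-group aggregates; with many dimensions
# and many candidate subsets, this pattern is repeated once per comparison.
query = """
SELECT o.product, AVG(ABS(o.rev - r.rev)) AS avg_abs_diff
FROM   (SELECT product, week, SUM(revenue) AS rev
        FROM sales GROUP BY product, week) AS o
JOIN   (SELECT week, SUM(revenue) AS rev
        FROM sales WHERE product = 'A' GROUP BY week) AS r
       ON o.week = r.week
WHERE  o.product <> 'A'
GROUP  BY o.product
ORDER  BY avg_abs_diff DESC;
"""
for row in conn.execute(query):
    print(row)
```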

2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Bong-Jun Yi ◽  
Do-Gil Lee ◽  
Hae-Chang Rim

Current machine learning (ML) based automated essay scoring (AES) systems employ a large and varied set of features that have proven useful for improving scoring performance. However, the high-dimensional feature space is poorly represented when so many features are extracted from limited training data, which leads to poor performance and increased training time. In this paper, we analyze the effects of feature optimization techniques, including normalization, discretization, and feature selection, for different ML algorithms, taking into account both the size of the feature space and the performance of the AES. We show that appropriate feature optimization techniques can reduce the feature dimensionality, contributing to more efficient training and improved AES performance.
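A hedged sketch of the kind of feature-optimization pipeline the abstract describes, using scikit-learn's normalization, discretization, and feature-selection transformers; the synthetic feature matrix, score labels, and choice of estimator are illustrative and not the authors' exact setup.

```python
# A hedged sketch: normalization, discretization, and feature selection ahead
# of a scoring model. Data, dimensions, and estimator are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # 200 essays, 500 extracted features
y = rng.integers(0, 4, size=200)     # holistic scores 0..3

pipeline = Pipeline([
    ("normalize", StandardScaler()),                               # normalization
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal")),  # discretization
    ("select", SelectKBest(f_classif, k=50)),                      # feature selection
    ("score", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("features kept after selection:", pipeline.named_steps["select"].k)
```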


2018 ◽  
Vol 8 ◽  
pp. 263-269
Author(s):  
Grzegorz Dziewit ◽  
Jakub Korczyński ◽  
Maria Skublewska-Paszkowska

Comparing efficiency is not trivial because of the disparities between different database systems. This paper presents a methodology for comparing relational database systems with respect to the mean execution time of individual DML queries containing subqueries and table joins. The methodology can also be adapted to study efficiency within the database system itself (queries executed directly in the database engine), and it yields a statement of which database system performs better than another for the functionality required by an external application. In the article, the mean execution time of individual DML queries was analyzed. Two research hypotheses were put forward: "the Microsoft SQL Server database system needs less time to execute INSERT and UPDATE queries than the Oracle database" and "the Oracle database system needs less time to execute DML queries with binary data than SQL Server".
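The measurement methodology described above can be sketched as a small timing harness that repeats a DML statement and reports its mean execution time. The hedged Python sketch below works with any DB-API connection; sqlite3 stands in so it runs without a SQL Server or Oracle instance, and the table and statement are illustrative.

```python
# A hedged sketch of the measurement methodology: repeat a DML statement and
# report its mean wall-clock execution time. sqlite3 is a stand-in; connection
# details for SQL Server or Oracle are deliberately omitted.
import sqlite3
import statistics
import time

def mean_dml_time(conn, statement, params_list, repeats=5):
    """Mean execution time (seconds) of a DML statement over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.executemany(statement, params_list)
        conn.commit()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, payload TEXT)")
rows = [(i, "x" * 100) for i in range(1000)]
print("mean INSERT time:", mean_dml_time(conn, "INSERT INTO t VALUES (?, ?)", rows))
```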


Author(s):  
Karthikeyan Ramasamy ◽  
Prasad M. Deshpande

About three decades ago, when Codd (1970) invented the relational database model, it took the database world by storm. The enterprises that adopted it early gained a large competitive edge. The past two decades have witnessed tremendous growth of relational database systems, and today the relational model is by far the dominant data model and the foundation for leading DBMS products, including IBM DB2, Informix, Oracle, Sybase, and Microsoft SQL Server. Relational databases have become a multibillion-dollar industry.


2021 ◽  
Vol 68 (4) ◽  
pp. 1-25
Author(s):  
Thodoris Lykouris ◽  
Sergei Vassilvitskii

Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work, we develop a framework for augmenting online algorithms with a machine learned predictor to achieve competitive ratios that provably improve upon unconditional worst-case lower bounds when the predictor has low error. Our approach treats the predictor as a complete black box and is not dependent on its inner workings or the exact distribution of its errors. We apply this framework to the traditional caching problem: creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take into account the predictions and prove that this combined approach achieves a competitive ratio that both (i) decreases as the predictor's error decreases and (ii) is always capped by O(log k), which can be achieved without any assistance from the predictor. We complement our results with an empirical evaluation of our algorithm on real-world datasets and show that it performs well empirically even when using simple off-the-shelf predictions.
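A hedged, simplified sketch of prediction-assisted eviction in the spirit of the marking approach described above is shown below; it is not the authors' exact Predictive Marker algorithm, and the predictor interface (a callable returning a predicted next-request time per page) and all names are illustrative.

```python
# A hedged, simplified sketch of prediction-guided eviction: mark pages as they
# are accessed within a phase, and on eviction prefer the unmarked page whose
# predicted next request is furthest away. Not the authors' exact algorithm.
class PredictiveCache:
    def __init__(self, k, predict_next_use):
        self.k = k                                 # cache size
        self.cache = set()
        self.marked = set()
        self.predict_next_use = predict_next_use   # (page, now) -> predicted time

    def access(self, page, now):
        if page in self.cache:
            self.marked.add(page)
            return True                            # hit
        if len(self.cache) >= self.k:
            if not (self.cache - self.marked):
                self.marked.clear()                # all marked: start a new phase
            unmarked = self.cache - self.marked
            # Evict the unmarked page predicted to be needed furthest in the future.
            victim = max(unmarked, key=lambda p: self.predict_next_use(p, now))
            self.cache.remove(victim)
        self.cache.add(page)
        self.marked.add(page)
        return False                               # miss

# Usage with a toy predictor that always guesses "10 steps from now".
cache = PredictiveCache(k=2, predict_next_use=lambda page, now: now + 10)
trace = ["a", "b", "c", "a", "b"]
hits = sum(cache.access(p, t) for t, p in enumerate(trace))
print("hits:", hits)
```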


Author(s):  
Steffen Kläbe ◽  
Kai-Uwe Sattler ◽  
Stephan Baumann

Abstract Cloud data warehouse systems lower the barrier to access data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets due to a small set of values violating the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for automatic discovery of PatchIndex candidate columns and prove the performance benefit of using PatchIndexes in our evaluation.
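The core idea of an exception-tolerant ("approximate") constraint can be sketched as follows: record the row positions that violate a candidate constraint so the rest of the column can be treated as if the constraint held exactly. The class and method names below are illustrative and not the paper's PatchIndex API.

```python
# A hedged sketch of exception-tolerant constraints: keep an exception list of
# violating rows so queries can exploit the constraint on the remaining rows.
# Names and thresholds are illustrative, not the paper's PatchIndex API.
class ApproximateConstraint:
    def __init__(self, column, predicate, max_exception_ratio=0.01):
        # Row positions that violate the candidate constraint.
        self.exceptions = {i for i, v in enumerate(column) if not predicate(v)}
        # The constraint is accepted if only a small fraction of rows violate it.
        self.holds = len(self.exceptions) <= max_exception_ratio * max(len(column), 1)

    def split(self, row_ids):
        """Partition row ids into (satisfying, violating) for query processing."""
        satisfying = [r for r in row_ids if r not in self.exceptions]
        violating = [r for r in row_ids if r in self.exceptions]
        return satisfying, violating

# Example: a "not null" style constraint that 2 out of 1000 rows violate.
column = list(range(998)) + [None, None]
idx = ApproximateConstraint(column, predicate=lambda v: v is not None)
print(idx.holds, len(idx.exceptions))   # True 2
```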


Author(s):  
Jun Sun ◽  
Lingchen Kong ◽  
Mei Li

With the development of modern science and technology, it is easy to obtain many high-dimensional datasets that are related but distinct. Classical single-model analysis is unlikely to capture potential links between the different datasets. Recently, a collaborative regression model based on the least squares (LS) method was proposed for this problem. In this paper, we propose a robust collaborative regression based on least absolute deviation (LAD). We give statistical interpretations of LS-collaborative regression and LAD-collaborative regression. We then design an efficient symmetric Gauss–Seidel-based alternating direction method of multipliers algorithm to solve the two models, which enjoys global convergence and a Q-linear rate of convergence. Finally, we report numerical experiments that illustrate the efficiency of the proposed methods.
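A hedged sketch of the two objectives in generic form is given below, where X_1 and X_2 are the two related datasets, y is the shared response, and beta_1, beta_2 are the coefficient vectors; the weights w_1, w_2, w_3 and any sparsity penalties are illustrative placeholders rather than the authors' exact formulation.

```latex
% Generic sketch only: weights and penalty terms are illustrative, and the
% authors' exact formulation may differ.
\begin{aligned}
\text{LS-collaborative:}\quad
&\min_{\beta_1,\beta_2}\;
 \tfrac{w_1}{2}\,\|y - X_1\beta_1\|_2^2
 + \tfrac{w_2}{2}\,\|y - X_2\beta_2\|_2^2
 + \tfrac{w_3}{2}\,\|X_1\beta_1 - X_2\beta_2\|_2^2,\\[4pt]
\text{LAD-collaborative:}\quad
&\min_{\beta_1,\beta_2}\;
 w_1\,\|y - X_1\beta_1\|_1
 + w_2\,\|y - X_2\beta_2\|_1
 + w_3\,\|X_1\beta_1 - X_2\beta_2\|_1 .
\end{aligned}
```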


2019 ◽  
Vol 63 (8-9-10) ◽  
pp. 343-357
Author(s):  
Adam Kuspa ◽  
Gad Shaulsky

William Farnsworth Loomis studied the social amoeba Dictyostelium discoideum for more than fifty years as a professor of biology at the University of California, San Diego, USA. This biographical reflection describes Dr. Loomis' major scientific contributions to the field within a career arc that spanned the early days of molecular biology up to the present day, when the acquisition of high-dimensional datasets drives research. Dr. Loomis explored the genetic control of social amoeba development, delineated mechanisms of cell differentiation, and significantly advanced genetic and genomic technology for the field. The details of Dr. Loomis' multifaceted career are drawn from his published work, from an autobiographical essay that he wrote near the end of his career, and from extensive conversations between him and the two authors, many of which took place on the deck of his beachfront home in Del Mar, California.


2019 ◽  
Vol 19 (2) ◽  
pp. 117-132
Author(s):  
Fernando Almeida ◽  
Pedro Silva ◽  
Fernando Araújo

Abstract Databases provide an efficient way to store, retrieve and analyze data. The Oracle relational database is one of the most popular database management systems and is widely used across a variety of industries and businesses. It is therefore important to ensure that database access and data manipulation are optimized to reduce database response time. This paper analyzes the performance of the main optimization techniques (Forall, Returning, and Bulk Collect) that can be adopted for Oracle relational databases. The results show that adopting the Forall and Bulk Collect approaches brings significant benefits in terms of execution time. Furthermore, the growth rate of the average execution time is lower for Bulk Collect than for Forall. However, adopting the Returning approach does not bring statistically significant benefits.
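Since Forall and Bulk Collect are PL/SQL constructs, the hedged sketch below instead shows the analogous comparison at the Python driver level: python-oracledb's executemany() performs the same bulk binding that FORALL provides, contrasted with row-by-row execute() calls. The connection parameters and table are placeholders, and a reachable Oracle instance is assumed.

```python
# A hedged sketch of the timing comparison at the driver level, not the paper's
# PL/SQL code: executemany() (bulk bind, the driver analogue of FORALL) versus
# row-by-row execute(). Credentials, DSN, and the people(id, name) table are
# placeholders; an Oracle instance must be reachable for this to run.
import time
import oracledb

conn = oracledb.connect(user="demo", password="demo", dsn="localhost/XEPDB1")
cur = conn.cursor()
rows = [(i, f"name-{i}") for i in range(10_000)]

start = time.perf_counter()
for row in rows:                                   # row-by-row: one call per row
    cur.execute("INSERT INTO people (id, name) VALUES (:1, :2)", row)
conn.commit()
print("row-by-row:", time.perf_counter() - start)

start = time.perf_counter()
cur.executemany("INSERT INTO people (id, name) VALUES (:1, :2)", rows)  # bulk bind
conn.commit()
print("bulk (executemany):", time.perf_counter() - start)
```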

