PatchIndex: exploiting approximate constraints in distributed databases

Author(s): Steffen Kläbe, Kai-Uwe Sattler, Stephan Baumann

Abstract: Cloud data warehouse systems lower the barrier to accessing data analytics. These applications often lack a database administrator and integrate data from various sources, which can lead to data that does not satisfy strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets because a small set of values violates the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define such approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for the automatic discovery of PatchIndex candidate columns and demonstrate the performance benefit of using PatchIndexes in our evaluation.
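The abstract gives no code, but a minimal sketch may help fix the idea. Assuming an approximate uniqueness constraint and invented names (the paper's actual interfaces are not shown here), a PatchIndex-style structure could record the violating rows as "patches":

```python
# Hedged sketch of the PatchIndex idea for an approximate uniqueness
# constraint: collect the rows that violate uniqueness as "patches" so
# the optimizer can treat all remaining rows as satisfying the
# constraint exactly. The threshold and names are illustrative.
def build_patch_index(column, max_exception_ratio=0.01):
    """Return the set of row ids violating uniqueness,
    or None if the column is not even approximately unique."""
    seen, patches = set(), set()
    for row_id, value in enumerate(column):
        if value in seen:
            patches.add(row_id)  # every repeated occurrence is an exception
        else:
            seen.add(value)
    if len(patches) > max_exception_ratio * len(column):
        return None  # too many exceptions: no approximate constraint
    return patches

# One duplicate among six rows -> a single patch row (id 3).
print(build_patch_index([1, 2, 3, 2, 4, 5], max_exception_ratio=0.2))  # {3}
```

A query can then apply constraint-based optimizations to the non-patch rows and handle the small patch set separately.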

1988, Vol. 11 (3), pp. 241-265
Author(s): W. Marek, C. Rauszer

In this paper, we address the problem of query optimization in distributed databases. We show that horizontal partitions of databases, generated by products of equivalence relations, induce optimization techniques for the basic database operations (i.e., the selection, projection, and join operators). In the case of selection, our method restricts the number of blocks to be searched and then simplifies the selection formula at each block. For the natural join operation, we propose an algorithm that reduces the join computation to operations on the fragments. Proofs of the correctness of our algorithms are also included.
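As a rough illustration of the selection case (with invented names and an in-memory stand-in for distributed blocks), partitioning a table by the equivalence classes of an attribute lets a selection on that attribute skip every non-matching block:

```python
# Hedged illustration of partition pruning: if a table is horizontally
# partitioned by the equivalence classes of an attribute (here: region),
# a selection on that attribute only scans the matching block and then
# applies the residual, simplified predicate inside it.
from collections import defaultdict

def partition(rows, key):
    """Horizontal partition: one block per equivalence class of key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    return blocks

def select(blocks, key_value, predicate):
    """Search only the block for key_value; apply the residual predicate."""
    return [r for r in blocks.get(key_value, []) if predicate(r)]

rows = [{"region": "EU", "qty": 5}, {"region": "US", "qty": 9},
        {"region": "EU", "qty": 12}]
blocks = partition(rows, "region")
# Selection "region = 'EU' AND qty > 10" scans one block, not all rows.
print(select(blocks, "EU", lambda r: r["qty"] > 10))
```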


Author(s): Hussein Issa, Amer Qasim, Ghaleb El Refae, Alexander J. Sannella

2004, Vol. 165 (1-2), pp. 103-127
Author(s): David Taniar, J. Wenny Rahayu

Author(s): Dennis T. Kennedy, Dennis M. Crossen, Kathryn A. Szabat

Big Data Analytics has changed the way organizations make decisions, manage business processes, and create new products and services. Business analytics is the use of data, information technology, statistical analysis, and quantitative methods and models to support organizational decision making and problem solving. The main categories of business analytics are descriptive, predictive, and prescriptive analytics. Big Data is data that exceeds the processing capacity of conventional database systems and is typically characterized by three dimensions known as the Three V's: Volume, Variety, and Velocity. Big Data brings big challenges: it has influenced not only the analytics that are used but also the technologies and the people who use them. At the same time, it presents opportunities. Those who embrace Big Data and effective Big Data Analytics as a business imperative can gain a competitive advantage.


Author(s): Carlos Goncalves, Luis Assuncao, Jose C. Cunha

Data analytics applications handle large data sets subject to multiple processing phases, some of which can execute in parallel on clusters, grids, or clouds. Such applications can benefit from the MapReduce model, which only requires the end-user to define the application algorithms for input data processing and the map and reduce functions, but this creates a need to install and configure specific frameworks such as Apache Hadoop, or Elastic MapReduce in the Amazon cloud. To provide more flexibility in defining and adjusting application configurations, as well as in specifying the composition of application phases and their orchestration, the authors describe an approach for supporting MapReduce stages as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). The authors discuss how a text mining application is represented as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. Access to intermediate data produced during the MapReduce computations is supported by a data sharing abstraction. The authors describe two implementations of this abstraction, one based on a shared tuple space and another based on an in-memory distributed key/value store. They describe the implementation of the framework, a set of developed tools, and their experiments executing the text mining algorithm over multiple Amazon EC2 (Elastic Compute Cloud) instances, and report the speed-up and size-up results obtained with up to 20 EC2 instances and for different corpus sizes of up to 97 million words.
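For readers unfamiliar with the model, a minimal word-count sketch shows the kind of map and reduce functions an end-user supplies; the AWARD interfaces themselves are not reproduced here, and all names below are illustrative:

```python
# Minimal, self-contained word-count example of user-supplied map and
# reduce functions, run sequentially in one process. A framework such
# as AWARD or Hadoop would distribute the map calls over corpus splits.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

corpus = ["big data analytics", "data analytics in the cloud"]
# In a distributed run each map task handles one corpus split; here we
# simply chain the per-document outputs together before reducing.
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in corpus))
print(counts)  # e.g. {'big': 1, 'data': 2, 'analytics': 2, ...}
```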


2018, Vol. 10 (1)
Author(s): Diep T Vu, Duc H Bui, Giang T Le, Hai K Nguyen, Duong C Thanh, ...

Objective: To use Epi Info Cloud Data Analytics (ECDA) to improve the management, quality, and utilization of Vietnam's national HIV surveillance data.

Introduction: HIV surveillance in Vietnam comprises different surveillance systems, including the HIV sentinel surveillance (HSS). The HSS is an annual, multi-site survey to monitor HIV sero-prevalence and risk behaviors among key populations. In 2015, the Vietnam Administration of HIV/AIDS Control (VAAC) installed ECDA, a free web-based analytical and visualization program developed by the Centers for Disease Control and Prevention (CDC) (1), to serve as an information management system for HIV surveillance. Until 2016, provincial surveys, recorded on paper, were computerized and submitted to VAAC, which was responsible for merging the individual provincial datasets into a national HSS dataset. Feedback on HSS issues was provided to provinces 3 to 6 months after survey conclusion. With the use of tablets for field data collection in 2017, provincial survey data were recorded electronically and transferred to VAAC at the end of each survey day, enabling the national 2017 HSS dataset to be updated on a daily basis. Once the national HSS dataset was available on VAAC's server, ECDA enabled wider access and prompt analysis for staff at all levels (figure 1). This abstract describes the use of ECDA, together with tablet-based data collection, to improve the management, quality, and use of surveillance data.

Methods: After the installation of ECDA on VAAC's server in 2015, investments were made at all levels of the surveillance systems to build the capacity to operate and maintain ECDA. These included trainings on programming, administration, and utilization of ECDA at the central level; creating a centralized database by abstracting and linking different surveillance datasets; developing analysis templates to support province-specific reports; and trainings for provincial staff on access and use of ECDA. One hundred and eighty-five ECDA analyst accounts, authorized for submission, viewing, and analysis of data, were created for surveillance staff in 63 provinces and 7 agencies. Six administrator accounts, created for users at the central and regional levels, were authorized for editing data and managing user accounts. In 2017, further ECDA activities were conducted to: (i) develop analysis dashboards to track the progress and data quality of HSS provincial surveys; (ii) facilitate frequent data reviews at the central and regional levels; and (iii) provide feedback to provinces on survey issues, including sample selection.

Results: Since 2015, separate national datasets, including the HSS, HIV case reports, and HIV routine program reports, have been systematically cleaned and merged into a centralized national database, which is centrally stored and regularly backed up. Access to the national database was granted to surveillance staff in all 63 provinces through 185 designated ECDA accounts. During the 2017 HSS surveys, 70 ECDA users in 20 HSS provinces actively managed and used the HSS data. Twelve weekly reviews of HSS provincial data were conducted at the national level throughout the 2017 HSS survey. Ninety percent of provinces received feedback on their survey data as early as the first week of field data collection. The national 2017 HSS dataset and its analysis were available immediately after the completion of the last provincial survey, about 3 to 6 months earlier than in previous years. More importantly, the fresh results of the 2017 HSS survey were available and used for the 2018 Vietnam national HIV planning cycle (table 1).

Conclusions: ECDA is a quick, relevant, free program for improving the management and analysis of HIV surveillance data. With ECDA, it is easy to generate and modify analysis dashboards, which enhances the utilization of surveillance data. The successful administration and use of ECDA during the 2017 HSS survey is positive evidence for the Ministry of Health to consider institutionalizing the program in Vietnam's surveillance systems.


2019, Vol. 36 (06), pp. 1940012
Author(s): Joost Berkhout, Bernd Heidergott, Henry Lam, Yijie Peng

In the past decades, we have witnessed a paradigm shift from data scarcity to data abundance. Big data and data analytics have fundamentally reshaped many areas, including operations research. In this paper, we discuss how to integrate data with model-based analysis in a controlled way. Specifically, we consider techniques for quantifying input uncertainty and for decision making under input uncertainty. Numerical experiments demonstrate that different approaches to decision making may lead to significantly different outcomes in a maintenance problem.
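As a hedged illustration of one standard technique in this space (not necessarily the paper's method), a nonparametric bootstrap of the input data shows how input uncertainty propagates to a maintenance decision; the Weibull lifetime model, costs, and names below are invented:

```python
# Sketch of decision making under input uncertainty: bootstrap the
# observed lifetimes, re-fit the input model on each resample, and
# watch how the model-based decision (a preventive replacement age)
# moves across resamples. All numbers are illustrative.
import math
import random
import statistics

def optimal_replacement_age(scale, shape=2.0, c_preventive=1.0, c_failure=5.0):
    """Minimize the classic age-replacement cost rate
    c(T) = (c_p*S(T) + c_f*F(T)) / E[min(lifetime, T)]
    for Weibull lifetimes with the given scale and shape."""
    survival = lambda t: math.exp(-((t / scale) ** shape))
    def cost_rate(T):
        dt = T / 200
        uptime = sum(survival(i * dt) for i in range(200)) * dt  # E[min(X,T)]
        s = survival(T)
        return (c_preventive * s + c_failure * (1 - s)) / uptime
    return min((0.5 * i for i in range(1, 100)), key=cost_rate)

random.seed(1)
data = [random.weibullvariate(20.0, 2.0) for _ in range(30)]  # observed lifetimes

decisions = []
for _ in range(200):  # nonparametric bootstrap of the input data
    resample = random.choices(data, k=len(data))
    scale_hat = statistics.mean(resample) / math.gamma(1.5)  # moment fit, shape fixed
    decisions.append(optimal_replacement_age(scale_hat))
print(min(decisions), max(decisions))  # spread = input uncertainty at the decision level
```

The spread of replacement ages across resamples makes the input uncertainty visible where it matters, in the decision itself; plugging in a single point estimate would hide it.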

