PatchIndex: exploiting approximate constraints in distributed databases

Author(s): Steffen Kläbe, Kai-Uwe Sattler, Stephan Baumann

Abstract: Cloud data warehouse systems lower the barrier to accessing data analytics. These applications often lack a database administrator and integrate data from various sources, which can lead to data that does not satisfy strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets because a small set of values violates the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define such approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for the automatic discovery of PatchIndex candidate columns and demonstrate the performance benefit of using PatchIndexes in our evaluation.
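The abstract gives no code, but a minimal sketch may help fix the idea. Assuming an approximate uniqueness constraint and invented names (the paper's actual interfaces are not shown here), a PatchIndex-style structure could record the violating rows as "patches":

```python
# Hedged sketch of the PatchIndex idea for an approximate uniqueness
# constraint: collect the rows that violate uniqueness as "patches" so
# the optimizer can treat all remaining rows as satisfying the
# constraint exactly. The threshold and names are illustrative.
def build_patch_index(column, max_exception_ratio=0.01):
    """Return the set of row ids violating uniqueness,
    or None if the column is not even approximately unique."""
    seen, patches = set(), set()
    for row_id, value in enumerate(column):
        if value in seen:
            patches.add(row_id)  # every repeated occurrence is an exception
        else:
            seen.add(value)
    if len(patches) > max_exception_ratio * len(column):
        return None  # too many exceptions: no approximate constraint
    return patches

# One duplicate among six rows -> a single patch row (id 3).
print(build_patch_index([1, 2, 3, 2, 4, 5], max_exception_ratio=0.2))  # {3}
```

A query can then apply constraint-based optimizations to the non-patch rows and handle the small patch set separately.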

1988, Vol. 11 (3), pp. 241-265
Author(s): W. Marek, C. Rauszer

In this paper, we address the problem of query optimization in distributed databases. We show that horizontal partitions of databases, generated by products of equivalence relations, induce optimization techniques for the basic database operations (i.e., the selection, projection, and join operators). In the case of selection, our method restricts the number of blocks to be searched and then simplifies the selection formula at each block. For the natural join operation, we propose an algorithm that reduces the join computation to operations on the fragments. Proofs of the correctness of our algorithms are also included.
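As a rough illustration of the selection case (with invented names and an in-memory stand-in for distributed blocks), partitioning a table by the equivalence classes of an attribute lets a selection on that attribute skip every non-matching block:

```python
# Hedged illustration of partition pruning: if a table is horizontally
# partitioned by the equivalence classes of an attribute (here: region),
# a selection on that attribute only scans the matching block and then
# applies the residual, simplified predicate inside it.
from collections import defaultdict

def partition(rows, key):
    """Horizontal partition: one block per equivalence class of key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    return blocks

def select(blocks, key_value, predicate):
    """Search only the block for key_value; apply the residual predicate."""
    return [r for r in blocks.get(key_value, []) if predicate(r)]

rows = [{"region": "EU", "qty": 5}, {"region": "US", "qty": 9},
        {"region": "EU", "qty": 12}]
blocks = partition(rows, "region")
# Selection "region = 'EU' AND qty > 10" scans one block, not all rows.
print(select(blocks, "EU", lambda r: r["qty"] > 10))
```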


Author(s): Hussein Issa, Amer Qasim, Ghaleb El Refae, Alexander J. Sannella

2004, Vol. 165 (1-2), pp. 103-127
Author(s): David Taniar, J. Wenny Rahayu

Author(s): Dennis T. Kennedy, Dennis M. Crossen, Kathryn A. Szabat

Big Data Analytics has changed the way organizations make decisions, manage business processes, and create new products and services. Business analytics is the use of data, information technology, statistical analysis, and quantitative methods and models to support organizational decision making and problem solving. The main categories of business analytics are descriptive, predictive, and prescriptive analytics. Big Data is data that exceeds the processing capacity of conventional database systems and is typically characterized by three dimensions known as the Three V's: Volume, Variety, and Velocity. Big Data brings big challenges: it has influenced not only the analytics that are used but also the technologies and the people who use them. At the same time, it presents opportunities. Those who embrace Big Data and effective Big Data Analytics as a business imperative can gain a competitive advantage.


Author(s): Carlos Goncalves, Luis Assuncao, Jose C. Cunha

Data analytics applications handle large data sets subject to multiple processing phases, some of which can execute in parallel on clusters, grids, or clouds. Such applications can benefit from the MapReduce model, which only requires the end-user to define the application algorithms for input data processing and the map and reduce functions, but this creates a need to install and configure specific frameworks such as Apache Hadoop, or Elastic MapReduce in the Amazon cloud. To provide more flexibility in defining and adjusting application configurations, as well as in specifying the composition of application phases and their orchestration, the authors describe an approach for supporting MapReduce stages as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). The authors discuss how a text mining application is represented as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. Access to intermediate data produced during the MapReduce computations is supported by a data sharing abstraction. The authors describe two implementations of this abstraction, one based on a shared tuple space and another based on an in-memory distributed key/value store. They describe the implementation of the framework, a set of developed tools, and their experiments executing the text mining algorithm over multiple Amazon EC2 (Elastic Compute Cloud) instances, and report the speed-up and size-up results obtained with up to 20 EC2 instances and for different corpus sizes of up to 97 million words.
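For readers unfamiliar with the model, a minimal word-count sketch shows the kind of map and reduce functions an end-user supplies; the AWARD interfaces themselves are not reproduced here, and all names below are illustrative:

```python
# Minimal, self-contained word-count example of user-supplied map and
# reduce functions, run sequentially in one process. A framework such
# as AWARD or Hadoop would distribute the map calls over corpus splits.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

corpus = ["big data analytics", "data analytics in the cloud"]
# In a distributed run each map task handles one corpus split; here we
# simply chain the per-document outputs together before reducing.
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in corpus))
print(counts)  # e.g. {'big': 1, 'data': 2, 'analytics': 2, ...}
```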


2018, Vol. 10 (1)
Author(s): Diep T Vu, Duc H Bui, Giang T Le, Hai K Nguyen, Duong C Thanh, ...

Objective: To use Epi Info Cloud Data Analytics (ECDA) to improve the management, quality, and utilization of Vietnam's national HIV surveillance data.

Introduction: HIV surveillance in Vietnam comprises different surveillance systems, including the HIV sentinel surveillance (HSS). The HSS is an annual, multi-site survey to monitor HIV sero-prevalence and risk behaviors among key populations. In 2015, the Vietnam Administration of HIV/AIDS Control (VAAC) installed ECDA, a free web-based analytical and visualization program developed by the Centers for Disease Control and Prevention (CDC) (1), to serve as an information management system for HIV surveillance. Until 2016, provincial surveys, recorded on paper, were computerized and submitted to VAAC, which was responsible for merging the individual provincial datasets into a national HSS dataset. Feedback on HSS issues was provided to provinces 3 to 6 months after survey conclusion. With the use of tablets for field data collection in 2017, provincial survey data were recorded electronically and transferred to VAAC at the end of each survey day, enabling the national 2017 HSS dataset to be updated on a daily basis. Once the national HSS dataset was available on VAAC's server, ECDA enabled wider access and prompt analysis for staff at all levels (figure 1). This abstract describes the use of ECDA, together with tablet-based data collection, to improve the management, quality, and use of surveillance data.

Methods: After the installation of ECDA on VAAC's server in 2015, investments were made at all levels of the surveillance systems to build the capacity to operate and maintain ECDA. These included trainings on programming, administration, and utilization of ECDA at the central level; creating a centralized database by abstracting and linking different surveillance datasets; developing analysis templates to support province-specific reports; and trainings for provincial staff on access and use of ECDA. One hundred and eighty-five ECDA analyst accounts, authorized for submission, viewing, and analysis of data, were created for surveillance staff in 63 provinces and 7 agencies. Six administrator accounts, created for users at the central and regional levels, were authorized for editing data and managing user accounts. In 2017, further ECDA activities were conducted to: (i) develop analysis dashboards to track the progress and data quality of HSS provincial surveys; (ii) facilitate frequent data reviews at the central and regional levels; and (iii) provide feedback to provinces on survey issues, including sample selection.

Results: Since 2015, separate national datasets, including the HSS, HIV case reports, and HIV routine program reports, have been systematically cleaned and merged into a centralized national database, which is centrally stored and regularly backed up. Access to the national database was granted to surveillance staff in all 63 provinces through 185 designated ECDA accounts. During the 2017 HSS surveys, 70 ECDA users in 20 HSS provinces actively managed and used the HSS data. Twelve weekly reviews of HSS provincial data were conducted at the national level throughout the 2017 HSS survey. Ninety percent of provinces received feedback on their survey data as early as the first week of field data collection. The national 2017 HSS dataset and its analysis were available immediately after the completion of the last provincial survey, about 3 to 6 months earlier than in previous years. More importantly, the fresh results of the 2017 HSS survey were available and used for the 2018 Vietnam national HIV planning cycle (table 1).

Conclusions: ECDA is a quick, relevant, free program for improving the management and analysis of HIV surveillance data. With ECDA, it is easy to generate and modify analysis dashboards, which enhances the utilization of surveillance data. The successful administration and use of ECDA during the 2017 HSS survey is positive evidence for the Ministry of Health to consider institutionalizing the program in Vietnam's surveillance systems.


2019, Vol. 36 (06), pp. 1940012
Author(s): Joost Berkhout, Bernd Heidergott, Henry Lam, Yijie Peng

In the past decades, we have witnessed a paradigm shift from data scarcity to data abundance. Big data and data analytics have fundamentally reshaped many areas, including operations research. In this paper, we discuss how to integrate data with model-based analysis in a controlled way. Specifically, we consider techniques for quantifying input uncertainty and for decision making under input uncertainty. Numerical experiments demonstrate that different approaches to decision making may lead to significantly different outcomes in a maintenance problem.
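As a hedged illustration of one standard technique in this space (not necessarily the paper's method), a nonparametric bootstrap of the input data shows how input uncertainty propagates to a maintenance decision; the Weibull lifetime model, costs, and names below are invented:

```python
# Sketch of decision making under input uncertainty: bootstrap the
# observed lifetimes, re-fit the input model on each resample, and
# watch how the model-based decision (a preventive replacement age)
# moves across resamples. All numbers are illustrative.
import math
import random
import statistics

def optimal_replacement_age(scale, shape=2.0, c_preventive=1.0, c_failure=5.0):
    """Minimize the classic age-replacement cost rate
    c(T) = (c_p*S(T) + c_f*F(T)) / E[min(lifetime, T)]
    for Weibull lifetimes with the given scale and shape."""
    survival = lambda t: math.exp(-((t / scale) ** shape))
    def cost_rate(T):
        dt = T / 200
        uptime = sum(survival(i * dt) for i in range(200)) * dt  # E[min(X,T)]
        s = survival(T)
        return (c_preventive * s + c_failure * (1 - s)) / uptime
    return min((0.5 * i for i in range(1, 100)), key=cost_rate)

random.seed(1)
data = [random.weibullvariate(20.0, 2.0) for _ in range(30)]  # observed lifetimes

decisions = []
for _ in range(200):  # nonparametric bootstrap of the input data
    resample = random.choices(data, k=len(data))
    scale_hat = statistics.mean(resample) / math.gamma(1.5)  # moment fit, shape fixed
    decisions.append(optimal_replacement_age(scale_hat))
print(min(decisions), max(decisions))  # spread = input uncertainty at the decision level
```

The spread of replacement ages across resamples makes the input uncertainty visible where it matters, in the decision itself; plugging in a single point estimate would hide it.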

