Discovery and Application of Data Dependencies

Mapping Intimacies ◽

10.5753/ctd.2021.15749 ◽

2021 ◽

Author(s):

Eduardo Henrique Monteiro Pena ◽

Eduardo Cunha De Almeida

Keyword(s):

Experimental Evaluation ◽

State Of The Art ◽

Expressive Power ◽

Data Consistency ◽

Functional Dependencies ◽

Business Rules ◽

Data Dependencies ◽

Query Response Time ◽

Discovery Algorithms ◽

Selection Of

This work makes contributions that reach central problems in connection with data dependencies. The first problem regards the discovery of dependencies of high expressive power. We introduce an efficient algorithm for the discovery of denial constraints: a type of dependency that has enough expressive power to generalize other important types of dependencies and to express complex business rules. The second problem concerns the application of dependencies for improving data consistency. We present a modification to traditional dependency discovery approaches that enables the dependency discovery algorithms to return reliable results even if they run on data containing some inconsistent records. Also, we present a system for detecting violations of dependencies efficiently. Our extensive experimental evaluation shows that our system is up to three orders of magnitude faster than state-of-the-art solutions, especially for larger datasets and massive numbers of dependency violations. The last contribution in this work regards the application of dependencies in query optimization. We present a system for the automatic discovery and selection of functional dependencies. Our experimental evaluation shows that our system selects relevant functional dependencies that help reduce the overall query response time for various types of query workloads.
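
To make the notion of a dependency violation concrete, the following is a minimal sketch of checking one functional dependency by grouping rows. It is not the system described in the abstract; the function name `fd_violations`, the attribute names, and the sample rows are hypothetical. The functional dependency zip -> city corresponds to the denial constraint "no two tuples agree on zip and differ on city".

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the groups of rows that agree on `lhs` but disagree on `rhs`.

    rows: list of dicts; lhs: list of attribute names; rhs: one attribute name.
    (Hypothetical helper for illustration only.)
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        # More than one distinct right-hand-side value means the FD is violated.
        if len({m[rhs] for m in members}) > 1:
            violations[key] = members
    return violations


# Hypothetical sample data: the third row contradicts zip -> city.
rows = [
    {"zip": "80000", "city": "Curitiba"},
    {"zip": "80000", "city": "Curitiba"},
    {"zip": "80000", "city": "Londrina"},
    {"zip": "86000", "city": "Londrina"},
]
bad = fd_violations(rows, ["zip"], "city")
```

Grouping needs a single pass over the data, whereas a naive pairwise comparison of tuples is quadratic; general denial constraints with inequality predicates need more elaborate techniques than this grouping step.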

Approximation Measures for Conditional Functional Dependencies Using Stripped Conditional Partitions

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v7i3.pp1385-1397 ◽

2017 ◽

Vol 7 (3) ◽

pp. 1385

Author(s):

Anh Duy Tran ◽

Somjit Arch-int ◽

Ngamnij Arch-int

Keyword(s):

Real Data ◽

Equivalence Classes ◽

Data Sets ◽

Functional Dependencies ◽

Quality Of Data ◽

Data Dependencies ◽

Incomplete Knowledge ◽

Discovery Algorithms ◽

Knowledge Granularity

Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures have significant importance for data dependencies in data mining. To adapt to exceptions in real data, the measures are used to relax the strictness of CFDs for more generalized dependencies, called approximate conditional functional dependencies (ACFDs). This paper analyzes the weaknesses of dependency degree, confidence and conviction measures for general CFDs (constant and variable CFDs). A new measure for general CFDs based on incomplete knowledge granularity is proposed to measure the approximation of these dependencies as well as the distribution of data tuples into the conditional equivalence classes. Finally, the effectiveness of stripped conditional partitions and of this new measure is evaluated on synthetic and real data sets. These results are important to the study of the theory of approximate dependencies and to the improvement of discovery algorithms for CFDs and ACFDs.
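
As an illustration of one of the measures named above, here is a minimal sketch of a commonly used confidence measure for a CFD: among the tuples matching the constant condition, the largest fraction that can be kept so that the embedded dependency holds exactly. This is an assumed textbook-style definition, not the new granularity-based measure proposed in the paper; the function name, attribute names, and data are hypothetical.

```python
from collections import Counter, defaultdict


def cfd_confidence(rows, condition, lhs, rhs):
    """Confidence of the CFD (condition: lhs -> rhs) on `rows`.

    condition: dict of attribute -> required constant value.
    Returns None when no row matches the condition.
    """
    matching = [r for r in rows if all(r[a] == v for a, v in condition.items())]
    if not matching:
        return None

    # One equivalence class per lhs value; count the rhs values inside it.
    classes = defaultdict(Counter)
    for r in matching:
        classes[tuple(r[a] for a in lhs)][r[rhs]] += 1

    # Keeping only the most frequent rhs value of each class satisfies the FD.
    kept = sum(max(counts.values()) for counts in classes.values())
    return kept / len(matching)


# Hypothetical data: within country = "UK", zip -> street has one exception.
rows = [
    {"country": "UK", "zip": "A", "street": "X"},
    {"country": "UK", "zip": "A", "street": "X"},
    {"country": "UK", "zip": "A", "street": "Y"},
    {"country": "UK", "zip": "B", "street": "Z"},
    {"country": "US", "zip": "A", "street": "Q"},
]
conf = cfd_confidence(rows, {"country": "UK"}, ["zip"], "street")
```

Here three of the four matching tuples can be kept, so the confidence is 0.75. Classes with a single tuple can never lower the value, which is the intuition behind stripping singleton classes from partitions.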

Experimental Evaluation and Selection of Data Consistency Mechanisms for Hard Real-Time Applications on Multicore Platforms

IEEE Transactions on Industrial Informatics ◽

10.1109/tii.2013.2290585 ◽

2014 ◽

Vol 10 (2) ◽

pp. 903-918 ◽

Cited By ~ 12

Author(s):

Gang Han ◽

Haibo Zeng ◽

Marco Di Natale ◽

Xue Liu ◽

Wenhua Dou

Keyword(s):

Real Time ◽

Experimental Evaluation ◽

Data Consistency ◽

Multicore Platforms ◽

Real Time Applications ◽

Hard Real Time ◽

Selection Of

Modified algorithm for fast bandwidth selection for kernel estimates of multidimensional probability densities

Izmeritel`naya Tekhnika ◽

10.32446/0368-1025it.2020-11-9-13 ◽

2020 ◽

pp. 9-13

Author(s):

A. V. Lapko ◽

V. A. Lapko

Keyword(s):

Probability Density ◽

Optimal Parameter ◽

Random Variables ◽

Kernel Functions ◽

Independent Random Variables ◽

Functional Dependencies ◽

Approximation Properties ◽

Probability Density Estimation ◽

Multidimensional Probability ◽

Selection Of

An original technique is justified for fast bandwidth selection of kernel functions in a nonparametric Rosenblatt–Parzen-type estimate of a multidimensional probability density. Compared with traditional approaches, the proposed method significantly increases the computational efficiency of the optimization procedure for kernel probability density estimates on large volumes of statistical data. The approach is based on an analysis of the formula for the optimal bandwidths of a multidimensional kernel probability density estimate. Dependencies are found between the nonlinear functional of the probability density and its derivatives, up to and including the second order, and the antikurtosis coefficients of random variables. The bandwidth for each random variable is represented as the product of an undefined parameter and the variable's mean square deviation. The influence of the error in restoring the established functional dependencies on the approximation properties of the kernel probability density estimate is determined. The obtained results are implemented as a method for the synthesis and analysis of fast bandwidth selection for the kernel estimate of the two-dimensional probability density of independent random variables. This method uses data on the quantitative characteristics of a family of lognormal distribution laws.
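
The bandwidth structure described above can be sketched as follows, under stated assumptions: a Gaussian kernel is used, "mean square deviation" is read as the standard deviation, and the common parameter `c` is simply passed in rather than obtained by the paper's fast selection procedure. The function names and sample data are hypothetical.

```python
import math
from statistics import pstdev


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(x, sample, c):
    """Rosenblatt-Parzen product-kernel estimate at point x.

    The bandwidth of coordinate v is c * sigma_v, where sigma_v is the
    standard deviation of that coordinate in the sample.
    Assumes every coordinate has nonzero spread.
    """
    n, k = len(sample), len(x)
    sigmas = [pstdev([p[v] for p in sample]) for v in range(k)]
    bandwidths = [c * s for s in sigmas]

    total = 0.0
    for p in sample:
        total += math.prod(
            gaussian_kernel((x[v] - p[v]) / bandwidths[v]) for v in range(k)
        )
    return total / (n * math.prod(bandwidths))


# Hypothetical two-dimensional sample.
sample = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
near = parzen_density((1.5, 1.5), sample, c=0.5)
far = parzen_density((10.0, 10.0), sample, c=0.5)
```

Tying every bandwidth to one shared parameter reduces the optimization from one unknown per coordinate to a single unknown, which is where the computational saving in such schemes comes from.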

All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3460122 ◽

2021 ◽

Vol 18 (4) ◽

pp. 1-22

Author(s):

Jerzy Proficz

Keyword(s):

Experimental Evaluation ◽

Data Exchange ◽

State Of The Art ◽

Monitoring And Evaluation ◽

The Other ◽

Early Data ◽

Cluster Architecture ◽

Novel Algorithms

Two novel algorithms for the all-gather operation resilient to imbalanced process arrival patterns (PAPs) are presented. The first one, Background Disseminated Ring (BDR), is based on the regular parallel ring algorithm often supplied in MPI implementations and exploits an auxiliary background thread for early data exchange from faster processes to accelerate the performed all-gather operation. The other algorithm, Background Sorted Linear synchronized tree with Broadcast (BSLB), is built upon the already existing PAP-aware gather algorithm, that is, Background Sorted Linear Synchronized tree (BSLS), followed by a regular broadcast distributing gathered data to all participating processes. The background of the imbalanced PAP subject is described, along with the PAP monitoring and evaluation topics. An experimental evaluation of the algorithms based on a proposed mini-benchmark is presented. The mini-benchmark was performed over 2,000 times in a typical HPC cluster architecture with homogeneous compute nodes. The obtained results are analyzed according to different PAPs, data sizes, and process numbers, showing that the proposed optimization works well for various configurations, is scalable, and can significantly reduce the all-gather elapsed times, in our case by up to a factor of 1.9, or 47%, in comparison with the best state-of-the-art solution.
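
For readers unfamiliar with the regular ring algorithm that BDR builds on, here is a minimal single-process simulation of a ring all-gather. It models only the baseline data movement: there is no background thread, no arrival-time imbalance, and no MPI call, and the function name is hypothetical.

```python
def ring_allgather(blocks):
    """Simulate a ring all-gather among p = len(blocks) processes.

    blocks[r] is the data initially owned by rank r. Returns one buffer per
    rank; after p - 1 steps every buffer holds all blocks in rank order.
    """
    p = len(blocks)
    buffers = [[None] * p for _ in range(p)]
    for rank in range(p):
        buffers[rank][rank] = blocks[rank]

    for step in range(p - 1):
        # In each step every rank forwards to its right neighbour the block
        # it received in the previous step (its own block in step 0).
        messages = []
        for rank in range(p):
            idx = (rank - step) % p
            messages.append(((rank + 1) % p, idx, buffers[rank][idx]))
        # Deliver after all sends are collected, so a step is simultaneous.
        for dest, idx, data in messages:
            buffers[dest][idx] = data
    return buffers


gathered = ring_allgather(["a", "b", "c", "d"])
```

Because each step depends on the previous one around the whole ring, one late process delays every other process, which is the sensitivity to arrival patterns that the abstract targets.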

Interpreting testing and assessment: A state-of-the-art review

Language Testing ◽

10.1177/02655322211036100 ◽

2021 ◽

pp. 026553222110361

Author(s):

Chao Han

Keyword(s):

State Of The Art ◽

Professional Certification ◽

Automatic Assessment ◽

Prospective Students ◽

Assessment Practice ◽

High Stakes ◽

Future Directions ◽

Assessment Tasks ◽

Interpreter Education ◽

Selection Of

Over the past decade, testing and assessing spoken-language interpreting has garnered an increasing amount of attention from stakeholders in interpreter education, professional certification, and interpreting research. This is because in these fields assessment results provide a critical evidential basis for high-stakes decisions, such as the selection of prospective students, the certification of interpreters, and the confirmation/refutation of research hypotheses. However, few reviews exist providing a comprehensive mapping of relevant practice and research. The present article therefore aims to offer a state-of-the-art review, summarizing the existing literature and discovering potential lacunae. In particular, the article first provides an overview of interpreting ability/competence and relevant research, followed by main testing and assessment practice (e.g., assessment tasks, assessment criteria, scoring methods, specificities of scoring operationalization), with a focus on operational diversity and psychometric properties. Second, the review describes a limited yet steadily growing body of empirical research that examines rater-mediated interpreting assessment, and casts light on automatic assessment as an emerging research topic. Third, the review discusses epistemological, psychometric, and practical challenges facing interpreting testers. Finally, it identifies future directions that could address the challenges arising from fast-changing pedagogical, educational, and professional landscapes.

Cache-efficient sweeping-based interval joins for extended Allen relation predicates

The VLDB Journal ◽

10.1007/s00778-020-00650-5 ◽

2021 ◽

Author(s):

Danila Piatov ◽

Sven Helmer ◽

Anton Dignös ◽

Fabio Persia

Keyword(s):

Data Structure ◽

Experimental Evaluation ◽

State Of The Art ◽

Temporal Databases ◽

Access Method ◽

Wide Range ◽

Interval Relation ◽

Cache Efficient ◽

Join Algorithms ◽

Better Than

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
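
The general plane-sweep idea can be sketched for the simplest predicate, interval overlap. This is a generic endpoint sweep with plain Python sets, not the paper's framework, Timeline Index, or gapless hash map; the function name is hypothetical, and intervals are assumed to be half-open [start, end) with start < end.

```python
def overlap_join(r, s):
    """Return index pairs (i, j) such that r[i] and s[j] overlap.

    r, s: lists of (start, end) tuples, read as half-open [start, end)
    with start < end.
    """
    events = []
    for side, relation in (("r", r), ("s", s)):
        for idx, (start, end) in enumerate(relation):
            events.append((start, 1, side, idx))  # 1 = interval starts
            events.append((end, 0, side, idx))    # 0 = interval ends
    # At equal timestamps, ends (0) sort before starts (1), so intervals
    # that merely touch are not reported as overlapping.
    events.sort()

    active = {"r": set(), "s": set()}
    result = []
    for _, kind, side, idx in events:
        if kind == 0:
            active[side].discard(idx)
            continue
        other = "s" if side == "r" else "r"
        # Every interval still active on the other side overlaps this one.
        for j in active[other]:
            result.append((idx, j) if side == "r" else (j, idx))
        active[side].add(idx)
    return result


pairs = sorted(overlap_join([(1, 5), (6, 9)], [(4, 7), (9, 12)]))
```

The sweep touches each endpoint once and only compares an interval with those currently active, so the cost is driven by sorting plus the size of the output rather than by all pairs.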

A look-ahead Monte Carlo simulation method for improving parental selection in trait introgression

Scientific Reports ◽

10.1038/s41598-021-83634-x ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Saba Moeinizade ◽

Ye Han ◽

Hieu Pham ◽

Guiping Hu ◽

Lizhi Wang

Keyword(s):

Monte Carlo ◽

State Of The Art ◽

Simulation Method ◽

Multiple Trait ◽

Look Ahead ◽

Probability Of Success ◽

Genetic Distribution ◽

Selection Of ◽

Trait Introgression

Multiple trait introgression is the process by which multiple desirable traits are converted from a donor to a recipient cultivar through backcrossing and selfing. The goal of this procedure is to recover all the attributes of the recipient cultivar, with the addition of the specified desirable traits. A crucial step in this process is the selection of parents to form new crosses. In this study, we propose a new selection approach that estimates the genetic distribution of the progeny of backcrosses after multiple generations using information of recombination events. Our objective is to select the most promising individuals for further backcrossing or selfing. To demonstrate the effectiveness of the proposed method, a case study has been conducted using maize data where our method is compared with state-of-the-art approaches. Simulation results suggest that the proposed method, look-ahead Monte Carlo, achieves higher probability of success than existing approaches. Our proposed selection method can assist breeders to efficiently design trait introgression projects.

State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues

Diagnostic and Prognostic Research ◽

10.1186/s41512-020-00074-3 ◽

2020 ◽

Vol 4 (1) ◽

Cited By ~ 6

Author(s):

Willi Sauerbrei ◽

Aris Perperoglou ◽

Matthias Schmid ◽

Michal Abrahamowicz ◽

...

Keyword(s):

State Of The Art ◽

Multivariable Analysis ◽

Selection Of Variables ◽

Functional Forms ◽

Selection Of

A Mating Selection Based on Modified Strengthened Dominance Relation for NSGA-III

Mathematics ◽

10.3390/math9222837 ◽

2021 ◽

Vol 9 (22) ◽

pp. 2837

Author(s):

Saykat Dutta ◽

Sri Srinivasa Raju M ◽

Rammohan Mallipeddi ◽

Kedar Nath Das ◽

Dong-Gyu Lee

Keyword(s):

Pareto Front ◽

State Of The Art ◽

Weight Vector ◽

Dominance Relation ◽

Additional Parameter ◽

Pareto Dominance ◽

Environmental Selection ◽

Dominance Relationships ◽

Test Suites ◽

Selection Of

In multi/many-objective evolutionary algorithms (MOEAs), numerous modified dominance relationships have been proposed to alleviate the degraded convergence pressure of Pareto dominance as the number of objectives increases. Recently, the strengthened dominance relation (SDR) has been proposed, where the dominance area of a solution is determined by convergence degree and niche size (θ¯). Later, in controlled SDR (CSDR), θ¯ and an additional parameter (k) associated with the convergence degree are dynamically adjusted depending on the iteration count. Depending on the problem characteristics and the distribution of the current population, different situations require different values of k, rendering the linear reduction of k based on the generation count ineffective. This is because a particular value of k is expected to bias the dominance relationship towards a particular region on the Pareto front (PF). In addition, for the same reason, using SDR or CSDR in the environmental selection cannot preserve the diversity of solutions required to cover the entire PF. Therefore, we propose an MOEA, referred to as NSGA-III*, where (1) a modified SDR (MSDR)-based mating selection with an adaptive ensemble of parameter k would prioritize parents from specific sections of the PF depending on k, and (2) the traditional weight vector and non-dominated sorting-based environmental selection of NSGA-III would protect the solutions corresponding to the entire PF. The performance of NSGA-III* is favourably compared with state-of-the-art MOEAs on DTLZ and WFG test suites with up to 10 objectives.
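
As background for the dominance relations discussed above, here is a minimal sketch of plain Pareto dominance and a non-dominated filter, assuming all objectives are minimized. It does not implement SDR, CSDR, or MSDR, which alter the dominance area; the function names and points are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization):
    a is no worse in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that no other point dominates (the first front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective population.
population = [(1, 4), (2, 2), (4, 1), (3, 3), (2, 5)]
front = non_dominated(population)
```

With many objectives, fewer and fewer pairs satisfy the "no worse in every objective" test, so most of the population ends up in the first front; this is the loss of convergence pressure that modified relations try to counter.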

X-Mark: a benchmark for node-attributed community discovery algorithms

Social Network Analysis and Mining ◽

10.1007/s13278-021-00823-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Salvatore Citraro ◽

Giulio Rossetti

Keyword(s):

State Of The Art ◽

Graph Clustering ◽

Clustering Methods ◽

Community Discovery ◽

Attributed Graphs ◽

Node Labels ◽

Discovery Algorithms ◽

Shed Light

Grouping well-connected nodes that also result in label-homogeneous clusters is a task often known as attribute-aware community discovery. While approaching node-enriched graph clustering methods, rigorous tools need to be developed for evaluating the quality of the resulting partitions. In this work, we present X-Mark, a model that generates synthetic node-attributed graphs with planted communities. Its novelty consists in forming communities and node labels contextually while handling categorical or continuous attributive information. Moreover, we propose a comparison between attribute-aware algorithms, testing them against our benchmark. According to different classification schemas from recent state-of-the-art surveys, our results suggest that X-Mark can shed light on the differences between several families of algorithms.
