Parallel Similarity Join with Data Partitioning for Prefix Filtering

Author(s):  
Jaruloj Chongstitvatana ◽  
Methus Bhirakit

Similarity join is necessary for many applications, such as text search and data preparation. Measuring the similarity between two strings is expensive because inexact matches are allowed and strings in databases are long. To reduce the cost of similarity join, the filter-and-verify approach reduces the number of string pairs that require the computation of the similarity function. Prefix filtering is a filter-and-verify method that filters out dissimilar strings by examining only their prefixes. The effectiveness of prefix filtering depends on the length of the examined prefix, and an adaptive method has been proposed to find a suitable prefix length for filtering. Based on this concept, we propose to divide a dataset into partitions and determine a suitable prefix length for each partition. This also allows similarity join to run in parallel on each data partition. Our experiments show that the proposed method achieves higher performance because the number of candidates is reduced and the program can execute in parallel. Moreover, the performance of the proposed method depends on the number of data partitions: if a data partition is too small, the prefix length chosen for it may not be optimal.
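The filter-and-verify idea behind prefix filtering can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Jaccard similarity function, the rare-token-first ordering, and the prefix-length formula for a Jaccard threshold t are standard textbook choices assumed here.

```python
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_join(records, t):
    """Filter-and-verify similarity join with prefix filtering.

    Tokens are globally ordered by frequency (rare tokens first), so
    short prefixes are selective.  Two records can reach Jaccard >= t
    only if their prefixes share at least one token.
    """
    # global token ordering: rare tokens first
    freq = {}
    for r in records:
        for tok in set(r):
            freq[tok] = freq.get(tok, 0) + 1
    order = lambda tok: (freq[tok], tok)

    # filter phase: inverted index over prefixes only
    sorted_recs = [sorted(set(r), key=order) for r in records]
    index = {}          # token -> ids of records with it in their prefix
    candidates = set()
    for i, toks in enumerate(sorted_recs):
        # prefix length guaranteeing no false dismissals at threshold t
        plen = len(toks) - math.ceil(t * len(toks)) + 1
        for tok in toks[:plen]:
            for j in index.get(tok, []):
                candidates.add((j, i))
            index.setdefault(tok, []).append(i)

    # verify phase: compute the real similarity only for candidates
    return [(i, j) for i, j in candidates
            if jaccard(records[i], records[j]) >= t]

pairs = prefix_filter_join(
    [["data", "base", "join"], ["data", "base", "scan"], ["x", "y", "z"]],
    t=0.5)
```

In the partitioned variant proposed by the paper, a call like this would run independently (and hence in parallel) on each data partition with its own prefix length.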

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Verônica A. Thode ◽  
Caetano T. Oliveira ◽  
Benoît Loeuille ◽  
Carolina M. Siniscalchi ◽  
José R. Pirani

Abstract: We assembled new plastomes of 19 species of Mikania and of Ageratina fastigiata, Litothamnus nitidus, and Stevia collina, all belonging to tribe Eupatorieae (Asteraceae). We analyzed the structure and content of the assembled plastomes and used the newly generated sequences to infer phylogenetic relationships and study the effects of different data partitions and inference methods on the topologies. Most phylogenetic studies with plastomes ignore that processes like recombination and biparental inheritance can occur in this organelle, using the whole genome as a single locus. Our study sought to compare this approach with multispecies coalescent methods that assume that different parts of the genome evolve at different rates. We found that the overall gene content, structure, and orientation are very conserved in all plastomes of the studied species. As observed in other Asteraceae, the 22 plastomes assembled here contain two nested inversions in the LSC region. The plastomes show similar length and the same gene content. The two most variable regions within Mikania are rpl32-ndhF and rpl16-rps3, while the three genes with the highest percentage of variable sites are ycf1, rpoA, and psbT. We generated six phylogenetic trees using concatenated maximum likelihood and multispecies coalescent methods and three data partitions: coding and non-coding sequences and both combined. All trees strongly support that the sampled Mikania species form a monophyletic group, which is further subdivided into three clades. The internal relationships within each clade are sensitive to the data partitioning and inference methods employed. The trees resulting from concatenated analysis are more similar to one another than to the corresponding tree generated with the same data partition but a different method. The multispecies coalescent analysis indicates a high level of incongruence between species and gene trees.
The lack of resolution and congruence among trees can be explained by the sparse sampling (~ 0.45% of the currently accepted species) and by the low number of informative characters present in the sequences. Our study sheds light on the impact of data partitioning and inference methods on phylogenetic resolution and provides relevant information for the study of Mikania diversity and evolution, as well as for the Asteraceae family as a whole.


2019 ◽  
Vol 23 (1) ◽  
pp. 67-77 ◽  
Author(s):  
Yao Yevenyo Ziggah ◽  
Hu Youjian ◽  
Alfonso Rodrigo Tierra ◽  
Prosper Basommi Laari

The popularity of the Artificial Neural Network (ANN) methodology has been growing in a wide variety of areas in geodesy and the geospatial sciences. Its ability to perform coordinate transformation between different datums has been well documented in the literature. In applying ANN methods to coordinate transformation, only the train-test (hold-out cross-validation) approach has usually been used to evaluate their performance. Here, the dataset is divided into two disjoint subsets: training (model building) and testing (model validation). However, one major drawback of the hold-out cross-validation procedure is inappropriate data partitioning: an improper split of the data can lead to high variance and bias in the results. Moreover, in a sparse-dataset situation, hold-out cross-validation is not suitable. For these reasons, the K-fold cross-validation approach has been recommended. Consequently, this study, for the first time, explored the potential of the K-fold cross-validation method for assessing the performance of a radial basis function neural network (RBFNN) and the Bursa-Wolf model under a data-insufficient situation in the Ghana geodetic reference network. The statistical analysis of the results revealed that an incorrect data partition could lead to a misleading report of the predictive performance of the transformation model. The findings showed that the RBFNN and the Bursa-Wolf model produced transformation accuracies of 0.229 m and 0.469 m, respectively, with maximum horizontal errors of 0.881 m and 2.131 m. The obtained results meet the cadastral surveying and plan production requirements set by the Ghana Survey and Mapping Division.
This study will contribute to the use of the K-fold cross-validation approach in developing countries with sparse-dataset situations similar to Ghana's, as well as in the geodetic sciences, where ANN users seldom apply statistical resampling techniques.
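The K-fold procedure the study advocates can be sketched generically as follows. This is a toy illustration of the resampling scheme only, not the study's RBFNN or Bursa-Wolf setup; the fold count, the mean-as-model stand-in, and the absolute-error metric are assumptions made for the example.

```python
import random

def k_fold_cv(data, k, fit, error):
    """K-fold cross-validation: every point is used for testing exactly
    once, avoiding the variance of a single hold-out split."""
    data = data[:]                 # do not mutate the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        model = fit(train)
        errors.append(sum(error(model, p) for p in test) / len(test))
    return sum(errors) / k         # mean error over the k folds

# toy example: the "model" is just the training mean,
# the error is the absolute deviation from it
random.seed(0)
mean_err = k_fold_cv(
    data=list(range(20)), k=5,
    fit=lambda train: sum(train) / len(train),
    error=lambda m, p: abs(p - m))
```

Because each record appears in exactly one test fold, the reported error is an average over k disjoint validation sets rather than over one arbitrary split.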


1999 ◽  
Vol 09 (01) ◽  
pp. 135-146
Author(s):  
GAGAN AGRAWAL

An important component in compiling for distributed memory machines is data partitioning. While a number of automatic analysis techniques have been proposed for this phase, none of them is applicable to irregular problems. In this paper, we present compile-time analysis for determining data partitioning for such applications. We have developed a set of cost functions for determining communication and redistribution costs in irregular codes. We first determine the appropriate distributions for a single data parallel statement, and then use the cost functions with a greedy algorithm for computing distributions for the full program. Initial performance results on a 16-processor IBM SP-2 are also presented.
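The greedy pass over the program described in the abstract can be sketched abstractly. The statement names, distribution names, and cost numbers below are invented for illustration; the paper's actual cost functions for irregular codes are not reproduced here.

```python
def greedy_distributions(statements, dists, exec_cost, redist_cost):
    """Greedy pass over the program: for each statement, pick the data
    distribution minimizing its execution cost plus the cost of
    redistributing from the previous statement's chosen distribution."""
    chosen, prev, total = [], None, 0
    for s in statements:
        best = min(dists,
                   key=lambda d: exec_cost(s, d)
                   + (redist_cost(prev, d) if prev else 0))
        total += exec_cost(s, best) + (redist_cost(prev, best) if prev else 0)
        chosen.append(best)
        prev = best
    return chosen, total

# invented cost tables for two data parallel statements
exec_c = {("s1", "block"): 4, ("s1", "cyclic"): 1,
          ("s2", "block"): 1, ("s2", "cyclic"): 5}
plan, cost = greedy_distributions(
    ["s1", "s2"], ["block", "cyclic"],
    exec_cost=lambda s, d: exec_c[(s, d)],
    redist_cost=lambda a, b: 0 if a == b else 2)
```

Here the greedy choice switches from `cyclic` to `block` between the two statements because paying the redistribution cost is cheaper than executing `s2` under the wrong distribution.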


2013 ◽  
Vol 756-759 ◽  
pp. 1984-1988
Author(s):  
Jian Hui Ma ◽  
Zhi Xue Wang ◽  
Gang Wang ◽  
Yuan Yang Liu ◽  
Yan Qiang Li

This paper presents a method for non-volatile data storage using the MCU's internal data Flash. A data Flash sector is divided into multiple data partitions: different partitions store copies of the data from different points in time, and the current partition stores the latest copy. On a read, the storage location of the latest copy is calculated first and that address is then read directly. On a write, the method first checks whether the target position has already been erased: if not, the data is written to the next partition and the remaining data in the current partition is copied over; if it has been erased, the data is written directly to the current partition. The method resembles EEPROM read/write operations, is easy to use, provides a simple application interface, and avoids sector erase operations, improving storage efficiency while extending the service life of the MCU's internal data Flash.
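The read/write scheme can be modeled in a few lines. This is a simplified in-memory sketch, not firmware: the partition count, partition size, and erased value (0xFF, the usual erased state of NOR Flash) are assumptions, and a real implementation would erase the next partition before rotating into it.

```python
ERASED = 0xFF  # erased Flash cells read back as all ones

class FlashSector:
    """Simplified model of a data-Flash sector split into partitions.
    Rewrites rotate to the next partition instead of erasing in place,
    giving an EEPROM-like interface and spreading wear."""
    def __init__(self, partitions=4, size=8):
        self.parts = [[ERASED] * size for _ in range(partitions)]
        self.current = 0            # partition holding the latest copy

    def read(self, offset):
        # compute where the latest copy lives, then read directly
        return self.parts[self.current][offset]

    def write(self, offset, value):
        cur = self.parts[self.current]
        if cur[offset] == ERASED:
            # position still erased: write directly in place
            cur[offset] = value
        else:
            # otherwise write into the next partition and copy the
            # other data from the current partition along with it
            nxt = (self.current + 1) % len(self.parts)
            self.parts[nxt] = cur[:]
            self.parts[nxt][offset] = value
            self.current = nxt

s = FlashSector()
s.write(0, 0x11)      # first write lands in partition 0
s.write(0, 0x22)      # rewrite rotates the latest copy to partition 1
```

The rotation is what lets the scheme avoid a sector erase on every rewrite, which is the source of the wear and efficiency gains claimed in the paper.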


Author(s):  
Hans-Peter Kriegel ◽  
Peer Kröger ◽  
Martin Pfeifle ◽  
Stefan Brecheisen ◽  
Marco Pötke ◽  
...  

Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging, and many others. Especially for CAD (Computer-Aided Design), suitable similarity models and a clear representation of the results can help to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts. In this chapter, we present different similarity models for voxelized CAD data based on space partitioning and data partitioning. Based on these similarity models, we introduce an industrial prototype, called BOSS, which helps the user to gain an overview of a set of CAD objects. BOSS allows the user to easily browse large data collections by graphically displaying the results of a hierarchical clustering algorithm. This representation is well suited for the evaluation of similarity models and for aiding an industrial user searching for similar parts.


Author(s):  
Michael Gruenstaeudl

The monophyly of Nymphaeaceae (water lilies) represents a critical question in understanding the evolutionary history of early-diverging angiosperms. A recent plastid phylogenomic investigation claimed new evidence for the monophyly of Nymphaeaceae, but its results could not be verified from the available data. Moreover, preliminary gene-wise analyses of the same dataset provided partial support for the paraphyly of the family. The present investigation aims to re-assess the previous conclusion of the monophyly of Nymphaeaceae under the same dataset and to determine the congruence of the phylogenetic signal across different plastome genes and data partition strategies. To that end, phylogenetic tree inference is conducted on each of 78 protein-coding plastome genes, both individually and upon concatenation, and under four data partitioning schemes. Moreover, the possible effects of various sequence variability and homoplasy metrics on the inference of specific phylogenetic relationships are tested using multiple logistic regression. Differences in the variability of polymorphic sites across codon positions are assessed using parametric and non-parametric analysis of variance. The results of the phylogenetic reconstructions indicate considerable incongruence among the different gene trees as well as the data partitioning schemes. The results of the multiple logistic regression tests indicate that the fraction of polymorphic sites of codon position 3 has a significant effect on the recovery of the monophyly of Nymphaeaceae. Taken together, these results indicate that the monophyly of Nymphaeaceae currently remains indeterminate, and that specific phylogenetic conclusions are strongly dependent on the precise plastome gene, data partitioning scheme, and codon position evaluated. 
In closing, I discuss the importance of archiving all data of an investigation in publicly accessible data repositories, along with sufficient details to replicate the published results, and provide recommendations on future plastid phylogenomic investigations of Nymphaeales.


2015 ◽  
Vol 11 (2) ◽  
pp. 44-61 ◽  
Author(s):  
Ladjel Bellatreche ◽  
Amira Kerkad

With the explosion of data, many applications are designed around analytical workloads, with data warehousing technology at the heart of the construction chain. A data warehouse is usually exploited through complex queries involving selections, joins, and aggregations. These queries have three characteristics: (1) they are executed routinely, (2) they are numerous, and (3) they share many operations. This interaction has been widely studied in the context of multi-query optimization, where graph data structures were proposed to capture it. These structures have also been used during physical design to select redundant optimization structures such as materialized views and indexes. Horizontal data partitioning (HDP) is a non-redundant optimization structure that can be selected in the physical design phase. It is a precondition for designing extremely large databases in several environments: centralized, distributed, parallel, and cloud. It aims to reduce the cost of the above operations. In HDP, the optimization space of potential candidates for partitioning grows exponentially with the problem size, making the selection problem NP-hard. This paper proposes a new approach based on query interactions that selects a partitioning schema of a data warehouse in a divide-and-conquer manner, achieving an improved trade-off between the optimization algorithm's speed and the quality of the solution. The effectiveness of our approach is demonstrated through a validation using the Star Schema Benchmark (100 GB) on Oracle 11g.
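The benefit HDP targets, scanning only the fragments relevant to a query instead of the whole table, can be sketched as follows. The table, attribute names, and values are invented for illustration; real selection of a partitioning schema, the NP-hard problem the paper addresses, is far more involved.

```python
from collections import defaultdict

def partition_horizontally(rows, key):
    """Horizontal data partitioning: each fragment holds whole rows
    sharing the same value of the partitioning attribute."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[key]].append(row)
    return fragments

def query(fragments, key_value, predicate):
    # partition pruning: only the matching fragment is scanned,
    # not the whole table
    return [r for r in fragments.get(key_value, []) if predicate(r)]

sales = [{"region": "EU", "amount": 10},
         {"region": "EU", "amount": 99},
         {"region": "US", "amount": 50}]
frags = partition_horizontally(sales, "region")
hits = query(frags, "EU", lambda r: r["amount"] > 20)
```

A selection on the partitioning attribute touches one fragment; the cost of choosing *which* attributes and predicates to partition on, across many interacting queries, is what makes the schema-selection problem hard.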


Author(s):  
Ludger Starke ◽  
Thoralf Niendorf ◽  
Sonia Waiczies

Abstract: Fluorine-19 MRI shows great promise for a wide range of applications including renal imaging, yet the typically low signal-to-noise ratios and sparse signal distribution necessitate thorough data preparation. This chapter describes a general data preparation workflow for fluorine MRI experiments. The main processing steps are: (1) estimation of the noise level, (2) correction of noise-induced bias and (3) background subtraction. The protocol is supplemented by an example script and toolbox available online. This chapter is based upon work from the COST Action PARENCHIMA, a community-driven network funded by the European Cooperation in Science and Technology (COST) program of the European Union, which aims to improve the reproducibility and standardization of renal MRI biomarkers. This analysis protocol chapter is complemented by two separate chapters describing the basic concept and experimental procedure.
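The three processing steps can be sketched generically. The Rayleigh-mean background estimate and the first-order power correction of the Rician bias are textbook approximations for magnitude MRI data, not the exact algorithm of the chapter's toolbox; the pixel values and the 3-sigma threshold are invented for the example.

```python
import math

def estimate_sigma(background):
    """Step 1: estimate the noise level from a signal-free background
    region.  For magnitude MRI data, background pixels follow a
    Rayleigh distribution with mean sigma * sqrt(pi / 2)."""
    return (sum(background) / len(background)) / math.sqrt(math.pi / 2)

def correct_bias(m, sigma):
    """Step 2: first-order correction of the Rician noise-induced bias
    in a magnitude value m, using E[M^2] = A^2 + 2 * sigma^2."""
    return math.sqrt(max(m * m - 2 * sigma * sigma, 0.0))

def subtract_background(values, threshold):
    """Step 3: suppress residual background below a threshold."""
    return [v if v > threshold else 0.0 for v in values]

sigma = estimate_sigma([1.2, 1.3, 1.1, 1.4])   # noise-only pixels
cleaned = subtract_background(
    [correct_bias(v, sigma) for v in [0.9, 1.1, 6.0]],
    threshold=3 * sigma)
```

Noise-level pixels are driven to zero while the genuine signal survives with its noise-induced inflation removed, which matters at the low SNR typical of fluorine-19 imaging.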


2014 ◽  
Vol 536-537 ◽  
pp. 512-515
Author(s):  
Jian Qiang Dai

With the promotion of 3G networks and the upcoming 4G network, the number of mobile phone users is constantly rising, and the volume of data they produce soars every day. The resulting variety of data makes databases grow rapidly, and application systems accessing these heterogeneous, huge databases are bound to encounter access difficulties and low access efficiency. This paper proposes a query optimization technique based on data partitioning that reduces the effective database size to address the problems of memory limits and massive data access. Experimental tests show that the method can effectively improve the efficiency of accessing huge amounts of data and achieves satisfactory results.

