Join Algorithm Latest Research Papers

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can measure entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
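As a rough illustration of the entropy measure this abstract refers to, the sketch below computes the Shannon entropy of the join keys that land in each partition. It is not the DPJ algorithm itself: the function names, the hash-modulo partitioning rule, and the sample records are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def partition_entropies(records, num_partitions):
    """Hash-partition (key, value) records and return the key entropy per partition.

    The hash-modulo rule is an assumed stand-in for a MapReduce partitioner.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, _value in records:
        partitions[hash(key) % num_partitions].append(key)
    # An empty partition carries no information, so report 0.0 for it.
    return [shannon_entropy(p) if p else 0.0 for p in partitions]


# Hypothetical sample data: integer join keys with arbitrary payloads.
records = [(1, "x"), (1, "y"), (2, "z"), (3, "w")]
entropies = partition_entropies(records, 2)
```

A partition holding a single distinct key has entropy 0, while a partition whose keys are spread evenly has the highest entropy, which is the kind of per-partition signal a repartitioning strategy could compare between stages.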

In-Memory Interval Joins

The VLDB Journal ◽

10.1007/s00778-020-00639-0 ◽

2021 ◽

Author(s):

Panagiotis Bouros ◽

Nikos Mamoulis ◽

Dimitrios Tsitsigkos ◽

Manolis Terrovitis

Keyword(s):

Parallel Computation ◽

State Of The Art ◽

Complex Data ◽

Plane Sweep ◽

Join Algorithm ◽

Sweep Algorithm ◽

Join Algorithms ◽

Domain Partitioning ◽

Complex Data Structure ◽

Independent Tasks

The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk, so their focus is on minimizing I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with or even faster than the state of the art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins to be scheduled in the available CPU threads and propose an adaptive domain partitioning, aiming at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
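The forward scan idea mentioned in this abstract can be sketched as follows. This is a minimal, unoptimized version that assumes closed intervals given as (start, end) tuples; it does not include the four optimizations or the parallel schemes the authors propose, and the function name is an assumption.

```python
def forward_scan_join(R, S):
    """Return all overlapping pairs (r, s) of closed intervals from R and S.

    Both inputs are sorted by start point. The interval with the smaller
    start is picked next and the other input is scanned forward from its
    current position for as long as intervals start before the picked
    interval ends.
    """
    R = sorted(R)
    S = sorted(S)
    out = []
    r = s = 0
    while r < len(R) and s < len(S):
        if R[r][0] <= S[s][0]:
            # R[r] starts first: every S[j] that starts before R[r] ends overlaps it.
            j = s
            while j < len(S) and S[j][0] <= R[r][1]:
                out.append((R[r], S[j]))
                j += 1
            r += 1
        else:
            # S[s] starts first: scan forward in R symmetrically.
            i = r
            while i < len(R) and R[i][0] <= S[s][1]:
                out.append((R[i], S[s]))
                i += 1
            s += 1
    return out


# Hypothetical sample intervals.
R = [(1, 5), (6, 8)]
S = [(4, 7), (9, 10)]
pairs = forward_scan_join(R, S)
```

Because each pair is reported only when the interval with the smaller start point is processed, no pair is emitted twice and no separate structure of active intervals is needed.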

A Vehicle ID identification Architecture: A Parallel-Joining WSN Algorithm

Iraqi Journal of Science ◽

10.24996/ijs.2021.si.1.37 ◽

2021 ◽

pp. 267-270

Author(s):

Sami Hasan ◽

Abdulhakeem Amer

Keyword(s):

Sensor Networks ◽

Sensor Network ◽

Communication Cost ◽

Sensor Data ◽

Join Algorithm ◽

Battery Power ◽

Remote Sensor ◽

Speed Up ◽

Sensor Information

Several wireless sensor network (WSN) tasks require joining sensor data. This in-network join is performed in parallel at the sensor nodes to save battery power and limit communication cost. Hence, a parallel join system is proposed for sensor networks. The proposed parallel join algorithm operates on column-oriented databases. The novel join method is intended to limit the total communication cost in WSNs and improve performance. The approach relies on two techniques: column-oriented databases, used to store sensor data, and a parallel join algorithm, used to speed up processing. A column-oriented database stores a data table column by column. The Parallel-Joining WSN algorithm is feasible for two reasons. First, the join is carried out on distributed fragments. Second, the parallel join processes sensor data on the fly. A parallel distributed algorithm has been developed that saves time compared with the single distributed algorithm.

AN EFFICIENT APPROACH FOR IMPROVING RECURSIVE JOIN ALGORITHM ON LARGE DATASETS

KỶ YẾU HỘI NGHỊ KHOA HỌC CÔNG NGHỆ QUỐC GIA LẦN THỨ XIII NGHIÊN CỨU CƠ BẢN VÀ ỨNG DỤNG CÔNG NGHỆ THÔNG TIN - Proceedings of the 13th National Conference on Fundamental & Applied Information Technology Research ◽

10.15625/vap.2020.00175 ◽

2020 ◽

Author(s):

Trieu Thanh Ngoan ◽

Phan Anh Cang ◽

Phan Thuong Cang

Keyword(s):

Large Datasets ◽

Efficient Approach ◽

Join Algorithm

A Note on Ultrametric Spaces, Minimum Spanning Trees and the Topological Distance Algorithm

Information ◽

10.3390/info11090418 ◽

2020 ◽

Vol 11 (9) ◽

pp. 418

Author(s):

Jörg Schäfer

Keyword(s):

Spanning Tree ◽

Spanning Trees ◽

Minimum Spanning Tree ◽

Greedy Algorithms ◽

Reconstruction Algorithm ◽

Minimum Spanning Trees ◽

Ultrametric Spaces ◽

Correctness Proof ◽

Topological Distance ◽

Join Algorithm

We relate the definition of an ultrametric space to the topological distance algorithm—an algorithm defined in the context of peer-to-peer network applications. Although (greedy) algorithms for constructing minimum spanning trees such as Prim’s or Kruskal’s algorithm have been known for a long time, they require the complete graph to be specified and the weights of all edges to be known upfront in order to construct a minimum spanning tree. However, if the weights of the underlying graph stem from an ultrametric, the minimum spanning tree can be constructed incrementally and it is not necessary to know the full graph in advance. This is possible, because the join algorithm responsible for joining new nodes on behalf of the topological distance algorithm is independent of the order in which the nodes are added due to the property of an ultrametric. Apart from the mathematical elegance which some readers might find interesting in itself, this provides not only proofs (and clearer ones in the opinion of the author) for optimality theorems (i.e., proof of the minimum spanning tree construction) but a simple proof for the optimality of the reconstruction algorithm omitted in previous publications too. Furthermore, we define a new algorithm by extending the join algorithm to minimize the topological distance and (network) latency together and provide a correctness proof.
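The property this abstract relies on is the strong triangle inequality, d(x, z) ≤ max(d(x, y), d(y, z)), which every ultrametric satisfies. The sketch below checks it by brute force on a small point set. The prefix-based distance on bit strings is a hypothetical example chosen for illustration, not the topological distance defined in the paper.

```python
from itertools import permutations


def is_ultrametric(points, dist):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all ordered triples of points."""
    return all(
        dist(x, z) <= max(dist(x, y), dist(y, z))
        for x, y, z in permutations(points, 3)
    )


def prefix_distance(a, b):
    """Assumed example distance: string length minus the common prefix length."""
    common = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        common += 1
    return len(a) - common


# Equal-length bit strings under the prefix distance form an ultrametric space.
ids = ["000", "001", "010", "110"]
prefix_ok = is_ultrametric(ids, prefix_distance)

# The absolute difference on integers is a metric but not an ultrametric.
abs_ok = is_ultrametric([0, 1, 2], lambda x, y: abs(x - y))
```

The second check fails because d(0, 2) = 2 exceeds max(d(0, 1), d(1, 2)) = 1, which shows that an ordinary metric does not automatically have the property.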

Pushing the Scalability of RDF Engines on IoT Edge Devices

Sensors ◽

10.3390/s20102788 ◽

2020 ◽

Vol 20 (10) ◽

pp. 2788

Author(s):

Anh Le-Tuan ◽

Conor Hayes ◽

Manfred Hauswirth ◽

Danh Le-Phuoc

Keyword(s):

Buffer Management ◽

Semantic Integration ◽

Management Technique ◽

Computer Design ◽

Join Algorithm ◽

Memory Footprint ◽

Description Framework ◽

Resource Description ◽

The Internet Of Things ◽

Better Than

Semantic interoperability for the Internet of Things (IoT) is enabled by standards and technologies from the Semantic Web. As recent research suggests a move towards decentralised IoT architectures, we have investigated the scalability and robustness of RDF (Resource Description Framework) engines that can be embedded throughout the architecture, in particular at edge nodes. RDF processing at the edge facilitates the deployment of semantic integration gateways closer to low-level devices. Our focus is on how to enable scalable and robust RDF engines that can operate on lightweight devices. In this paper, we have first carried out an empirical study of the scalability and behaviour of solutions for RDF data management on standard computing hardware that have been ported to run on lightweight devices at the network edge. The findings of our study show that these RDF store solutions have several shortcomings on commodity ARM (Advanced RISC Machine) boards that are representative of IoT edge node hardware. Consequently, this has inspired us to introduce a lightweight RDF engine, which comprises an RDF storage and a SPARQL processor for lightweight edge devices, called RDF4Led. RDF4Led follows the RISC-style (Reduced Instruction Set Computer) design philosophy. The design constitutes a flash-aware storage structure, an indexing scheme, an alternative buffer management technique and a low-memory-footprint join algorithm that demonstrates improved scalability and robustness over competing solutions. With a significantly smaller memory footprint, we show that RDF4Led can handle 2 to 5 times more data than popular RDF engines such as Jena TDB (Tuple Database) and RDF4J, while consuming the same amount of memory. In particular, RDF4Led requires 10%–30% of the memory of its competitors to operate on datasets of up to 50 million triples. On memory-constrained ARM boards, it can perform faster updates and can scale better than Jena TDB and Virtuoso. Furthermore, we demonstrate considerably faster query operations than Jena TDB and RDF4J.

SETJoin: a novel top-k similarity join algorithm

Soft Computing ◽

10.1007/s00500-020-04807-w ◽

2020 ◽

Vol 24 (19) ◽

pp. 14577-14592

Author(s):

Hongya Wang ◽

Lihong Yang ◽

Yingyuan Xiao

Keyword(s):

Similarity Join ◽

Join Algorithm

A-DSP: An Adaptive Join Algorithm for Dynamic Data Stream on Cloud System

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2019.2947055 ◽

2020 ◽

pp. 1-1

Author(s):

Junhua Fang ◽

Rong Zhang ◽

Yan Zhao ◽

Kai Zheng ◽

Xiaofang Zhou ◽

...

Keyword(s):

Data Stream ◽

Cloud System ◽

Dynamic Data ◽

Join Algorithm

Optimizing Semi-Stream CACHEJOIN for Near-Real-Time Data Warehousing

Journal of Database Management ◽

10.4018/jdm.2020010102 ◽

2020 ◽

Vol 31 (1) ◽

pp. 20-37 ◽

Cited By ~ 1

Author(s):

M. Asif Naeem ◽

Erum Mehmood ◽

M. G. Abbas Malik ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Data Warehousing ◽

Streaming Data ◽

Service Rate ◽

Stream Data ◽

Time Data ◽

Join Algorithm ◽

Real Time Data ◽

Two Phases ◽

Critical Process

Streaming data join is a critical process in the field of near-real-time data warehousing. For this purpose, an adaptive semi-stream join algorithm called CACHEJOIN (Cache Join), which focuses on non-uniform stream data, is available in the literature. However, this algorithm cannot exploit memory and CPU resources optimally, and its service rate is therefore suboptimal, because its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase, execute sequentially. Building on the advantages of CACHEJOIN, this article presents two modifications of it. The first, P-CACHEJOIN (Parallel Cache Join), enables the parallel processing of the two phases in CACHEJOIN. This increases the number of joined stream records and therefore improves throughput considerably. The second, OP-CACHEJOIN (Optimized Parallel Cache Join), implements parallel loading of stored data into memory while the DP phase is executing. This research empirically analyzes the performance of both approaches against the existing CACHEJOIN using a synthetic skewed dataset.
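The sequential two-phase structure described in this abstract might look roughly like the sketch below. The abstract only names the phases, so the details are assumptions: the SP phase is taken to match stream tuples against an in-memory cache of master records, and the DP phase is taken to look up the remaining tuples in stored data through a hypothetical disk_lookup callable. Buffering, eviction, and the parallel variants are not modeled.

```python
def semi_stream_join(stream, cache, disk_lookup):
    """Join (key, payload) stream tuples with master data in two sequential phases.

    cache is an assumed in-memory dict of frequently used master records;
    disk_lookup is a hypothetical callable that returns the stored record
    for a key, or None if there is no match.
    """
    joined, pending = [], []

    # Stream-probing (SP) phase: match each stream tuple against the cache.
    for key, payload in stream:
        if key in cache:
            joined.append((key, payload, cache[key]))
        else:
            pending.append((key, payload))

    # Disk-probing (DP) phase: resolve the remaining tuples against stored data.
    for key, payload in pending:
        master = disk_lookup(key)
        if master is not None:
            joined.append((key, payload, master))

    return joined


# Hypothetical sample data: a skewed stream in which key 1 dominates.
cache = {1: "gold"}
disk = {1: "gold", 2: "silver"}
stream = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
result = semi_stream_join(stream, cache, disk.get)
```

In this form the DP loop cannot start until the SP loop has finished, which is the sequential execution the abstract identifies as the limit on service rate.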
