Handling data-skewness in character based string similarity join using Hadoop

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Kanak Meena ◽  
Devendra K. Tayal ◽  
Oscar Castillo ◽  
Amita Jain

Data skewness threatens the scalability of similarity joins and is pervasive in scientific data. Skewness means that attribute values are unevenly distributed, which can cause severe load imbalance. When database join operations are applied to such datasets, skewness increases exponentially, and the join algorithms developed to date are highly skew sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. No such work exists in the literature for character-based string similarity joins, although work exists for set-based string similarity joins. The proposed work is divided into three stages, each further divided into mapper and reducer phases dedicated to a specific task. The first stage finds the lengths of the strings in a dataset. The second stage uses the MR-Pass Join framework to generate valid candidate pairs. The third stage, which is further divided into four MapReduce phases, incorporates MRFA concepts into the string similarity join; the result is named “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join) and is proposed to handle skewness in the string similarity join. Experiments were run with the Hadoop framework on three datasets: DBLP, Query log and a real dataset of IP addresses and cookies. The proposed algorithm was compared with three known algorithms; all three fail when the data is highly skewed, whereas the proposed method handles highly skewed data. A 15-node cluster was used, and the skewness factor was analysed following the Zipf distribution law. The existing techniques survived up to a Zipf factor of 0.5, whereas the proposed algorithm survives up to a Zipf factor of 1. Hence the proposed algorithm is skew insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures an even distribution of attributes.
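To illustrate why a rising Zipf factor can overload individual reducers, the following minimal Python sketch computes the relative load per reducer when join keys follow a Zipf distribution. It is not the MRFA-SSJ algorithm from the paper. The function names, the 1000-key universe and the modulo partitioner (standing in for a plain hash partitioner) are assumptions chosen for illustration; only the 15 reducers mirror the 15-node cluster mentioned above.

```python
def zipf_weights(num_keys, s):
    """Normalized Zipf frequencies: the key of rank r has weight proportional to 1 / r**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def reducer_loads(num_keys, s, num_reducers):
    """Share of all records each reducer receives under modulo partitioning (assumed partitioner)."""
    loads = [0.0] * num_reducers
    for key_id, weight in enumerate(zipf_weights(num_keys, s)):
        # Every record with the same key goes to the same reducer,
        # so one very frequent key cannot be spread across reducers.
        loads[key_id % num_reducers] += weight
    return loads


def imbalance(num_keys, s, num_reducers):
    """Ratio of the most loaded reducer to the mean load (1.0 means perfectly balanced)."""
    loads = reducer_loads(num_keys, s, num_reducers)
    mean = sum(loads) / num_reducers
    return max(loads) / mean


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(f"Zipf factor {s}: imbalance {imbalance(1000, s, 15):.3f}")
```

In this toy model the imbalance ratio grows as the Zipf factor moves from 0 to 1, which is the kind of load imbalance that a skew-aware partitioning scheme has to counteract.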

Tomography ◽  
2021 ◽  
Vol 7 (1) ◽  
pp. 39-54
Author(s):  
Veerle Kersemans ◽  
Stuart Gilchrist ◽  
Philip Danny Allen ◽  
Sheena Wallington ◽  
Paul Kinchesh ◽  
...  

Standardisation of animal handling procedures for a wide range of preclinical imaging scanners will improve imaging performance and reproducibility of scientific data. Whilst there has been significant effort in defining how well scanners should operate and how in vivo experimentation should be practised, there is little detail on how to achieve optimal scanner performance with best practices in animal welfare. Here, we describe a system-agnostic, adaptable and extensible animal support cradle system for cardio-respiratory-synchronised, and other, multi-modal imaging of small animals. The animal support cradle can be adapted on a per-application basis and features integrated tubing for anaesthetic and tracer delivery, an electrically driven rectal temperature maintenance system and respiratory and cardiac monitoring. Through a combination of careful material and device selection, we have described an approach that allows animals to be transferred whilst under general anaesthesia between any of the tomographic scanners we currently operate or have previously operated. The set-up is minimally invasive, cheap and easy to implement for multi-modal, multi-vendor imaging of small animals.


2013 ◽  
Vol 25 (10) ◽  
pp. 2217-2230 ◽  
Author(s):  
Chuitian Rong ◽  
Wei Lu ◽  
Xiaoli Wang ◽  
Xiaoyong Du ◽  
Yueguo Chen ◽  
...  

GigaScience ◽  
2020 ◽  
Vol 9 (8) ◽  
Author(s):  
Stefan Paulus ◽  
Anne-Katrin Mahlein

Background: The use of hyperspectral cameras is well established in the field of plant phenotyping, especially as a part of high-throughput routines in greenhouses. Nevertheless, the workflows used differ depending on the applied camera, the plants being imaged, the experience of the users, and the measurement set-up. Results: This review describes a general workflow for the assessment and processing of hyperspectral plant data at greenhouse and laboratory scale. Aiming at a detailed description of possible error sources, a comprehensive literature review of possibilities to overcome these errors and influences is provided. The processing of hyperspectral data of plants starting from the hardware sensor calibration, the software processing steps to overcome sensor inaccuracies, and the preparation for machine learning is shown and described in detail. Furthermore, plant traits extracted from spectral hypercubes are categorized to standardize the terms used when describing hyperspectral traits in plant phenotyping. A scientific data perspective is introduced covering information for canopy, single organs, plant development, and also combined traits coming from spectral and 3D measuring devices. Conclusions: This publication provides a structured overview on implementing hyperspectral imaging into biological studies at greenhouse and laboratory scale. Workflows have been categorized to define a trait-level scale according to their metrological level and the processing complexity. A general workflow is shown to outline procedures and requirements to provide fully calibrated data of the highest quality. This is essential for differentiation of the smallest changes from hyperspectral reflectance of plants, to track and trace hyperspectral development as an answer to biotic or abiotic stresses.


2014 ◽  
Vol 7 (8) ◽  
pp. 625-636 ◽  
Author(s):  
Yu Jiang ◽  
Guoliang Li ◽  
Jianhua Feng ◽  
Wen-Syan Li

2014 ◽  
Vol 9 (10) ◽  
Author(s):  
Peisen Yuan ◽  
Haoyun Wang ◽  
Jianghua Che ◽  
Shougang Ren ◽  
Huanliang Xu ◽  
...  

2020 ◽  
Author(s):  
Annika Kuß ◽  
Dagmar Kubistin ◽  
Robert Holla ◽  
Christian Plaß-Dülmer ◽  
Erasmus Tensing ◽  
...  

As a toxic and reactive gas, nitrogen dioxide (NO₂) influences air quality and health, the self-cleaning power of the atmosphere and photochemical smog formation. Reliable scientific data with high quality and comparability are required for national and international decision-makers. The quality of the NO₂ measurements is crucially dependent on the quality of the calibration standards. In order to achieve the quality goals required, the MetNO2 project within the EMPIR (European Metrology Program for Innovation and Research) program aims to provide accurate and stable NO₂ calibration standards for operational use at air quality stations. To characterise the impurities of the newly developed standards a Thermal Dissociation - Cavity Attenuated Phase Shift (TD - CAPS) system has been set up, based on the design from Sadanaga et al. (2016). The device includes four heated channels for the differentiation of NO₂, peroxy and alkyl nitrates and HNO₃. In parallel, a gold converter coupled with a chemiluminescence detector was deployed for detection of the total sum of NOy. First results of the performance of the TD-CAPS used for impurity analysis of NO₂ standards will be presented. Reference: Sadanaga et al. Review of Scientific Instruments 87.7 (2016), 074102


2011 ◽  
Vol 21 (4) ◽  
pp. 437-461 ◽  
Author(s):  
Jianhua Feng ◽  
Jiannan Wang ◽  
Guoliang Li
