A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

2021 ◽  
Vol 2 (5) ◽  
Author(s):  
Anh-Cang Phan ◽  
Thuong-Cang Phan ◽  
Thanh-Ngoan Trieu ◽  
Thi-To-Quyen Tran


Author(s):
Jonathan D. Realmuto ◽  
Suresh B. Sadineni ◽  
Srikanth Madala ◽  
Robert F. Boehm

The photovoltaic (PV) industry has seen remarkable progress in recent years, especially in materials and cell architecture. The potential of these technologies is investigated in a high-insolation region of the Southwestern United States, namely Las Vegas, where an abundance of surrounding barren land is available for large-scale installations. The basis of this study is an experimental comparison of different PV technologies (HIT-Si, poly-c-Si, a-Si, and triple-junction a-Si) under identical climatic conditions. All tested modules share the same operating conditions, i.e., the same installation plane, geographic location, and climate. The experiment verifies the temperature independence of thin-film modules and the superior performance of HIT-Si, and summarizes the winter energy production of popular technologies in this climate. Lastly, an economic analysis compares the different technologies for prospective utility-scale PV installations in southern Nevada and similar climatic regions.


2005 ◽  
Vol 68 (10) ◽  
pp. 2163-2168 ◽  
Author(s):  
RICHARD PEPPERELL ◽  
CAROL-ANN REID ◽  
SILVIA NICOLAU SOLANO ◽  
MICHAEL L. HUTCHISON ◽  
LISA D. WALTERS ◽  
...  

Bovine sides, ovine carcasses, and porcine carcasses were individually inoculated by dipping in various suspensions of a marker organism (Escherichia coli K-12 or Pseudomonas fluorescens), alone or in combination with two meat-derived bacterial strains, and were sampled by two standard methods: cotton wet-dry swabbing and excision. The samples were examined for bacterial counts on plate count agar (PCA plate counts) and on violet red bile glucose agar (VRBGA plate counts) by standard International Organization for Standardization (ISO) methods. Average bacterial recoveries by swabbing, expressed as a percentage of the corresponding recoveries achieved by excision, varied widely (2 to 100%). Several factors that potentially contributed to the relatively low and highly variable bacterial recoveries obtained by swabbing were investigated in separate experiments. Neither the size of the swabbed area (10, 50, or 100 cm² on beef carcasses) nor the time of swabbing (20 or 60 min after inoculation of pig carcasses) had a significant effect on the swabbing recoveries of the marker organism used. In an experiment with swabs preinoculated with the marker organism and then used for carcass swabbing, on average 12% of the total bacterial load was transferred in the reverse direction (i.e., from the swab to the carcass) during the standard swabbing procedure. In another experiment, on average 14% of the total bacterial load was not released from the swab into the diluent during standard swab homogenization. Custom-made swabs with abrasive butts, around which metal pieces of pan scourers were wound, markedly increased PCA plate count recoveries from noninoculated lamb carcasses at commercial abattoirs compared with cotton swabs. In spite of the observed inferiority of cotton wet-dry swabbing to excision for bacterial recovery, the former is clearly preferred by the meat industry because it does not damage the carcass. Therefore, a further large-scale evaluation of the two carcass sampling methods has been undertaken under commercial conditions and is reported separately.


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 407
Author(s):  
Jiayan Shen ◽  
Xiucheng Guo ◽  
Wenzong Zhou ◽  
Yiming Zhang ◽  
Juchen Li

Aerial images are large-scale and susceptible to lighting changes, and traditional feature point matching algorithms cannot achieve satisfactory matching accuracy on them. This paper proposes a recursive diffusion algorithm that is scale-invariant and can be used to extract symmetrical areas from different images. It narrows the matching range of feature points by extracting high-density areas of the image and improves matching accuracy through correlation analysis of those high-density areas. Experimental comparison shows that the recursive diffusion algorithm achieves higher matching accuracy on aerial images than the correlation coefficient method and the mean shift algorithm, especially when the lighting of the aerial images changes greatly.
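The abstract does not spell out the recursive diffusion steps, but the correlation-analysis stage it describes can be illustrated with a classical building block: normalized cross-correlation (NCC) between candidate high-density regions. The following is a minimal sketch, assuming the density-based extraction step has already produced lists of same-sized grayscale patches; `ncc` and `match_regions` are hypothetical names, not the paper's implementation.

```python
import numpy as np

def ncc(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    """Normalized cross-correlation between two equally sized patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def match_regions(regions_a, regions_b, threshold=0.8):
    """Greedily pair high-density regions from two images by peak NCC.

    regions_a / regions_b are lists of same-sized grayscale patches,
    the hypothetical output of the density-based extraction step.
    """
    matches = []
    for i, region_a in enumerate(regions_a):
        scores = [ncc(region_a, region_b) for region_b in regions_b]
        j = int(np.argmax(scores))
        if scores[j] >= threshold:
            matches.append((i, j, scores[j]))
    return matches
```

Restricting the correlation analysis to extracted high-density regions, rather than to every feature point, is what narrows the matching range as described above.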


2021 ◽  
pp. 1-18
Author(s):  
Salahaldeen Rababa ◽  
Amer Al-Badarneh

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data, but it has limitations when processing heterogeneous datasets because of the large number of redundant intermediate records transferred through the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, this paper presents adaptive filter-based join algorithms. Specifically, three join algorithms are introduced that perform filter creation and redundant-record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively, and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
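The paper's adaptive algorithms are not reproduced in the abstract, but the core filter-based join idea is standard: build a Bloom filter over the join keys of one dataset and use it to discard redundant records from the other before the actual join. Below is a minimal single-process sketch in plain Python (not MapReduce); the `BloomFilter` and `bloom_join` helpers are illustrative assumptions, not the paper's code.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over string keys (illustrative, not tuned)."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def bloom_join(left, right):
    """Join two lists of (key, value) pairs, pruning right-side records
    whose keys cannot appear in `left`. The filter admits false positives,
    so surviving records are still verified by the hash join."""
    bf = BloomFilter()
    for key, _ in left:
        bf.add(key)
    candidates = [(k, v) for k, v in right if bf.might_contain(k)]
    table = {}
    for k, v in left:
        table.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, rv in candidates for lv in table.get(k, [])]
```

In a MapReduce setting, the same filter would be built and applied so that redundant records are dropped before the shuffle, which is where the I/O savings the abstract reports come from.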


2014 ◽  
Vol 21 (2) ◽  
pp. 129-145
Author(s):  
Hans Westerbeek ◽  
Marije van Amelsvoort ◽  
Alfons Maes ◽  
Marc Swerts

We present an analytic and a large-scale experimental comparison of two informationally equivalent displays of soccer statistics. Both displays were presented by the BBC during the 2010 FIFA World Cup. The displays differ mainly in the number and types of cognitively natural mappings between visual variables and meaning. In theory, such natural form-meaning mappings help users interpret the information quickly and easily. However, our analysis indicates that the design containing most of these mappings is inevitably inconsistent in how forms and meanings are mapped to each other. The experiment shows that this inconsistency was detrimental to how quickly people could find information in the display and affected which display they preferred to use. Our findings shed new light on the well-established cognitive design principle of natural mapping: while in theory information designs may benefit from natural mapping, in practice its applicability may be limited. Information designs that contain a high number of form-meaning mappings, for example for aesthetic reasons, risk being inconsistent and too complex for users, leading them to find information less quickly and less easily.


2016 ◽  
Vol 4 (4) ◽  
pp. 508-530 ◽  
Author(s):  
CHRISTIAN L. STAUDT ◽  
ALEKSEJS SAZONOVS ◽  
HENNING MEYERHENKE

We introduce NetworKit, an open-source software package for analyzing the structure of large complex networks. Appropriate algorithmic solutions are required to handle increasingly common large graph data sets containing up to billions of connections. We describe the methodology applied to develop scalable solutions to network analysis problems, including techniques such as parallelization, heuristics for computationally expensive problems, efficient data structures, and modular software architecture. Our goal for the software is to package the results of our algorithm engineering efforts and put them into the hands of domain experts. NetworKit is implemented as a hybrid that combines performance-critical kernels written in C++ with a Python frontend, enabling integration into the Python ecosystem of tested tools for data analysis and scientific computing. The package provides a wide range of functionality (including common and novel analytics algorithms and graph generators) through a convenient interface. In an experimental comparison with related software, NetworKit shows the best performance on a range of typical analysis tasks.
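As a small illustration of the hybrid C++/Python design described above, here is a minimal NetworKit session using the Python frontend. Exact module paths and signatures may vary between NetworKit versions, so treat this as a sketch rather than canonical usage.

```python
import networkit as nk

nk.setNumberOfThreads(4)  # the heavy lifting runs in parallel C++ kernels

# Generate a random graph rather than loading one, to keep this self-contained.
G = nk.generators.ErdosRenyiGenerator(10_000, 0.001).generate()
print(G.numberOfNodes(), "nodes,", G.numberOfEdges(), "edges")

# Connected components.
cc = nk.components.ConnectedComponents(G)
cc.run()
print("components:", cc.numberOfComponents())

# Sampling-based betweenness approximation, one of the heuristics used
# for computationally expensive problems.
bc = nk.centrality.EstimateBetweenness(G, 100)
bc.run()
print("top nodes by betweenness:", bc.ranking()[:5])
```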


2020 ◽  
Vol 54 (2) ◽  
pp. 1-2
Author(s):  
Harrie Oosterhuis

Ranking systems form the basis of online search engines and recommendation services. They process large collections of items, for instance web pages or e-commerce products, and present the user with a small ordered selection. The goal of a ranking system is to help a user find the items they are looking for with the least amount of effort; thus the rankings it produces should place the most relevant or preferred items at the top. Learning to rank is a field within machine learning that covers methods which optimize ranking systems w.r.t. this goal. Traditional supervised learning to rank methods use expert judgements to evaluate and learn; however, in many situations such judgements are impossible or infeasible to obtain. As a solution, methods have been introduced that perform learning to rank based on user clicks instead. The difficulty with clicks is that they are affected not only by user preferences but also by which rankings were displayed. These methods therefore have to avoid being biased by factors other than user preference. This thesis concerns learning to rank methods based on user clicks and specifically aims to unify the different families of these methods.

The first part of the thesis consists of three chapters on online learning to rank algorithms, which learn by directly interacting with users. The first chapter considers large-scale evaluation and shows that existing methods do not guarantee correctness and user experience; we then introduce a novel method that can guarantee both. The second chapter proposes a novel pairwise method for learning from clicks that contrasts with the previously prevalent dueling-bandit methods; our experiments show that the pairwise method greatly outperforms the dueling-bandit approach. The third chapter confirms these findings in an extensive experimental comparison and furthermore shows that the theory behind the dueling-bandit approach is unsound w.r.t. deterministic ranking systems.

The second part of the thesis consists of four chapters on counterfactual learning to rank algorithms, which learn from historically logged click data. The first chapter takes the existing approach and makes it applicable to top-k settings where not all items can be displayed at once; it also shows that state-of-the-art supervised learning to rank methods can be applied in the counterfactual scenario. The second chapter introduces a method that combines the robust generalization of feature-based models with the high-performance specialization of tabular models. The third chapter looks at evaluation and introduces a method for finding the optimal logging policy, one that collects click data in a way that minimizes the variance of estimated ranking metrics; applying this method while gathering clicks turns counterfactual evaluation into online evaluation. The fourth chapter proposes a novel counterfactual estimator that accounts for the possibility that the logging policy was updated during the gathering of click data; as a result, it can learn much more efficiently when deployed in an online scenario where interventions can take place. The resulting approach is thus both online and counterfactual, and our experimental results show that its performance matches the state of the art in both the online and the counterfactual scenario. As a whole, the second part of this thesis proposes a framework that bridges many gaps between online, counterfactual, and supervised learning to rank: it takes approaches previously considered independent and unifies them into a single methodology for widely applicable and effective learning to rank from user clicks.

Awarded by: University of Amsterdam, Amsterdam, The Netherlands. Supervised by: Maarten de Rijke. Available at: https://hdl.handle.net/11245.1/8ff3aa38-97fb-4d2a-8127-a29a03af4d5c.
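As context for the counterfactual part of the thesis, a standard building block in this literature (not necessarily the thesis's exact estimator) is inverse propensity scoring: each click is reweighted by the probability that the logged ranker exposed the item, which corrects for position bias. A minimal sketch, assuming per-item examination propensities are available in the logs:

```python
import numpy as np

def ips_dcg_estimate(logged_sessions):
    """Inverse-propensity-scored estimate of a new ranking policy's DCG
    from logged click data (an illustrative, generic estimator).

    Each session is a list of (clicked, propensity, new_rank) tuples:
    `clicked` is the logged click, `propensity` the probability that the
    logging policy let the user examine the item, and `new_rank` the
    position (1-based) the policy under evaluation would assign it.
    """
    total = 0.0
    for session in logged_sessions:
        for clicked, propensity, new_rank in session:
            if clicked:
                # Weighting each click by 1/propensity corrects for the
                # fact that low-ranked items were rarely examined.
                total += (1.0 / propensity) * (1.0 / np.log2(new_rank + 1))
    return total / len(logged_sessions)
```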


Entropy ◽  
2020 ◽  
Vol 22 (6) ◽  
pp. 643 ◽  
Author(s):  
Qianchen Xia ◽  
Jianghua Lv ◽  
Shilong Ma ◽  
Bocheng Gao ◽  
Zhenhua Wang

With the development of online advertising technology, accurate targeted advertising based on user preferences is clearly better suited to both the market and users. Conversions can be increased by predicting the user's purchasing intention via the advertising Conversion Rate (CVR). Given the high-dimensional and sparse characteristics of historical behavior sequences, this paper proposes an LSLM_LSTM model for advertising CVR prediction on large-scale sparse data. The model minimizes its loss using the Adaptive Moment Estimation (Adam) optimization algorithm and automatically mines the nonlinear patterns hidden in the data. Experimental comparison with a variety of typical CVR prediction models shows that the proposed LSLM_LSTM model exploits the time series characteristics of user behavior sequences more effectively and mines the potential relationships hidden in the features, bringing higher accuracy and faster training than models that consider only low-order or high-order features.
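The abstract names the ingredients (sparse behavior sequences, an LSTM, the Adam optimizer, a binary conversion label) without specifying the LSLM_LSTM architecture, so the following Keras sketch is only a generic illustration of those ingredients; the vocabulary size, sequence length, and layer widths are all assumptions.

```python
import tensorflow as tf

VOCAB_SIZE = 100_000   # number of sparse behavior/feature ids (assumed)
SEQ_LEN = 50           # length of a user's recent behavior sequence (assumed)

# Embedding turns sparse ids into dense vectors; the LSTM captures the
# temporal order of behavior events; the sigmoid head outputs P(conversion).
inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, 32, mask_zero=True)(inputs)
x = tf.keras.layers.LSTM(64)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # Adam, as in the paper
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)
# model.fit(behavior_sequences, conversion_labels, batch_size=512, epochs=3)
```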

