MapReduce Algorithm for Variants of Skyline Queries: Skyband and Dominating Queries

The skyline query and its variants are useful in the early stages of a knowledge-discovery process. They select a set of important objects that are better than the other, common objects in the dataset. To handle big data, such knowledge-discovery queries must be computed in parallel distributed environments. In this paper, we consider an efficient parallel algorithm for the “K-skyband query” and the “top-k dominating query”, which are popular variants of the skyline query. We propose a method for computing both queries simultaneously in MapReduce, a popular parallel distributed framework for processing “big data” problems. Our extensive evaluation results validate the effectiveness and efficiency of the proposed algorithm on both real and synthetic datasets.
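To make the two queries concrete, the following is a minimal single-machine sketch, not the paper's MapReduce algorithm. It assumes smaller values are better in every dimension and uses the convention that the K-skyband holds points dominated by fewer than K others (so K = 1 gives the skyline); other sources count from K = 0. The function names `dominance_counts` and `skyband_and_top_dominating` are hypothetical. One pairwise pass produces both counts, which is why the two queries can share work.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller values are better in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_counts(points: List[Point]) -> Tuple[List[int], List[int]]:
    """One pairwise pass: per point, how many points dominate it and how many it dominates."""
    dominated_by = [0] * len(points)
    dominating = [0] * len(points)
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and dominates(p, q):
                dominating[i] += 1
                dominated_by[j] += 1
    return dominated_by, dominating


def skyband_and_top_dominating(points: List[Point], band_k: int, top_k: int):
    """Answer both queries from the same pair of counts."""
    dominated_by, dominating = dominance_counts(points)
    # K-skyband (convention assumed here): dominated by fewer than band_k points.
    skyband = [p for p, c in zip(points, dominated_by) if c < band_k]
    # Top-k dominating: the top_k points that dominate the most other points.
    ranked = sorted(zip(points, dominating), key=lambda pc: -pc[1])
    top = [p for p, _ in ranked[:top_k]]
    return skyband, top


points = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(skyband_and_top_dominating(points, band_k=1, top_k=1))
# ([(1, 4), (2, 2), (4, 1)], [(2, 2)])
```

The quadratic pairwise pass is only for illustration; a distributed version would partition the data and merge partial counts.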


Researching Why-Not Questions in Skyline Query Based on Orthogonal Range

Electronics ◽

10.3390/electronics9030500 ◽

2020 ◽

Vol 9 (3) ◽

pp. 500 ◽

Cited By ~ 1

Author(s):

Ping Sun ◽

Caimei Liang ◽

Guohui Li ◽

Ling Yuan

Keyword(s):

Experimental Results ◽

Skyline Query ◽

High Quality ◽

Query Refinement ◽

Skyline Queries ◽

The Real ◽

Query Efficiency ◽

Synthetic Datasets

This paper aims to answer “why-not” questions in skyline queries based on the orthogonal query range (i.e., ORSQ). These queries retrieve skyline points within a rectangular query range, which improves query efficiency. Answering why-not questions in ORSQ can help users analyze query results and make decisions. We discuss the causes of why-not questions in ORSQ. Then, we outline how to modify the why-not point and the orthogonal query range so that the why-not point is included in the result of the skyline query based on the orthogonal range. When the why-not point is in the orthogonal range, we show how to modify the why-not point and narrow the orthogonal range. We also present how to expand the orthogonal range when the why-not point is not in the orthogonal range. We effectively combine query refinement and data modification techniques to produce meaningful answers. The experimental results demonstrate that the proposed algorithms produce high-quality explanations for why-not questions in ORSQ on both real and synthetic datasets.
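The two range refinements described above can be illustrated with a small sketch. It is not the paper's algorithm: it assumes smaller values are better, uses a brute-force skyline, and the function name `orthogonal_range_skyline` and the sample points are hypothetical. It shows a why-not point entering the result once the range is narrowed (when the point is inside the range but dominated) or expanded (when the point is outside the range).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Assumption: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def orthogonal_range_skyline(points: List[Point], low: Point, high: Point) -> List[Point]:
    """Skyline of the points that fall inside the axis-aligned box [low, high]."""
    in_range = [p for p in points
                if all(l <= x <= h for x, l, h in zip(p, low, high))]
    return [p for p in in_range if not any(dominates(q, p) for q in in_range)]


points = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
why_not = (3, 3)

# Case 1: the why-not point is inside the range but dominated by (2, 2).
print(orthogonal_range_skyline(points, (0, 0), (5, 5)))      # [(1, 4), (2, 2), (4, 1)]
# Narrowing the range excludes the dominating point.
print(orthogonal_range_skyline(points, (2.5, 2.5), (5, 5)))  # [(3, 3)]

# Case 2: the why-not point lies outside the range.
print(orthogonal_range_skyline(points, (3.5, 0), (5, 5)))    # [(4, 1)]
# Expanding the range brings it in; here no in-range point dominates it.
print(orthogonal_range_skyline(points, (3, 0), (5, 5)))      # [(4, 1), (3, 3)]
```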


SetSketch

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476276 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2244-2257

Author(s):

Otmar Ertl

Keyword(s):

Big Data ◽

Data Structure ◽

Data Structures ◽

Similarity Search ◽

State Of The Art ◽

Use Cases ◽

Distributed Environments ◽

Jaccard Similarity ◽

Big Data Applications ◽

Better Than

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting distinct elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or Hyper-MinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
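The properties named above (commutative, idempotent inserts and a mergeable state) can be seen in plain MinHash, which the sketch below implements; it is not SetSketch itself. The helper names, the use of salted BLAKE2b as a hash family, and the choice of 128 hash functions are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List


def _hash(item: str, seed: int) -> int:
    """64-bit hash of item; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: Iterable[str], num_hashes: int = 128) -> List[int]:
    """Per hash function, keep the minimum hash value over a non-empty set of items."""
    items = list(items)
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def merge(sig_a: List[int], sig_b: List[int]) -> List[int]:
    """Signature of the union: element-wise minimum (commutative and idempotent)."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = [str(i) for i in range(100)]
b = [str(i) for i in range(50, 150)]
sig_a, sig_b = minhash_signature(a), minhash_signature(b)

# Merging two sketches gives exactly the sketch of the union.
assert merge(sig_a, sig_b) == minhash_signature(set(a) | set(b))
# The true Jaccard similarity is 50 / 150; the estimate is noisy around it.
print(estimate_jaccard(sig_a, sig_b))
```

Because merging is an element-wise minimum, partial sketches built on different workers can be combined in any order, which is the property that makes such sketches convenient in distributed settings.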


Sampling-based approximate skyline calculation on big data

Discrete Mathematics Algorithms and Applications ◽

10.1142/s1793830922500240 ◽

2021 ◽

Author(s):

Xingxing Xiao ◽

Jianzhong Li

Keyword(s):

Exact Solution ◽

Big Data ◽

Approximate Solution ◽

Error Analysis ◽

Linear Time ◽

Approximate Algorithms ◽

Skyline Query ◽

Fixed Size ◽

Skyline Queries ◽

Input Size

Nowadays, big data is coming to the fore in many applications. Processing a skyline query on big data in more than linear time is far too expensive, and often even linear time may be too slow. It is obviously not possible to compute an exact solution to a skyline query in sublinear time, since an exact solution may itself have linear size. Fortunately, in many situations, a fast approximate solution is more useful than a slower exact solution. This paper proposes two sampling-based approximate algorithms for processing skyline queries. The first algorithm obtains a fixed-size sample and computes the approximate skyline on it. The error of the algorithm is not only relatively small in most cases but is also almost unaffected by the input size. The second algorithm efficiently returns an [Formula: see text]-approximation for the exact skyline. In practice, the running time of the algorithm is independent of the input size, achieving the goal of sublinearity on big data. Experiments verify the error analysis of the first algorithm and show that the second is much faster than existing skyline algorithms.
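The idea behind the first algorithm (draw a fixed-size sample, then compute the skyline of the sample) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it assumes smaller values are better, uses uniform sampling without replacement and a brute-force skyline, and the function name `sampled_skyline` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Assumption: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Brute-force exact skyline of the given points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def sampled_skyline(points: List[Point], sample_size: int, seed: int = 0) -> List[Point]:
    """Approximate skyline: the exact skyline of a fixed-size uniform sample.

    The skyline step costs O(sample_size^2) comparisons regardless of len(points).
    Returned points are real input points, but some may be dominated by
    points that were not sampled.
    """
    rng = random.Random(seed)
    if len(points) <= sample_size:
        sample = list(points)
    else:
        sample = rng.sample(points, sample_size)
    return skyline(sample)


gen = random.Random(1)
data = [(gen.random(), gen.random()) for _ in range(1000)]
approx = sampled_skyline(data, sample_size=50)
print(len(approx), "approximate skyline points from a sample of 50")
```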


Comparative Study on Skyline Query Processing Techniques on Big Data

2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) ◽

10.1109/i-smac49090.2020.9243343 ◽

2020 ◽

Cited By ~ 1

Author(s):

Praveen Kumar Sadineni

Keyword(s):

Big Data ◽

Comparative Study ◽

Query Processing ◽

Skyline Query ◽

Skyline Query Processing ◽

Processing Techniques


Teaching research on college English translation in the era of big data

International Journal of Electrical Engineering Education ◽

10.1177/0020720920984316 ◽

2021 ◽

pp. 002072092098431

Author(s):

Yue Liu ◽

Hongyan Bai

Keyword(s):

Big Data ◽

Learning Theory ◽

English Translation ◽

Design Theory ◽

Colleges And Universities ◽

Translation Process ◽

Constructivist Learning Theory ◽

College English ◽

Teaching Time ◽

Better Than

With the development of the big data era and the opening of translation majors in colleges and universities, translation teaching is gradually receiving attention. However, there are still many problems in the training of translators in colleges and universities in terms of teachers, teaching time and teaching mode. In the context of the era of big data, this article uses questionnaires and data analysis, starting from the PACTE translation ability model and combining constructivist learning theory, blended learning theory, and instructional design theory, to analyze the problems of undergraduate translation ability. This article conducts a questionnaire survey on the 2018 students of XX University’s a major, and analyzes their English scores. Students’ bilingual ability is weak, and they find it difficult to take context into account in the translation process; their strategic ability is not ideal, and they lack the ability to solve specific translation problems when they encounter them. The English performance of the experimental class students who have undergone English translation teaching for one semester is significantly better than that of the control class students who have not received English translation teaching. Teachers can combine teaching theories to design English translation teaching and cultivate students’ awareness of comparative analysis in English learning. Teachers can cultivate students’ English thinking ability, help them master English better, and help them improve their English application ability.


Application of Improved Recommendation System Based on Spark Platform in Big Data Analysis

Cybernetics and Information Technologies ◽

10.1515/cait-2016-0092 ◽

2016 ◽

Vol 16 (6) ◽

pp. 245-255 ◽

Cited By ~ 1

Author(s):

Li Xie ◽

Wenbo Zhou ◽

Yaosen Li

Keyword(s):

Big Data ◽

Recommender System ◽

Recommendation System ◽

Parallel Implementation ◽

Implementation Process ◽

Recommendation Algorithm ◽

Filtration Problem ◽

Information Filtration ◽

Similarity Information ◽

Better Than

In the era of big data, people have to face the information filtration problem. When users do not or cannot express their demands clearly, a recommender system can analyse users’ information more proactively and intelligently to filter out what they want. This property gives recommender systems a very important role in fields such as e-commerce and social networks. The collaborative filtering recommendation algorithm based on Alternating Least Squares (ALS) is one of the common algorithms that use the matrix factorization technique in recommendation systems. In this paper, we design the parallel implementation process of the recommendation algorithm based on the Spark platform and study the related technology of recommendation systems. To address the shortcomings of the recommendation algorithm based on the ALS model, a new loss function is designed. Before the model is trained, the similarity information of users and items is fused. The experimental results show that the performance of the proposed algorithm is better than that of the algorithm based on ALS.
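For orientation, the baseline objective that ALS-style matrix factorization usually minimizes can be written as a short function. This is the standard regularized squared-error loss, shown as an assumption about the baseline; the abstract does not specify the paper's new loss function, so it is not reproduced here. The function name, the dictionary-based data layout, and the regularization weight `lam` are hypothetical.

```python
from typing import Dict, List, Tuple


def regularized_squared_loss(ratings: Dict[Tuple[int, int], float],
                             user_factors: Dict[int, List[float]],
                             item_factors: Dict[int, List[float]],
                             lam: float) -> float:
    """Sum of squared errors on observed ratings plus an L2 penalty on all factors.

    ALS alternates between fixing item_factors and solving for user_factors
    (a least-squares problem), then the reverse, to reduce this value.
    """
    error = 0.0
    for (user, item), rating in ratings.items():
        # Predicted rating is the dot product of the two latent vectors.
        predicted = sum(a * b for a, b in zip(user_factors[user], item_factors[item]))
        error += (rating - predicted) ** 2
    penalty = sum(x * x for vec in user_factors.values() for x in vec)
    penalty += sum(x * x for vec in item_factors.values() for x in vec)
    return error + lam * penalty


ratings = {(0, 0): 5.0, (0, 1): 3.0}
users = {0: [1.0, 2.0]}
items = {0: [1.0, 2.0], 1: [1.0, 1.0]}
print(regularized_squared_loss(ratings, users, items, lam=0.1))
```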


Evaluation of the effectiveness and efficiency of state-of-the-art features and models for automatic speech recognition error detection

Journal Of Big Data ◽

10.1186/s40537-020-00391-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Asmaa El Hannani ◽

Rahhal Errattahi ◽

Fatima Zahra Salmam ◽

Thomas Hain ◽

Hassan Ouahmane

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Detection ◽

State Of The Art ◽

Rapid Development ◽

Unified Framework ◽

Human Machine Interaction ◽

Detection Analysis ◽

Extensive Evaluation ◽

Effectiveness And Efficiency

Speech-based human-machine interaction and natural language understanding applications have seen rapid development and wide adoption over the last few decades. This has led to a proliferation of studies that investigate error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparisons of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both error detection and error type classification. We make three primary contributions throughout this paper: (1) we have compared our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural based models, and have shown that the V-RNN model is the most effective classifier for ASR error detection in terms of accuracy and speed, (2) we have compared four feature settings, corresponding to different categories of predictor features, and have shown that the generic features are particularly suitable for real-time ASR error detection applications, and (3) we have looked at the post generalization ability of our error detection framework and performed a detailed post detection analysis in order to perceive the recognition errors that are difficult to detect.


Day-Ahead Forecasting of Hourly Photovoltaic Power Based on Robust Multilayer Perception

Sustainability ◽

10.3390/su10124863 ◽

2018 ◽

Vol 10 (12) ◽

pp. 4863 ◽

Cited By ~ 6

Author(s):

Chao Huang ◽

Longpeng Cao ◽

Nanxin Peng ◽

Sijia Li ◽

Jing Zhang ◽

...

Keyword(s):

Power Plants ◽

Mean Squared Error ◽

Absolute Error ◽

Multilayer Perception ◽

Squared Error ◽

The Mean ◽

Effectiveness And Efficiency ◽

Mlp Network ◽

Grid Operation ◽

Better Than

Photovoltaic (PV) modules convert renewable and sustainable solar energy into electricity. However, the uncertainty of PV power production brings challenges for grid operation. To facilitate the management and scheduling of PV power plants, forecasting is an essential technique. In this paper, a robust multilayer perception (MLP) neural network was developed for day-ahead forecasting of hourly PV power. A generic MLP is usually trained by minimizing the mean squared loss. The mean squared error is sensitive to a few particularly large errors, which can lead to a poor estimator. To tackle the problem, the pseudo-Huber loss function, which combines the best properties of squared loss and absolute loss, was adopted in this paper. The effectiveness and efficiency of the proposed method were verified by benchmarking against a generic MLP network with real PV data. Numerical experiments illustrated that the proposed method performed better than the generic MLP network in terms of root mean squared error (RMSE) and mean absolute error (MAE).
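The robustness argument can be checked numerically with the usual definition of the pseudo-Huber loss, L(r) = delta^2 * (sqrt(1 + (r / delta)^2) - 1). The sketch below assumes this standard form and a scale parameter `delta` of 1.0; the abstract does not state the value used in the paper. For small residuals the loss behaves like r^2 / 2, and for large residuals it grows roughly linearly, so a few large errors pull on the fit far less than under squared loss.

```python
import math


def pseudo_huber(residual: float, delta: float = 1.0) -> float:
    """Pseudo-Huber loss: quadratic near zero, approximately linear for large residuals."""
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)


def pseudo_huber_gradient(residual: float, delta: float = 1.0) -> float:
    """Derivative with respect to the residual; its magnitude is bounded by delta."""
    return residual / math.sqrt(1.0 + (residual / delta) ** 2)


for r in (0.01, 1.0, 1000.0):
    squared = 0.5 * r * r
    print(f"residual={r:>8}: pseudo-Huber={pseudo_huber(r):.6f}, half squared={squared:.6f}")
```

The bounded gradient is the practical point: during training, an outlier contributes at most `delta` to the gradient signal, whereas under squared loss its contribution grows with the size of the error.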
