Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Data from many domains is accumulating rapidly in repositories across the globe. Extracting useful information from such large volumes is difficult because the stored data changes constantly. Data mining is a knowledge discovery process that extracts hidden information, in the form of patterns, from data stored in repositories known as warehouses. One of its most popular tasks is classification, which assigns every instance of a data set to one of a set of predefined class labels. Banking is a real-world domain that collects a large amount of client data on a daily basis. In this work, we collected two variants of the bank marketing data set from a Portuguese financial institution, consisting of 41188 and 45211 instances, and performed classification on them using two data reduction techniques. Attribute subset selection was performed on the first data set, and the training data with the selected features was used for classification. Principal Component Analysis was performed on the second data set, and the training data with the extracted features was used for classification. A deep neural network classification algorithm based on backpropagation was developed to classify both data sets. Finally, each deep neural network classifier was compared with four standard classifiers: decision trees, Naïve Bayes, support vector machines, and k-nearest neighbors. The deep neural network classifier outperformed the existing classifiers in terms of accuracy.


Comparison of Different Training Data Reduction Approaches for Fast Support Vector Machines Based on Principal Component Analysis and Distance Based Measurements

International Journal of Computational and Experimental Science and Engineering ◽ 10.22399/ijcesen.374222 ◽ 2018 ◽ Vol 4 (1) ◽ pp. 1-5 ◽ Cited By ~ 1

Author(s): Gür Emre Güraksın ◽ Harun Uğuz

Keyword(s): Principal Component Analysis ◽ Support Vector Machines ◽ Data Reduction ◽ Principal Component ◽ Component Analysis ◽ Training Data ◽ Support Vector ◽ Vector Machines


Training data reduction to speed up SVM training

Applied Intelligence ◽ 10.1007/s10489-014-0524-2 ◽ 2014 ◽ Vol 41 (2) ◽ pp. 405-420 ◽ Cited By ~ 7

Author(s): Senzhang Wang ◽ Zhoujun Li ◽ Chunyang Liu ◽ Xiaoming Zhang ◽ Haijun Zhang

Keyword(s): Data Reduction ◽ Training Data ◽ Speed Up


From Ethnographic Research to Big Data Analytics—A Case of Maritime Energy-Efficiency Optimization

Applied Sciences ◽ 10.3390/app10062134 ◽ 2020 ◽ Vol 10 (6) ◽ pp. 2134 ◽ Cited By ~ 1

Author(s): Yemao Man ◽ Tobias Sturm ◽ Monica Lundh ◽ Scott N. MacKinnon

Keyword(s): Energy Efficiency ◽ Big Data ◽ Data Analytics ◽ Management Practices ◽ Big Data Analytics ◽ Training Data ◽ Ethnographic Research ◽ Data Sets ◽ Shipping Industry ◽ One Year

The shipping industry constantly strives to achieve efficient use of energy during sea voyages. Previous research that takes advantage of both ethnographic studies and big data analytics to understand the factors contributing to fuel consumption, and to seek solutions that support decision making, is rather scarce. This paper first employed ethnographic research on the use of a commercially available fuel-monitoring system. This contextualized the real challenges on ships and informed the need for a big data approach to achieve energy efficiency (EE). The study then constructed two machine-learning models based on the recorded voyage data of five different ferries over a one-year period. The evaluation showed that the models generalize well on different training data sets, and the model outputs indicated a potential for better performance than the existing commercial EE system. How this predictive-analytical approach could impact the design of decision support navigational systems and management practices was also discussed. It is hoped that this interdisciplinary research could provide some enlightenment for a richer methodological framework in future maritime energy research.


Training data reduction for nonlinear state estimator

2015 10th Asian Control Conference (ASCC) ◽ 10.1109/ascc.2015.7244669 ◽ 2015

Author(s): Hiroaki Ishiyama ◽ Masaki Yamakita

Keyword(s): Data Reduction ◽ Training Data ◽ State Estimator ◽ Nonlinear State


Performance Models of Data Parallel DAG Workflows for Large Scale Data Analytics

2021 IEEE 37th International Conference on Data Engineering Workshops (ICDEW) ◽ 10.1109/icdew53142.2021.00026 ◽ 2021

Author(s): Juwei Shi ◽ Jiaheng Lu

Keyword(s): Data Analytics ◽ Large Scale ◽ Performance Models ◽ Data Parallel ◽ Large Scale Data ◽ Scale Data


Predictive privacy: towards an applied ethics of data analytics

Ethics and Information Technology ◽ 10.1007/s10676-021-09606-x ◽ 2021

Author(s): Rainer Mühlhoff

Keyword(s): Machine Learning ◽ Data Protection ◽ Data Analytics ◽ Predictive Analytics ◽ Applied Ethics ◽ Big Data Analytics ◽ Training Data ◽ Sensitive Information ◽ Ethical Implications ◽ Individual Privacy

Data analytics and data-driven approaches in Machine Learning are now among the most hailed computing technologies in many industrial domains. One major application is predictive analytics, which is used to predict sensitive attributes, future behavior, or cost, risk and utility functions associated with target groups or individuals based on large sets of behavioral and usage data. This paper stresses the severe ethical and data protection implications of predictive analytics if it is used to predict sensitive information about single individuals or treat individuals differently based on the data many unrelated individuals provided. To tackle these concerns in an applied ethics, first, the paper introduces the concept of “predictive privacy” to formulate an ethical principle protecting individuals and groups against differential treatment based on Machine Learning and Big Data analytics. Secondly, it analyses the typical data processing cycle of predictive systems to provide a step-by-step discussion of ethical implications, locating occurrences of predictive privacy violations. Thirdly, the paper sheds light on what is qualitatively new in the way predictive analytics challenges ethical principles such as human dignity and the (liberal) notion of individual privacy. These new challenges arise when predictive systems transform statistical inferences, which provide knowledge about the cohort of training data donors, into individual predictions, thereby crossing what I call the “prediction gap”. Finally, the paper summarizes that data protection in the age of predictive analytics is a collective matter as we face situations where an individual’s (or group’s) privacy is violated using data other individuals provide about themselves, possibly even anonymously.


Efficient data abstraction using weighted IB2 prototypes

Computer Science and Information Systems ◽ 10.2298/csis140212036o ◽ 2014 ◽ Vol 11 (2) ◽ pp. 665-678 ◽ Cited By ~ 3

Author(s): Stefanos Ougiaroglou ◽ Georgios Evangelidis

Keyword(s): Data Reduction ◽ Training Data ◽ Training Dataset ◽ Data Abstraction ◽ Prototype Selection ◽ Initial Training ◽ Reduction Techniques ◽ Efficient Data ◽ Data Reduction Technique ◽ Better Than

Data reduction techniques improve the efficiency of k-Nearest Neighbour classification on large datasets since they accelerate the classification process and reduce storage requirements for the training data. IB2 is an effective prototype selection data reduction technique. It selects some items from the initial training dataset and uses them as representatives (prototypes). Contrary to many other techniques, IB2 is a very fast, one-pass method that builds its reduced (condensing) set in an incremental manner. New training data can update the condensing set without the need for the "old" removed items. This paper proposes a variation of IB2 that generates new prototypes instead of selecting them. The variation is called AIB2 and attempts to improve the efficiency of IB2 by positioning the prototypes in the center of the data areas they represent. The empirical experimental study conducted in the present work, as well as the Wilcoxon signed ranks test, show that AIB2 performs better than IB2.
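
The one-pass condensing described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `nearest` and `condense`, the `recenter` flag, the Euclidean metric, the toy data, and the running-mean update used to approximate AIB2's centering of prototypes are all assumptions made for the example.

```python
import math

def nearest(prototypes, x):
    """Return the index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(prototypes[i][0], x))

def condense(data, recenter=False):
    """One-pass, IB2-style condensing (illustrative sketch).

    data: iterable of (features, label) pairs.
    recenter=False: plain IB2, prototypes are selected training items.
    recenter=True: assumed AIB2-style update, where each prototype is kept
        at the running mean of the same-class items it has absorbed.
    Returns a list of [features, label, weight] entries.
    """
    prototypes = []
    for x, y in data:
        if not prototypes:
            # The first item always seeds the condensing set.
            prototypes.append([list(x), y, 1])
            continue
        i = nearest(prototypes, x)
        p, label, w = prototypes[i]
        if label != y:
            # Misclassified by 1-NN on the current set: keep it as a prototype.
            prototypes.append([list(x), y, 1])
        elif recenter:
            # Correctly classified: pull the nearest prototype toward the item.
            prototypes[i][0] = [(pj * w + xj) / (w + 1) for pj, xj in zip(p, x)]
            prototypes[i][2] = w + 1
    return prototypes

# Hypothetical toy data: two well-separated classes.
train = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((10.0, 10.0), "b"),
         ((11.0, 10.0), "b"), ((0.0, 1.0), "a")]

ib2_set = condense(train)                  # prototypes are original items
aib2_set = condense(train, recenter=True)  # prototypes drift to class centers
```

Because each item is seen once and only the current prototypes are consulted, later training data can be fed through the same loop without access to the items that were discarded earlier.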
