Performance Assessment of Learning Algorithms on Multi-Domain Data Sets

Author(s):  
Amit Kumar ◽  
Bikash Kanti Sarkar

Over the last few decades, data mining research has made significant progress across a wide spectrum of applications. Prediction on multi-domain data sets remains a challenging task due to the imbalanced, voluminous, conflicting, and complex nature of the data. Learning algorithms are the principal technique for addressing these problems and are widely used for classification. However, choosing the learners that perform best on data sets from a particular domain is itself a challenging task in data mining. This article provides a comparative performance assessment of various state-of-the-art learning algorithms over multi-domain data sets in order to identify effective classifier(s) for a particular domain, e.g., artificial, natural, or semi-natural. A total of 14 real-world data sets are selected from the University of California, Irvine (UCI) machine learning repository, and experiments are conducted using three competent individual learners and their hybrid combinations.
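The selection procedure the abstract describes can be sketched as a simple bake-off: fit each candidate learner on a domain's training data, score it on held-out data, and keep the winner. This is a minimal illustrative sketch, not the paper's experimental setup; the toy data, the two trivial learners, and the single train/test split are all assumptions for illustration.

```python
# Hypothetical sketch: pick the best-performing learner for a domain
# by comparing held-out accuracy. Learners are (fit -> predict) factories.

def majority_class_learner(train):
    """Predict the most frequent training label for every example."""
    labels = [y for _, y in train]
    mode = max(set(labels), key=labels.count)
    return lambda x: mode

def one_nn_learner(train):
    """1-nearest-neighbour by squared Euclidean distance."""
    def predict(x):
        _, y = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return y
    return predict

def accuracy(predict, test):
    return sum(predict(x) == y for x, y in test) / len(test)

# Tiny synthetic "domain": two separable clusters (3 A's, 2 B's).
train = [((0.0, 0.1), "A"), ((0.2, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.1), "B"), ((0.9, 1.0), "B")]
test = [((0.1, 0.0), "A"), ((1.1, 1.0), "B")]

learners = {"majority": majority_class_learner, "1-NN": one_nn_learner}
scores = {name: accuracy(fit(train), test) for name, fit in learners.items()}
best = max(scores, key=scores.get)
print(scores, best)  # 1-NN wins on this separable toy domain
```

In practice one would replace the single split with cross-validation and the toy learners with the competent individual learners and hybrids the article evaluates.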

2018 ◽  
Vol 26 (1) ◽  
pp. 43-66 ◽  
Author(s):  
Uday Kamath ◽  
Carlotta Domeniconi ◽  
Kenneth De Jong

Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the size of the dataset and enable efficient learning. Alternatively, one customizes learning algorithms to achieve scalability. In either case, the key challenge is to obtain algorithmic efficiency without compromising the quality of the results. In this article we discuss a meta-learning algorithm (PSBML) that combines concepts from spatially structured evolutionary algorithms (SSEAs) with concepts from ensemble and boosting methodologies to achieve the desired scalability property. We present both theoretical and empirical analyses which show that PSBML preserves a critical property of boosting, specifically, convergence to a distribution centered around the margin. We then present additional empirical analyses showing that this meta-level algorithm provides a general and effective framework that can be used in combination with a variety of learning classifiers. We perform extensive experiments to investigate the trade-off achieved between scalability and accuracy, and robustness to noise, on both synthetic and real-world data. These empirical results corroborate our theoretical analysis, and demonstrate the potential of PSBML in achieving scalability without sacrificing accuracy.


Author(s):  
Hoda Heidari ◽  
Andreas Krause

We study fairness in sequential decision making environments, where at each time step a learning algorithm receives data corresponding to a new individual (e.g. a new job application) and must make an irrevocable decision about him/her (e.g. whether to hire the applicant) based on observations made so far. In order to prevent cases of disparate treatment, our time-dependent notion of fairness requires algorithmic decisions to be consistent: if two individuals are similar in the feature space and arrive during the same time epoch, the algorithm must assign them to similar outcomes. We propose a general framework for post-processing predictions made by a black-box learning model, that guarantees the resulting sequence of outcomes is consistent. We show theoretically that imposing consistency will not significantly slow down learning. Our experiments on two real-world data sets illustrate and confirm this finding in practice.
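The consistency requirement above can be sketched as a post-processing wrapper around a black-box model. This is a hedged illustration of the idea only, not the authors' algorithm: within one time epoch, a new individual whose features fall within a distance `eps` of an already-decided individual simply inherits that earlier (irrevocable) outcome. The names `post_process` and `eps` are illustrative assumptions.

```python
# Sketch of time-dependent consistency: similar individuals arriving in
# the same epoch must receive similar (here: identical) outcomes.

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def post_process(stream, black_box, eps=0.1):
    """stream: iterable of (epoch, features); black_box: features -> outcome."""
    decided = []   # (epoch, features, outcome) of past irrevocable decisions
    outcomes = []
    for epoch, x in stream:
        match = next((o for e, f, o in decided
                      if e == epoch and dist(f, x) <= eps), None)
        outcome = match if match is not None else black_box(x)
        decided.append((epoch, x, outcome))
        outcomes.append(outcome)
    return outcomes

# A black box that thresholds the first feature. Two near-identical
# applicants arrive in epoch 0; the raw model would treat them differently,
# but the wrapper forces a consistent decision.
bb = lambda x: int(x[0] > 0.5)
stream = [(0, (0.51,)), (0, (0.49,)), (1, (0.49,))]
print(post_process(stream, bb, eps=0.05))  # [1, 1, 0]
```

Note the third individual arrives in a later epoch, so the consistency constraint no longer binds and the black-box decision stands.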


Author(s):  
YI-CHUNG HU

Flow-based methods based on outranking relation theory are extensively used in multiple criteria classification problems. These methods usually employ an overall preference index, representing the flow, to measure the intensity of preference for one pattern over another. A traditional flow obtained by pairwise comparison may be incomplete, since it considers the differences on each criterion only locally, between two patterns, rather than globally against all the other patterns. In contrast with traditional flows, the newly proposed relationship-based flow employs grey relational analysis to assess the flow from one pattern to another by considering the differences on each criterion between that pattern and all the other patterns. A genetic algorithm-based learning algorithm is designed to determine the relative weights of the respective criteria and derive the overall relationship index of a pattern. The method is tested on several real-world data sets, and its performance is comparable to that of other well-known classifiers and flow-based methods.
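Grey relational analysis, the building block named above, scores how close each pattern is to a reference pattern across all criteria at once. Below is a hedged sketch of the standard grey relational coefficient and grade; the fixed weights are an assumption for illustration, whereas in the paper the criterion weights are learned by a genetic algorithm.

```python
# Grey relational analysis (GRA) sketch: coefficient
#   xi(k) = (d_min + rho * d_max) / (d_i(k) + rho * d_max)
# with d_i(k) = |x0(k) - xi(k)|, aggregated into a weighted grade.

def grey_relational_grades(reference, patterns, weights, rho=0.5):
    """Grade of each pattern w.r.t. the reference (rho = distinguishing coefficient)."""
    deltas = [[abs(r - p) for r, p in zip(reference, pat)] for pat in patterns]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(w * c for w, c in zip(weights, coeffs)))
    return grades

ref = (0.9, 0.8)
pats = [(0.9, 0.8), (0.1, 0.2)]  # first pattern is identical to the reference
g = grey_relational_grades(ref, pats, weights=(0.5, 0.5))
print(g)  # the identical pattern attains the maximum grade of 1.0
```

A pattern identical to the reference attains the maximal grade, and grades decrease as criterion-wise differences grow, which is the global comparison the relationship-based flow exploits.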


Author(s):  
Lakshmi Prayaga ◽  
Krishna Devulapalli ◽  
Chandra Prayaga

Wearable devices are contributing heavily to the proliferation of data and creating a rich minefield for data analytics. Recent trends in the design of wearable devices include several embedded sensors, which also provide useful data for many applications. This research presents results obtained from studying human-activity data collected from wearable devices. The activities considered for this study were working at the computer; standing and walking; standing; walking; walking up and down stairs; and talking while walking. A portion of the data is used to train machine learning algorithms and build a model, and the rest is used as test data for predicting the activity of an individual. Details of data collection, processing, and presentation are also discussed. After studying the literature and the data sets, the Random Forest machine learning algorithm was determined to be the most applicable algorithm for analyzing data from wearable devices. The software used in this research includes the R statistical package and the SensorLog app.
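The Random Forest mechanism the study relies on can be illustrated in miniature: bootstrap-resample the training data, fit a simple randomized tree (here reduced to a one-feature threshold stump) on each resample, and classify by majority vote. This is a toy pure-Python sketch of the mechanism only; the study itself uses R's Random Forest implementation, and the one-feature sensor data below is a made-up stand-in for real accelerometer features.

```python
# Toy bagged-stump "forest": each stump thresholds feature 0 at the mean
# of a bootstrap sample; prediction is the majority vote of all stumps.
import random

def train_stump(data):
    """Threshold stump fit on a bootstrap resample of the training data."""
    boot = [random.choice(data) for _ in data]
    thr = sum(x[0] for x, _ in boot) / len(boot)
    return lambda x: "walk" if x[0] > thr else "sit"

def forest_predict(forest, x):
    votes = [stump(x) for stump in forest]
    return max(set(votes), key=votes.count)

random.seed(0)
# feature 0 = mean acceleration magnitude (illustrative values)
data = [((0.1,), "sit"), ((0.2,), "sit"), ((0.8,), "walk"), ((0.9,), "walk")]
forest = [train_stump(data) for _ in range(15)]
print(forest_predict(forest, (0.95,)), forest_predict(forest, (0.05,)))
```

A real Random Forest additionally randomizes the features considered at each split and grows full decision trees, but the bootstrap-plus-vote structure is the same.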


Author(s):  
Sotiris Kotsiantis ◽  
Dimitris Kanellopoulos ◽  
Panayotis Pintelas

In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples (see Table 1). Formally, the problem can be stated as follows: given training data {(x1, y1), …, (xn, yn)}, produce a classifier h: X → Y that maps an object x ∈ X to its classification label y ∈ Y. A large number of classification techniques have been developed based on artificial intelligence (logic-based techniques, perceptron-based techniques) and statistics (Bayesian networks, instance-based techniques). No single learning algorithm can uniformly outperform other algorithms over all data sets. The concept of combining classifiers has therefore been proposed as a new direction for improving the performance of individual machine learning algorithms. Numerous methods have been suggested for the creation of ensembles of classifiers (Dietterich, 2000). Although, or perhaps because, many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best.
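The simplest way to combine classifiers h_i: X → Y is plain majority voting over their predictions. The sketch below is a hypothetical minimal example of that combining step only; real ensemble methods such as bagging and boosting additionally vary the training data each base classifier sees.

```python
# Majority-vote combination of a pool of base classifiers.

def majority_vote(classifiers, x):
    votes = [h(x) for h in classifiers]
    return max(set(votes), key=votes.count)

# Three weak threshold rules that disagree on parts of the input space.
h1 = lambda x: "pos" if x > 0 else "neg"
h2 = lambda x: "pos" if x > -1 else "neg"
h3 = lambda x: "pos" if x > 1 else "neg"

print(majority_vote([h1, h2, h3], 0.5))   # h1 and h2 vote "pos"
print(majority_vote([h1, h2, h3], -0.5))  # h1 and h3 vote "neg"
```

The ensemble can be correct even where individual members err, which is the motivation for combining classifiers in the first place.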


Author(s):  
John Yearwood ◽  
Adil Bagirov ◽  
Andrei V. Kelarev

The application of machine learning algorithms to the analysis of DNA sequence data sets is very important. The present chapter is devoted to an experimental investigation of several machine learning algorithms applied to a JLA data set consisting of DNA sequences derived from non-coding segments in the junction of the large single copy region and inverted repeat A of the chloroplast genome in Eucalyptus, collected by Australian biologists. Data sets of this sort represent a new situation, where sophisticated alignment scores have to be used as a measure of similarity. The alignment scores do not satisfy the properties of the Minkowski metric, and new machine learning approaches have to be investigated. The authors' experiments show that machine learning algorithms based on local alignment scores achieve very good agreement with the known biological classes for this data set. A new machine learning algorithm based on graph partitioning performed best for clustering of the JLA data set, and the authors' novel k-committees algorithm produced the most accurate results for classification. Two new examples of synthetic data sets demonstrate that the k-committees algorithm can outperform both the Nearest Neighbour and k-medoids algorithms simultaneously.


The rapid development of cloud computing, big data, machine learning, and data mining has brought information technology and human society into a new era of technology. Statistical and mathematical analysis of data has given research a new way to make predictions and estimates from samples and data sets. Data mining is a mechanism that explores and analyzes large amounts of disorganized data to obtain potentially useful information and model it with different algorithms. Machine learning is an iterative rather than a linear process, requiring each step to be revisited as more is learned about the problem. We discuss different machine learning algorithms that can manipulate data and analyze data sets, chosen case by case for accurate results, and we design and implement a framework associated with these algorithms. This paper expounds the definition, models, development stages, classification, and commercial applications of machine learning, and emphasizes the role of machine learning in data mining by deploying the framework. It then summarizes and analyzes machine learning technology and discusses the use of machine learning algorithms in data mining. Finally, the mathematical analysis is given, along with results and graphical analysis.


2016 ◽  
Vol 41 (1) ◽  
Author(s):  
Manuel J. A. Eugster ◽  
Torsten Hothorn ◽  
Friedrich Leisch

Benchmark experiments are the method of choice for comparing learning algorithms empirically. For collections of data sets, the empirical performance distributions of a set of learning algorithms are estimated, compared, and ordered. Usually this is done for each data set separately. The present manuscript extends this single-data-set-based approach to a joint analysis of the complete collection, the so-called problem domain. This enables one to decide which algorithms to deploy in a specific application, or to compare newly developed algorithms with well-known algorithms on established problem domains. Specialized visualization methods allow easy exploration of huge amounts of benchmark data. Furthermore, we take the benchmark experiment design into account and use mixed-effects models to provide a formal statistical analysis. Two domain-based benchmark experiments demonstrate our methods: the UCI domain, as a well-known domain used when developing a new algorithm; and the Grasshopper domain, as a domain where we want to find the best learning algorithm for a prediction component in an enterprise application software system.
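One elementary way to move from per-data-set results to a domain-level ordering is to rank the algorithms on each data set and average the ranks across the whole domain. The sketch below illustrates only that aggregation step, not the manuscript's mixed-effects analysis; the data-set names and accuracy numbers are made up for illustration.

```python
# Domain-level ordering of algorithms by mean rank across data sets.

def mean_ranks(results):
    """results: {dataset: {algorithm: accuracy}} -> {algorithm: mean rank}."""
    algos = sorted(next(iter(results.values())))
    totals = {a: 0.0 for a in algos}
    for scores in results.values():
        ordered = sorted(algos, key=lambda a: -scores[a])  # rank 1 = best
        for rank, a in enumerate(ordered, start=1):
            totals[a] += rank
    return {a: t / len(results) for a, t in totals.items()}

# Illustrative (made-up) accuracies on a three-data-set "domain".
domain = {
    "iris":  {"svm": 0.96, "tree": 0.93, "knn": 0.95},
    "glass": {"svm": 0.70, "tree": 0.74, "knn": 0.68},
    "wine":  {"svm": 0.98, "tree": 0.91, "knn": 0.95},
}
print(mean_ranks(domain))
```

A mixed-effects model goes further by treating the data sets as random effects and the sampling within each benchmark experiment explicitly, so it yields significance statements rather than just an ordering.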

