Quantization of Continuous Data for Pattern Based Rule Extraction

Author(s):  
Andrew Hamilton-Wright ◽  
Daniel W. Stashuk

A great deal of interesting real-world data is encountered through the analysis of continuous variables; however, many of the robust tools for rule discovery and data characterization depend upon the underlying data existing in an ordinal, enumerable or discrete data domain. Tools that fall into this category include much of the current work in fuzzy logic and rough sets, as well as all forms of event-based pattern discovery tools based on probabilistic inference. Through the application of discretization techniques, continuous data is made accessible to the analysis provided by the strong tools of discrete-valued data mining. The most common approach to discretization is quantization, in which the range of observed continuous-valued data is divided among a fixed number of quanta, each of which covers a particular portion of the range within the bounds provided by the most extreme points observed in the continuous domain. This chapter explores the effects such quantization may have, and the techniques available to ameliorate its negative effects, notably fuzzy systems and rough sets.
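The fixed-quanta scheme described in this abstract can be sketched as follows (a minimal illustration only, assuming equal-width bins between the observed extremes; the function and variable names are ours, not the chapter's):

```python
def quantize(values, n_bins):
    """Equal-width quantization: map each continuous value to one of
    n_bins quanta spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        # Clamp so the maximum observed value falls in the last bin,
        # rather than in a phantom bin past the end of the range.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.append(idx)
    return bins

data = [0.1, 0.4, 0.35, 0.8, 0.95, 0.5]
print(quantize(data, 3))  # → [0, 1, 0, 2, 2, 1]
```

Note that the bin boundaries depend entirely on the two most extreme points observed, which is one source of the sensitivity the chapter discusses.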

2008 ◽  
Vol 2008 ◽  
pp. 1-11 ◽  
Author(s):  
Nicholas Holden ◽  
Alex A. Freitas

We have previously proposed a hybrid particle swarm optimisation/ant colony optimisation (PSO/ACO) algorithm for the discovery of classification rules. Unlike a conventional PSO algorithm, this hybrid algorithm can directly cope with nominal attributes, without converting nominal values into binary numbers in a preprocessing phase. PSO/ACO2 also directly deals with both continuous and nominal attribute values, a feature that current PSO and ACO rule induction algorithms lack. We evaluate the new version of the PSO/ACO algorithm (PSO/ACO2) on 27 public-domain, real-world data sets often used to benchmark the performance of classification algorithms. We compare the PSO/ACO2 algorithm to PART, an industry-standard algorithm, and compare a reduced version of our PSO/ACO2 algorithm, which copes only with continuous data, to our new classification algorithm for continuous data based on differential evolution. The results show that PSO/ACO2 is very competitive with PART in terms of accuracy and that PSO/ACO2 produces significantly simpler (smaller) rule sets, a desirable result in data mining, where the goal is to discover knowledge that is not only accurate but also comprehensible to the user. The results also show that the reduced PSO version for continuous attributes provides a slight increase in accuracy when compared to the differential evolution variant.


2000 ◽  
pp. 221-229
Author(s):  
Hiroyuki SAKAKIBARA ◽  
Kazumasa KURAMOTO ◽  
Hideaki KIKUCHI ◽  
Hirotaka NAKAYAMA ◽  
Hiromi TETSUGA ◽  
...  

Author(s):  
Wouter Koch ◽  
Peter Boer ◽  
Johannes IJ. Witte ◽  
Henk W. Van der Veer ◽  
David W. Thieltges

Ectoparasites, which attach mainly to the fins or gills, are a conspicuous part of the parasite fauna of marine fish. The abundant copepods have received much interest due to their negative effects on hosts. However, for many localities the copepod fauna of fish is still poorly known, and we know little about its temporal stability, as long-term observations are largely absent. Our study provides the first inventory of ectoparasitic copepods on fish from the western Wadden Sea (North Sea) based on field data from 1968 and 2010 and additional unpublished notes. In total, 47 copepod parasite species have been recorded on 52 fish host species to date. For two copepod species parasitizing the European flounder (Platichthys flesus), a quantitative comparison of infection levels between 1968 and 2010 was possible. Whereas Acanthochondria cornuta did not show a change in the relationship between host size and infection levels, Lepeophtheirus pectoralis shifted towards the infection of smaller hosts, with higher infection levels in 2010 compared to 1968. These differences probably reflect the biology of the species and the observed decrease in abundance and size of flounders during the last decades. The skin-infecting L. pectoralis can probably compensate for dwindling host abundance by infecting smaller fish and increasing its abundance per given host size. In contrast, the gill-cavity-inhabiting A. cornuta probably faces a spatial constraint (a fixed number of gill arches), limiting its abundance and setting a minimum for the host size necessary for infections.


Author(s):  
Sam Fletcher ◽  
Md Zahidul Islam

The ability to extract knowledge from data has been the driving force of Data Mining since its inception, and of statistical modeling long before even that. Actionable knowledge often takes the form of patterns, where a set of antecedents can be used to infer a consequent. In this paper we offer a solution to the problem of comparing different sets of patterns. Our solution allows comparisons between sets of patterns that were derived from different techniques (such as different classification algorithms), or made from different samples of data (such as temporal data or data perturbed for privacy reasons). We propose using the Jaccard index to measure the similarity between sets of patterns by converting each pattern into a single element within the set. Our measure focuses on providing conceptual simplicity, computational simplicity, interpretability, and wide applicability. The results of this measure are compared to prediction accuracy in the context of a real-world data mining scenario.
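The proposed measure can be sketched directly (a minimal illustration, assuming each pattern has been canonicalized into a single hashable element, e.g. a tuple of sorted antecedents plus the consequent; the rule contents below are illustrative only):

```python
def jaccard(patterns_a, patterns_b):
    """Jaccard index between two sets of patterns: |A ∩ B| / |A ∪ B|."""
    a, b = set(patterns_a), set(patterns_b)
    if not a and not b:
        return 1.0  # two empty pattern sets are identical by convention
    return len(a & b) / len(a | b)

# Each rule becomes one element of the set: (antecedents, consequent),
# with antecedents stored as a sorted tuple so equal rules compare equal.
rules_a = {(("humidity=high", "outlook=sunny"), "play=no"),
           (("outlook=overcast",), "play=yes")}
rules_b = {(("outlook=overcast",), "play=yes"),
           (("windy=true",), "play=no")}
print(jaccard(rules_a, rules_b))  # 1 shared rule out of 3 distinct → ~0.333
```

The measure's appeal, as the abstract notes, is that it is conceptually and computationally simple and applies regardless of which algorithm or data sample produced each rule set.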


Author(s):  
Benjamin Griffiths

Rough Set Theory (RST), since its introduction by Pawlak (1982), has continued to develop as an effective tool in data mining. Within a set-theoretical structure, its remit is closely concerned with the classification of objects to decision attribute values, based on their description by a number of condition attributes. In RST, this classification is performed through the construction of 'if .. then ..' decision rules. RST has developed in many directions; among the earliest was the allowance for misclassification in the constructed decision rules, namely the Variable Precision Rough Sets model (VPRS) (Ziarko, 1993), for which recent references include Beynon (2001), Mi et al. (2004), and Slezak and Ziarko (2005). Further developments of RST include its operation within a fuzzy environment (Greco et al., 2006) and a dominance-relation-based approach (Greco et al., 2004). The major regular international conferences, the 'International Conference on Rough Sets and Current Trends in Computing' (RSCTC, 2004) and the 'International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing' (RSFDGrC, 2005), continue to include RST research covering the varying directions of its development. This is true also for the associated book series 'Transactions on Rough Sets' (Peters and Skowron, 2005), which further includes doctoral theses on the subject. RST is still evolving, and the eclectic attitude to its development means that the definitive concomitant RST data mining techniques are still to be realised. Grzymala-Busse and Ziarko (2000), in a defence of RST, discussed a number of points relevant to data mining and made comparisons between RST and other techniques.
Within the area of data mining and the desire to identify relationships between condition attributes, the effectiveness of RST is particularly pertinent due to the intent, inherent in RST-type methodologies, of data reduction and feature selection (Jensen and Shen, 2005): subsets of condition attributes are identified that perform the same role as all the condition attributes in a considered data set (termed β-reducts in VPRS, see later). Chen (2001) addresses this when discussing the original RST, stating that it follows a reductionist approach and is lenient to inconsistent data (contradicting condition attributes, one aspect of underlying uncertainty). This encyclopaedia article describes and demonstrates the practical application of an RST-type methodology in data mining, namely VPRS, using nascent software initially described in Griffiths and Beynon (2005). VPRS, through its relatively simple structure, outlines many of the rudiments of RST-based methodologies. The software utilised is oriented towards 'hands-on' data mining, with graphs presented that clearly elucidate 'veins' of possible information identified from β-reducts, over different allowed levels of misclassification associated with the constructed decision rules (Beynon and Griffiths, 2004). Further findings are briefly reported from undertaking VPRS in a resampling environment, with leave-one-out and bootstrapping approaches adopted (Wisnowski et al., 2003). The importance of these results lies in the identification of the more influential condition attributes, pertinent to accruing the most effective data mining results.
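The core VPRS idea of tolerating a bounded level of misclassification can be sketched as follows (a minimal sketch of a β-lower approximation only, not the Griffiths–Beynon software; the data, attribute names, and threshold are all hypothetical):

```python
from collections import defaultdict

def beta_positive_region(objects, condition_attrs, decision, beta):
    """VPRS-style beta-lower approximation: an equivalence class (objects
    sharing values on all condition_attrs) is assigned to a decision value
    if the proportion of its members with that value is at least beta,
    i.e. the class may be used in a rule with at most 1-beta error."""
    classes = defaultdict(list)
    for obj in objects:
        key = tuple(obj[a] for a in condition_attrs)
        classes[key].append(obj)
    positive = []
    for members in classes.values():
        counts = defaultdict(int)
        for m in members:
            counts[m[decision]] += 1
        hits = max(counts.values())          # majority decision value
        if hits / len(members) >= beta:
            positive.extend(members)
    return positive

data = [
    {"temp": "high", "wind": "weak",   "flu": "yes"},
    {"temp": "high", "wind": "weak",   "flu": "yes"},
    {"temp": "high", "wind": "weak",   "flu": "no"},   # inconsistent class
    {"temp": "low",  "wind": "strong", "flu": "no"},
    {"temp": "low",  "wind": "strong", "flu": "no"},
]
# At beta = 0.75 the inconsistent class (2/3 majority) is excluded;
# classical RST corresponds to beta = 1.0, excluding it as well.
print(len(beta_positive_region(data, ["temp", "wind"], "flu", 0.75)))  # → 2
```

A β-reduct would then be a minimal subset of condition attributes whose β-positive region matches that of the full attribute set; varying β traces out the 'veins' of information the article's graphs depict.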


Author(s):  
Longbing Cao ◽  
Chengqi Zhang

Traditional data mining based on quantitative intelligence is facing grand challenges from real-world enterprise and cross-organization applications. For instance, the usual demonstration of specific algorithms cannot support business users in taking actions suited to their advantage and needs. We think this is due to a data-driven philosophy focused on quantitative intelligence: it either views data mining as an autonomous, data-driven, trial-and-error process, or analyzes business issues only in an isolated, case-by-case manner. Based on experience and lessons learnt from real-world data mining and complex systems, this article proposes a practical data mining methodology referred to as Domain-Driven Data Mining. On top of quantitative intelligence and hidden knowledge in data, domain-driven data mining aims to meta-synthesize quantitative and qualitative intelligence in mining complex applications in which humans are in the loop. It targets actionable knowledge discovery in a constrained environment for satisfying user preferences. The domain-driven methodology consists of key components including understanding the constrained environment, business-technical questionnaires, representing and involving domain knowledge, human-mining cooperation and interaction, constructing next-generation mining infrastructure, in-depth pattern mining and postprocessing, business interestingness and actionability enhancement, and loop-closed, human-cooperated iterative refinement. Domain-driven data mining complements the data-driven methodology; the metasynthesis of qualitative and quantitative intelligence has the potential to discover knowledge from complex systems and to enhance knowledge actionability for practical use by industry and business.


Author(s):  
Dries Verlet ◽  
Carl Devos

Although policy evaluation has always been important, attention to policy evaluation in the public sector is rising today. In order to provide a solid base for so-called evidence-based policy, valid and reliable data are needed to depict the performance of organisations within the public sector. Without a solid empirical base, one needs to be very careful with data mining in the public sector. When measuring performance, several unintended and negative effects can occur. In this chapter, the authors focus on a few common pitfalls that occur when measuring performance in the public sector. They also discuss possible strategies to prevent them by setting up and adjusting the right measurement systems for performance in the public sector. Data mining is about knowledge discovery. The question is: what do we want to know? And what are the consequences of asking that question?
