Encyclopedia of Data Warehousing and Mining, Second Edition
Latest Publications

Total documents: 324 (last five years: 0)
H-index: 7 (last five years: 0)
Published by: IGI Global
ISBN: 9781605660103, 9781605660110

Author(s):  
Ling Feng

The discovery of association rules from large amounts of structured or semi-structured data is an important data mining problem [Agrawal et al. 1993, Agrawal and Srikant 1994, Miyahara et al. 2001, Termier et al. 2002, Braga et al. 2002, Cong et al. 2002, Braga et al. 2003, Xiao et al. 2003, Maruyama and Uehara 2000, Wang and Liu 2000]. It has crucial applications in decision support and marketing strategy. The prototypical application of association rules is market basket analysis using transaction databases from supermarkets. These databases contain sales transaction records, each of which details the items bought by a customer in that transaction. Mining association rules is the process of discovering knowledge such as "80% of customers who bought diapers also bought beer, and 35% of customers bought both diapers and beer", which can be expressed as "diaper → beer" (35%, 80%), where 80% is the confidence level of the rule, and 35% is the support level of the rule, indicating how frequently customers bought both diapers and beer. In general, an association rule takes the form X → Y (s, c), where X and Y are sets of items, and s and c are support and confidence, respectively. In the XML era, mining association rules is confronted with more challenges than in the traditional well-structured world due to the inherent flexibility of XML in both structure and semantics [Feng and Dillon 2005]. First, XML data has a more complex hierarchical structure than a database record. Second, elements in XML data have contextual positions and thus carry a notion of order. Third, XML data tends to be much larger than traditional data. To address these challenges, the classic association rule mining framework originating with transactional databases needs to be re-examined.
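To make the two measures concrete, the following is a minimal Python sketch (the baskets and the helper name `rule_metrics` are made-up illustrations, not from the chapter) that computes support and confidence for a candidate rule X → Y directly from a list of transactions.

```python
def rule_metrics(transactions, X, Y):
    """Compute support and confidence for the rule X -> Y.

    transactions: list of sets, one set of items per customer basket.
    X, Y: sets of items forming the rule's antecedent and consequent.
    """
    n = len(transactions)
    count_x = sum(1 for t in transactions if X <= t)          # baskets containing X
    count_xy = sum(1 for t in transactions if (X | Y) <= t)   # baskets containing X and Y
    support = count_xy / n                                    # fraction with both X and Y
    confidence = count_xy / count_x if count_x else 0.0       # estimate of P(Y | X)
    return support, confidence

# Illustrative (made-up) baskets: support is the share of all baskets holding
# {diaper, beer}; confidence is the share of diaper baskets that also hold beer.
baskets = [{"diaper", "beer", "milk"}, {"diaper", "beer"}, {"diaper", "bread"},
           {"beer"}, {"milk", "bread"}]
print(rule_metrics(baskets, {"diaper"}, {"beer"}))  # -> (0.4, 0.666...)
```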


Author(s):  
Bamshad Mobasher

In the span of a decade, the World Wide Web has been transformed from a tool for information sharing among researchers into an indispensable part of everyday activities. This transformation has been characterized by an explosion of heterogeneous data and information available electronically, as well as increasingly complex applications driving a variety of systems for content management, e-commerce, e-learning, collaboration, and other Web services. This tremendous growth, in turn, has necessitated the development of more intelligent tools for end users as well as information providers in order to more effectively extract relevant information or discover actionable knowledge. From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident. Web mining (i.e., the application of data mining techniques to extract knowledge from Web content, structure, and usage) is the collection of technologies used to fulfill this potential. In this article, we briefly summarize each of the three primary areas of Web mining—Web usage mining, Web content mining, and Web structure mining—and discuss some of the primary applications in each area.


Author(s):  
Malcolm J. Beynon

The seminal work of Zadeh (1965), namely fuzzy set theory (FST), has developed into a methodology fundamental to analysis that incorporates vagueness and ambiguity. With respect to data mining, it endeavours to find potentially meaningful patterns in data (Hu & Tzeng, 2003). This includes the construction of if-then decision rule systems, which aim to bring a level of inherent interpretability to the antecedents and consequents identified for object classification (see Breiman, 2001). Within a fuzzy environment this is extended to allow a linguistic facet to the possible interpretation, with examples including the mining of time series data (Chiang, Chow, & Wang, 2000) and multi-objective optimisation (Ishibuchi & Yamamoto, 2004). One approach to if-then rule construction has been through the use of decision trees (Quinlan, 1986), where the path down a branch of a decision tree (through a series of nodes) is associated with a single if-then rule. A key characteristic of traditional decision tree analysis is that the antecedents described in the nodes are crisp; this restriction is mitigated when operating in a fuzzy environment (Crockett, Bandar, Mclean, & O'Shea, 2006). This chapter investigates the use of fuzzy decision trees as an effective tool for data mining. Pertinent to data mining and decision making, Mitra, Konwar and Pal (2002) succinctly describe a most important feature of decision trees, crisp and fuzzy: their capability to break down a complex decision-making process into a collection of simpler decisions, thereby providing an easily interpretable solution.
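To contrast crisp and fuzzy antecedents concretely, the sketch below is a hedged illustration (the triangular membership function, the attribute names and the thresholds are assumptions, not the chapter's method): a node test such as "age is middle-aged" returns a degree in [0, 1] rather than a true/false split, and the firing strength of one rule path is taken as the minimum of its node memberships.

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def rule_firing_strength(memberships):
    """Strength of one if-then rule: minimum of its antecedent memberships."""
    return min(memberships)

# A crisp node would test "age < 45" as True/False; the fuzzy node instead
# returns the degree to which age is "middle-aged".
age, income = 41, 52_000
mu_middle_aged = triangular(age, 30, 45, 60)                    # ~0.73
mu_medium_income = triangular(income, 30_000, 55_000, 80_000)   # ~0.88
print(rule_firing_strength([mu_middle_aged, mu_medium_income]))  # ~0.73
```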


Author(s):  
Pasquale De Meo, Diego Plutino, Giovanni Quattrone, Domenico Ursino

In this chapter we present a system for the management of team building and team update activities in the current human resource management scenario. The proposed system has three characteristics that are particularly relevant in this scenario. Firstly, it exploits a suitable standard to uniformly represent and handle expert skills. Secondly, it is highly distributed and, therefore, well suited to the typical organization of the current job market, in which consulting firms intermediate most job positions. Finally, it considers not only experts' technical skills but also their social and organizational capabilities, as well as the degree of affinity they may have shown when working together in the past.


Author(s):  
Barak Chizi, Lior Rokach, Oded Maimon

Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most data mining algorithms (Maimon and Last, 2000). The main reason for this is that data mining algorithms are computationally intensive. This obstacle is sometimes known as the "curse of dimensionality" (Bellman, 1961). The objective of feature selection is to identify the important features in the data set and discard all other features as irrelevant or redundant information. Since feature selection reduces the dimensionality of the data, data mining algorithms can run faster and more effectively on the reduced data. In some cases, feature selection also improves the performance of the data mining method, mainly because it yields a more compact, easily interpreted representation of the target concept. There are three main approaches to feature selection: filter, wrapper, and embedded. The filter approach (Kohavi, 1995; Kohavi and John, 1996) operates independently of the data mining method employed subsequently: undesirable features are filtered out of the data before learning begins, using heuristics based on general characteristics of the data to evaluate the merit of feature subsets. A sub-category of filter methods, referred to here as rankers, comprises methods that employ some criterion to score each feature and provide a ranking; from this ordering, several feature subsets can be chosen by manually setting a cut-off point. The wrapper approach (Kohavi, 1995; Kohavi and John, 1996) uses an inducer as a black box, along with a statistical re-sampling technique such as cross-validation, to select the best feature subset according to some predictive measure. The embedded approach (see, for instance, Guyon and Elisseeff, 2003) is similar to the wrapper approach in the sense that the features are specifically selected for a certain inducer, but it selects the features during the learning process itself.
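As a concrete illustration of the ranker sub-category of filter methods, the sketch below is a minimal, hedged example (the correlation-based scoring criterion, the helper name `rank_features`, and the toy data are assumptions, not part of the chapter): each feature is scored by its absolute Pearson correlation with the class label, and the top-k features are kept before any learner is run.

```python
import numpy as np

def rank_features(X, y, k):
    """Filter-style ranker: score each column of X by |Pearson correlation| with y,
    then return the indices of the k best-scoring features (best first)."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0:                 # constant feature: correlation undefined
            scores.append(0.0)
        else:
            scores.append(abs(np.corrcoef(col, y)[0, 1]))
    ranking = np.argsort(scores)[::-1]
    return ranking[:k], np.array(scores)[ranking]

# Toy data: feature 0 tracks the label, feature 1 is noise, feature 2 is anti-correlated.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y + 0.3 * rng.normal(size=200),
                     rng.normal(size=200),
                     -y + 0.3 * rng.normal(size=200)])
selected, scores = rank_features(X, y, k=2)
print(selected, scores)   # expected: features 0 and 2 outrank the noise feature
```

Choosing k (the cut-off point) is exactly the manual step mentioned above; a wrapper would instead pick the subset by cross-validating a specific inducer.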


Author(s):  
Mohammad Al Hasan

The research on mining interesting patterns from transactional or scientific datasets has matured over the last two decades. At present, numerous algorithms exist to mine patterns of varying complexity, such as sets, sequences, trees, graphs, etc. Collectively, they are referred to as Frequent Pattern Mining (FPM) algorithms. FPM is useful in most of the prominent knowledge discovery tasks, like classification, clustering, and outlier detection, and can further be used in database tasks, like indexing and hashing, when storing a large collection of patterns. But the usage of FPM in real-life knowledge discovery systems is considerably lower than its potential. The prime reason is the lack of interpretability caused by the enormity of the output-set size. For instance, a moderately sized graph dataset with merely a thousand graphs can produce millions of frequent graph patterns at a reasonable support value. This is expected, due to the combinatorial search space of pattern mining. However, classification, clustering, and other similar knowledge discovery tasks should not use that many patterns as their knowledge nuggets (features), as doing so would increase the time and memory complexity of the system. Moreover, it can cause a deterioration of task quality because of the well-known "curse of dimensionality" effect. So, in recent years, researchers have felt the need to summarize the output set of FPM algorithms, so that the summary set is small, non-redundant, and discriminative. There are different summarization techniques: lossless, profile-based, cluster-based, statistical, etc. In this article, we overview the main concepts of these summarization techniques, with a comparative discussion of their strengths, weaknesses, applicability, and computational cost.
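One of the simplest lossless summaries keeps only the closed frequent patterns, i.e. patterns with no proper superset of equal support. The sketch below is a hedged illustration over itemsets (the FPM output and the helper name `closed_itemsets` are made up; the mining step itself is assumed to have been done already): it filters a dictionary of frequent itemsets and their supports down to the closed ones.

```python
def closed_itemsets(frequent):
    """Keep only closed itemsets: those with no proper superset of equal support.

    frequent: dict mapping frozenset(items) -> support count.
    The result is lossless with respect to support: any frequent itemset's support
    equals the maximum support among its closed supersets."""
    closed = {}
    for itemset, support in frequent.items():
        has_equal_superset = any(
            itemset < other and support == other_support
            for other, other_support in frequent.items()
        )
        if not has_equal_superset:
            closed[itemset] = support
    return closed

# Toy FPM output with made-up supports.
frequent = {
    frozenset({"a"}): 5,
    frozenset({"b"}): 4,
    frozenset({"a", "b"}): 4,      # same support as {b}, so {b} is not closed
    frozenset({"a", "b", "c"}): 2,
}
print(closed_itemsets(frequent))   # keeps {a}:5, {a,b}:4, {a,b,c}:2
```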


Author(s):  
Alexander Thomasian

Data storage requirements have consistently increased over time. According to the latest WinterCorp survey (http://www.WinterCorp.com), "The size of the world's largest databases has tripled every two years since 2001." With database sizes in excess of 1 terabyte, there is a clear need for storage systems that are both cost effective and highly reliable. Historically, large databases have been implemented on mainframe systems, which are large and expensive to purchase and maintain. In recent years, large data warehouse applications have been deployed on Linux and Windows hosts as replacements for existing mainframe systems. These systems are significantly less expensive to purchase while requiring fewer resources to run and maintain. With large databases it is less feasible, and less cost effective, to use tapes for backup and restore: the time required to copy terabytes of data from a database to a serial medium (streaming tape) is measured in hours, which would significantly degrade performance and decrease availability. Alternatives to serial backup include local replication, mirroring, or geoplexing of data. The increasing demands of larger databases must be met by less expensive disk storage systems that are nevertheless highly reliable and less susceptible to data loss. This article is organized into five sections. The first section provides background information that introduces the concepts of disk arrays. The following three sections detail the concepts used to build complex storage systems: (i) Redundant Arrays of Independent Disks (RAID); (ii) multilevel RAID (MRAID); (iii) concurrency control and storage transactions. The conclusion contains a brief survey of modular storage prototypes.
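To make the RAID redundancy idea concrete, here is a simplified, hedged sketch (not the article's design, and deliberately stripped of striping and layout details): in a RAID-5-style stripe the parity block is the bitwise XOR of the data blocks, so the contents of any single failed disk can be reconstructed from the surviving blocks.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# One stripe: three data blocks plus one parity block (illustrative values).
data = [b"disk0da", b"disk1db", b"disk2dc"]
parity = xor_blocks(data)

# Simulate losing disk 1: rebuild its block from the remaining data and parity.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
print(recovered)   # b'disk1db'
```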


Author(s):  
Chrisa Tsinaraki

Several consumer electronic devices that allow capturing digital multimedia content (like mp3 recorders, digital cameras, DVD camcorders, smart phones etc.) are available today. These devices have allowed both the amateur and the professional users to produce large volumes of digital multimedia material, which, together with the traditional media objects digitized recently (using scanners, audio and video digitization devices) form a huge distributed multimedia information source. The multimedia material that is available today is usually organized in independent multimedia information sources, developed on top of different software platforms. The Internet, the emergence of advanced network infrastructures that allow for the fast, efficient and reliable transmission of multimedia content and the development of digital multimedia content services on top of them form an open multimedia consumption environment. In this environment, the users access the multimedia material either through computers or through cheap consumer electronic devices that allow the consumption and management of multimedia content. The users of such an open environment need to be able to access the services offered by the different vendors in a transparent way and to be able to compose the different atomic services (like, for example, multimedia content filtering) into new, composite ones. In order to fulfill this requirement, interoperability between the multimedia content services offered is necessary. Interoperability is achieved, at the syntactic level, through the adoption of standards. At the semantic level, interoperability is achieved through the integration of domain knowledge expressed in the form of domain ontologies. An ontology is a logical theory accounting for the intended meaning of a formal vocabulary, i.e. its ontological commitment to a particular conceptualization of the world (Guarino, 1998). The standard that dominates in multimedia content description is the MPEG-7 (Salembier, 2001), formally known as Multimedia Content Description Interface. It supports multimedia content description from several points of view, including media information, creation information, structure, usage information, textual annotations, media semantics, and low-level visual and audio features. Since the MPEG-7 allows the structured description of the multimedia content semantics, rich and accurate semantic descriptions can be created and powerful semantic retrieval and filtering services can be built on top of them. It has been shown, in our previous research (Tsinaraki, Fatourou and Christodoulakis, 2003), that domain ontologies capturing domain knowledge can be expressed using pure MPEG-7 constructs. This way, domain knowledge can be integrated in the MPEG-7 semantic descriptions. The domain knowledge is subsequently utilized for supporting semantic personalization, retrieval and filtering and has been shown to enhance the retrieval precision (Tsinaraki, Polydoros and Christodoulakis, 2007). Although multimedia content description is now standardized through the adoption of the MPEG-7 and semantic multimedia content annotation is possible, multimedia content retrieval and filtering (especially semantic multimedia content retrieval and filtering), which form the basis of the multimedia content services, are far from being successfully standardized.


Author(s):  
Ping Deng, Qingkai Ma, Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most well-known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on all dimensions. As the dimensionality of the space grows, these algorithms lose their efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances over all dimensions is not meaningful in high dimensional space, since the distance of a point to its nearest neighbor approaches its distance to the farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in those dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensionality. Projected clustering has recently been proposed to deal effectively with high dimensionality. Finding clusters and their relevant dimensions is the objective of projected clustering algorithms: instead of projecting the entire dataset onto one common subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
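The nearest- versus farthest-neighbor observation of Beyer et al. (1999) is easy to see numerically. The sketch below is a hedged simulation on random uniform data (the helper name `contrast_ratio` and the dimensions tried are illustrative assumptions, not from the chapter): the ratio of the farthest to the nearest distance from a query point shrinks toward 1 as dimensionality grows, which is what undermines full-dimensional clustering.

```python
import numpy as np

def contrast_ratio(dim, n_points=1000, seed=0):
    """Farthest-to-nearest distance ratio from one query point to a random cloud."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))          # uniform points in the unit cube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(contrast_ratio(dim), 2))
# Typical behavior: the ratio is large in 2-D and approaches 1 as dim grows,
# so "nearest" loses its meaning and clusters are better sought in subspaces.
```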


Author(s):  
Minh Ngoc Ngo

Due to the need to reengineer and migrate aging software and legacy systems, reverse engineering has started to receive some attention. It has now been established as an area of software engineering aimed at understanding software structure and at recovering or extracting design and features from programs, mainly from source code. The inference of design and features from code closely resembles data mining, which extracts and infers information from data. In view of this similarity, reverse engineering from program code can be called program mining. Traditionally, program mining has been based mainly on invariant properties and heuristic rules. Recently, empirical properties have been introduced to augment the existing methods. This article summarizes some of the work in this area.

